Large-Scale Text Analysis: Measuring Latin Variation in a Million Books
Perseus Digital Library, Tufts University
April 14, 2011
Olin 107
With the rise of massive digitization efforts such as those run by the Internet Archive and Google Books, we are now gaining access to an unprecedented amount of historical material in machine-actionable form. While services such as the Google Ngram Viewer and the Victorian Books project are beginning to reveal the power of this information for research into cultural trends and the history of ideas in English, texts in Latin present a possibly greater opportunity, since they span more than two thousand years: Latin was the primary language of the Roman Empire and later the lingua franca of Martin Luther, Galileo, Kepler, Newton, Thomas Hobbes and others.

In this talk I will describe three strands of research for exploiting this deep historical data to measure variation in Latin usage: (1) using automatic methods to identify and extract the Latin texts from the much larger million-book collections in order to measure the rise and fall of individual words and phrases; (2) leveraging dependency treebanks developed by undergraduate and graduate students in Classics to uncover syntactic variation (such as changing distributions of word order); and (3) using parallel texts from this collection to automatically discover the rise of new senses over time (such as oratio as "prayer" or miles as a medieval "knight"). Taken together, these strands provide the foundation for cultivating a view of Latin less as a monolithic language defined by a canonical grammar and more as the sum of individual usage that varies widely across genre, time and space.

Bio: David Bamman is a senior researcher in computational linguistics for the Perseus Project, focusing especially on natural language processing for Latin and Greek, including treebank construction, computational lexicography, morphological tagging and word sense disambiguation. David received a BA in Classics from the University of Wisconsin-Madison and an MA in Applied Linguistics from Boston University.

