Collaborative project pushes discovery in humanities, computer sciences

“Visualizing English Print” uses Mellon Foundation funding to advance our understanding of literature and data visualization

There are hundreds of millions of books in the world, a collection so stupendously large that even the most well-read among us can’t hope to make a dent.

This means our ability to make connections and spot trends across the literary landscape is inherently limited. Sure, we can pick out ideas from a particular slice of the print archive — the qualities that typify a Shakespearean comedy, for example. But what about comparing, say, “The Taming of the Shrew” with every other text penned over the following 150 years?

Impossible — but maybe not for much longer.

Thanks to the work of an interdisciplinary team of researchers at the University of Wisconsin-Madison, we may soon have tools that transform our understanding of English literature.

“People like to talk in humanities disciplines about print culture. We’re actually trying to find out what print culture really was — not how you can generalize about an idea of what was in print based on the books you’ve read,” says Associate Professor of English Robin Valenza, the principal investigator of the “Visualizing English Print, 1470-1800” project, an endeavor that links the humanities and computer sciences in mutual discovery.


TextViewer, one of several software programs used in the VEP project, uses the words of a document as a scaffold for displaying the output of statistical classifiers. In this screenshot of “Romeo and Juliet,” words that typically appear in Shakespeare’s comedies are highlighted in blue, while those that do not are highlighted in red. The overview on the right shows that the current passage (the famous balcony scene) represents a spike of “comedic” language within the play.
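
The idea behind that kind of word-level highlighting can be illustrated with a small sketch. The Python snippet below is not the project’s classifier; it assumes a hypothetical list of “comedy-associated” words and a simple sliding-window score, just to show how per-word labels can drive both the inline coloring and an overview spike plot.

```python
# Minimal sketch of classifier-driven word highlighting, in the spirit of the
# TextViewer description above. The marker words and scoring scheme here are
# illustrative assumptions, not the project's actual classifier.

import re
from collections import deque

# Hypothetical marker words: terms an (assumed) classifier might associate
# with Shakespeare's comedies rather than his other plays.
COMEDY_MARKERS = {"love", "light", "sun", "fair", "sweet", "kiss", "moon"}

def tokenize(text):
    """Lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def tag_words(tokens):
    """Label each token 'comedy' (blue in the screenshot) or 'other' (red)."""
    return [(tok, "comedy" if tok in COMEDY_MARKERS else "other") for tok in tokens]

def rolling_comedy_score(tagged, window=20):
    """Fraction of comedy-tagged words in a trailing window -- roughly the
    kind of signal an overview spike plot could be built from."""
    scores = []
    recent = deque(maxlen=window)
    for _, label in tagged:
        recent.append(1 if label == "comedy" else 0)
        scores.append(sum(recent) / len(recent))
    return scores

if __name__ == "__main__":
    passage = ("But soft, what light through yonder window breaks? "
               "It is the east, and Juliet is the sun.")
    tagged = tag_words(tokenize(passage))
    print(tagged[:6])
    print(max(rolling_comedy_score(tagged, window=10)))
```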

Professor of Computer Sciences Michael Gleicher is a co-principal investigator on the project, along with Jonathan Hope, a professor of English at the University of Strathclyde, Glasgow. “Visualizing English Print” began when former UW-Madison English Professor Michael Witmore, now the director of the Folger Shakespeare Library and a senior academic researcher on the project, came to Gleicher in early 2010 in search of better tools for analyzing Shakespeare texts.

Gleicher, who at the time was predominantly collaborating with biologists on gene sequencing visualization, was intrigued by the prospect of applying his work to text.

“There are a lot of sequences in biology,” Gleicher says. “A book is just a sequence of words.”

But Gleicher didn’t want to limit the scope to just Shakespeare. Neither did Witmore nor Valenza, who took over the lead on the project in early 2011. After all, a large portion of the known surviving books from the English print archive, from the introduction of the printing press in the 1470s to 1800, has been digitized. The group had a vast dataset with which to work.

Their efforts received a considerable boost when they landed a pilot grant of more than $420,000 from the Andrew W. Mellon Foundation in August 2011. The Mellon Foundation has subsequently awarded the project another grant, for $924,000, starting this July.

When completed, the group’s software will allow users to view trends in the archive as a whole — word usage or the prevalence of a given topic, for example — and then jump into individual texts to see how those trends manifest themselves in specific passages. This will allow scholars to spot patterns that have previously escaped detection, such as when spelling became standardized. According to the group’s data, that happened between 1640 and 1650. Previously, the widely held belief was that it had occurred in 1755 with the publication of “Johnson’s Dictionary.”
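
As an illustration of how a trend like spelling standardization might be surfaced from a dated corpus, here is a hedged sketch. The variant pairs and the (year, text) corpus format are assumptions for demonstration only, not the project’s actual method or data.

```python
# Sketch of measuring a spelling-standardization trend across a dated corpus:
# track the share of older variants (e.g. "doe" vs. "do") per decade and look
# for the decade where that share collapses. The variant pairs and corpus
# format are illustrative assumptions.

import re
from collections import defaultdict

VARIANT_PAIRS = [("doe", "do"), ("bee", "be"), ("onely", "only")]  # assumed examples

def variant_share_by_decade(corpus):
    """corpus: iterable of (year, text). Returns {decade: old-spelling share}."""
    old_counts = defaultdict(int)
    all_counts = defaultdict(int)
    for year, text in corpus:
        decade = (year // 10) * 10
        tokens = re.findall(r"[a-z]+", text.lower())
        for old, new in VARIANT_PAIRS:
            old_counts[decade] += tokens.count(old)
            all_counts[decade] += tokens.count(old) + tokens.count(new)
    return {d: old_counts[d] / all_counts[d]
            for d in sorted(all_counts) if all_counts[d]}

if __name__ == "__main__":
    toy_corpus = [
        (1635, "I doe not thinke it, bee it so, onely stay"),
        (1655, "I do not think it, be it so, only stay"),
    ]
    for decade, share in variant_share_by_decade(toy_corpus).items():
        print(decade, round(share, 2))
```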

“Our priority in this is to give you more information about individual books as well as giving you the big picture,” Valenza says. “You can read more deeply because each word has more information attached to it.”

That multi-scale nature presents challenges from a computer sciences perspective. So, too, does working with “dirty data” — outdated spellings and low-quality scans, for instance — and the sheer volume of text in the archive. And the software has to be intuitive enough that it doesn’t require a Ph.D. in statistics to make sense of it.
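
To give a sense of what “dirty data” means for early modern print, here is a minimal, assumption-laden sketch of spelling normalization. It covers only a couple of well-known typographic conventions of the period (the long s, “vv” printed for “w”); the project’s actual cleanup pipeline is certainly more involved.

```python
# Minimal sketch of old-spelling cleanup for early modern English text.
# Only two simple character-level conventions are handled here; trickier
# regularizations (interchangeable u/v and i/j, variable word endings)
# need context and are deliberately left out of this illustration.

OLD_SPELLING_RULES = [
    ("ſ", "s"),   # long s printed where modern text uses s
    ("vv", "w"),  # "vv" commonly printed for w
]

def normalize(text):
    """Apply simple character-level substitutions to old-spelling text."""
    out = text.lower()
    for old, new in OLD_SPELLING_RULES:
        out = out.replace(old, new)
    return out

if __name__ == "__main__":
    print(normalize("VVhat ſeruice is here, vnto the kingdome?"))
```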

The VEP team used a software program called TextDNA to visualize data from Google Books. This screenshot shows a visualization of word popularity by decade from the 1660s to the 2000s. Each row represents the top 1,000 most popular words from that decade, with decades ordered from top to bottom in ascending chronology. Words are ordered by their relative popularity within each decade (leftmost are the most common for that decade, rightmost are the 1,000th most common) and colored according to their relative popularity in the decade of the 2000s (dark purple is the most popular in the 2000s, light purple is the 1,000th most popular, and orange words are not in the 1,000 most popular words in the 2000s).

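The data transform behind a view like that can be sketched briefly. The snippet below assumes per-decade word counts as input and shows one plausible way to compute the rank rows and the reference-decade coloring described in the caption; it is illustrative only, not the TextDNA implementation.

```python
# Sketch of the data behind a TextDNA-style view: for each decade, rank the
# top-N words by frequency, then color each cell by that word's rank in a
# reference decade (here the 2000s). The input format and color mapping are
# assumptions for illustration, not the project's actual pipeline.

from collections import Counter

TOP_N = 1000

def rank_rows(freqs_by_decade, reference_decade, top_n=TOP_N):
    """freqs_by_decade: {decade: Counter of word -> count}.
    Returns {decade: [(word, color_value_or_None), ...]} where color_value is
    the word's rank in the reference decade (None = orange, i.e. absent)."""
    ref_rank = {w: i for i, (w, _) in
                enumerate(freqs_by_decade[reference_decade].most_common(top_n))}
    rows = {}
    for decade, counts in sorted(freqs_by_decade.items()):
        row = []
        for word, _ in counts.most_common(top_n):
            row.append((word, ref_rank.get(word)))  # None -> not in reference top-N
        rows[decade] = row
    return rows

if __name__ == "__main__":
    toy = {
        1660: Counter({"the": 50, "thee": 30, "and": 25}),
        2000: Counter({"the": 90, "and": 60, "of": 40}),
    }
    print(rank_rows(toy, reference_decade=2000, top_n=3))
```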

Gleicher decided early on that his lab would need to build new tools for the project, rather than merely modifying existing ones intended for use by mathematicians and scientists. He and his graduate students have created a variety of applications since the project began, but he says they’re now at the point where they have a firm understanding of “the hard questions that we need to be able to answer.”

Eventually, the group plans to make its software and datasets freely available online, and the tools could potentially be used to analyze other texts — news stories, perhaps, or even tweets. The first step toward that could come later this year, when the group hopes to begin releasing its material to the general public.

It will be the culmination of a truly collaborative, interdisciplinary effort.

“This is not us building tools for them. This is us learning from each other,” Gleicher says. “It really is about us infusing each other with new ways of thinking.”


To comment or view this article online, see http://news.ls.wisc.edu/humanities-the-arts/collaborative-project-pushes-discovery-in-humanities-computer-sciences/