Project Description

The goal of the nora project is to produce software for discovering, visualizing, and exploring significant patterns across large collections of full-text humanities resources in existing digital libraries.

In search-and-retrieval, we bring specific queries to collections of text and get back (more or less useful) answers to those queries; by contrast, the goal of data-mining (including text-mining) is to produce new knowledge by exposing unanticipated similarities or differences, clustering or dispersal, co-occurrence and trends.  Over the last decade, many millions of dollars have been invested in creating digital library collections: at this point, terabytes of full-text humanities resources are publicly available on the web. Those collections, dispersed across many different institutions, are large enough and rich enough to provide an excellent opportunity for text-mining, and we believe that web-based text-mining tools will make them significantly more useful, more informative, and more rewarding for research and teaching.  In this effort, we will build on data-mining expertise at the University of Illinois' Graduate School of Library and Information Science, and on several years of software development work on the D2K (Data to Knowledge) software in Michael Welge's Automated Learning Group at the University of Illinois' National Center for Supercomputing Applications (NCSA).

In order to provide a test-bed for the text-mining tool development, we have negotiated agreements with a number of individual libraries, projects, and centers that hold large collections of full-text humanities resources. Our agreements aim at producing an aggregation that has some scholarly, intellectual, and subject coherence, and they focus on 19th-century British and American literary texts:

  • The Library at the University of North Carolina at Chapel Hill will contribute over 1,000 texts, mostly from the 19th century, all marked up according to the Text Encoding Initiative's Guidelines.
  • The Library at the University of Virginia will contribute 600 to 1200 texts from the Early American Fiction project (the variance depends on clearance from ProQuest to contribute the licensed portion of the collection).
  • The Institute for Advanced Technology in the Humanities at the University of Virginia will contribute about 6,000 texts from its projects on Dante Gabriel Rossetti, Walt Whitman, Emily Dickinson, William Blake, and Mark Twain. Texts from the Valley of the Shadow project on the Civil War may be added at a later date.
  • The University of California at Davis will contribute about 120 texts of 19th-century British women poets.
  • The University of Michigan will provide 175 volumes of American verse, plus literary materials from other collections, such as the Making of America journals, which include titles such as the Southern Literary Messenger, Ladies Repository, Appleton's, and Vanity Fair.
  • Indiana University will provide over 1,100 literary texts from the Wright American Fiction collection, the Victorian Women Writers Project, and a Swinburne project.
  • Brown University's Women Writers Project will contribute 40 literary texts from the 19th century.
  • The Perseus project will contribute its 19th-century literary texts, including several works by Charles Dickens.

These (and other) agreements will provide us with a testbed of about 10,000 literary texts in English from the 19th century, or (roughly estimating 0.5 MB per text) about 5 GB of marked-up text. This is a small amount by comparison with what is available in digital libraries, and it is less than we hope to have aggregated by the second year of the project, but we believe it is large enough to be a meaningful testbed and that it meets minimum requirements of intellectual coherence.
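The storage estimate above can be reproduced with a quick back-of-envelope calculation; the per-text figure of roughly 0.5 MB is the rough average assumed in the text, not a measured value:

```python
# Back-of-envelope size of the testbed: ~10,000 texts at ~0.5 MB each.
num_texts = 10_000        # approximate testbed size from the agreements above
avg_size_mb = 0.5         # rough per-text estimate of marked-up text
total_gb = num_texts * avg_size_mb / 1_000  # treating 1 GB as 1,000 MB

print(f"Estimated testbed size: ~{total_gb:g} GB")  # → ~5 GB
```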

Finally, we have co-located software developers with the humanities researchers who will help design and test the tools by distributing our software development across three sites beyond Illinois: two at institutions with existing humanities research computing centers (Maryland and Virginia), and one (Georgia) with faculty in humanities computing who will contribute to building the infrastructure for the project.  We will also consult regularly with other humanities research computing centers, and with relevant experts from other contexts.