The Digging into Data Challenge aims to address how “big data” changes the research landscape for the humanities and social sciences. In particular, the four goals of the initiative are:
- to promote the development and deployment of innovative research techniques in large-scale data analysis that focus on applications for the humanities and social sciences;
- to foster interdisciplinary collaboration among researchers in the humanities, social sciences, computer sciences, library, archive, information sciences, and other fields, around questions of text and data analysis;
- to promote international collaboration among both researchers and funders;
- to ensure efficient access to and sharing of the materials for research by working with data repositories that hold large digital collections.
The Challenge is currently in its third round, but reports from the projects funded in round two are available for download from the Jisc repository. This post summarises the round-two projects that had a UK partner institution, drawing on their final reports. A link to each final report in the repository is provided after each project’s summary.
Cascades, Islands, or Streams?
The objective of this project was to create and examine large-scale heterogeneous datasets in order to increase understanding of the scholarly communication system, to identify and analyse the various scholarly activities involved in creating and disseminating new knowledge, and to further develop the software created at the University of Wolverhampton to collect, filter and analyse data from the web and social media in order to discover trends in science and in scholarly communication. The project’s results support the argument that transformations in the scholarly communication system affect not only how scholars interact but also, at least in some cases, the very substance of their communications, because the audience is no longer just other researchers but also the general public.
ChartEx
ChartEx research focussed on extracting information from charters using a combination of natural language processing (NLP) and data mining (DM) to establish entities such as locations and related actors, events and dates. The third crucial component of the project was the use of novel instrumental interaction techniques to design a virtual workbench (VWB) that will allow researchers both to refine the NLP and DM processing and to directly manipulate (visualise, confirm, correct, refine, augment, hypothesise) the relationships extracted from the set of charters, in order to gain new insights about the entities contained within them.
DiggiCORE
According to the project team, “Working on the DiggiCORE project was a truly amazing experience.” The project’s goal was to aggregate, at the level of both metadata and content, a vast set of research publications from institutional repositories, archives (green OA route) and journals (gold OA route) worldwide, and to provide novel tools for automatically enriching this content with relationships (relatedness, citations). The project provided the following outputs:
- A software infrastructure delivered to users as a free web service and as a downloadable dataset that enables the analysis of the behaviour of research communities in the Open Access domain;
- New knowledge and understanding resulting from the data analysis.
Digging by Debating
The Digging by Debating project aimed to extract, model, map and visualise argument from a Big Data repository, such as the HathiTrust Digital Library. It tackled a very ambitious and complex problem: linking macro visual views of science-philosophy data, together with state-of-the-art topic modelling and searching, to semantically rich, argument-based analysis and processing of the data. It made significant steps forward in these areas and in their interconnection, and produced a constellation of loosely integrated tools and methodologies. Ultimately, the team’s efforts show how computational humanities and linguistics can bridge the gulf between the “big data” perspective of first-generation digital humanities and the close readings and critical interpretations of text that are the “bread and butter” of more traditional scholarship.
Digging into Metadata
This project aimed to examine closely the metadata associated with the chosen datasets and to enhance that metadata through a variety of automatic, scalable techniques that built on previous collaborative work. The intention was to use the enhanced metadata to improve search across disparate digital libraries whose subject metadata varied hugely in level and standard, and which had previously been difficult to search in a consistent way. Through this work the team aimed to show, firstly, that their techniques could enhance poor or inconsistent metadata in a meaningful and consistent way and, secondly, that the enhanced metadata could lead to improved search functionality that would add value for end users.
ELVIS
ELVIS stands for Electronic Locator of Vertical Interval Successions and is a large data-driven research project on musical style. The central unifying concept of the project was the study of counterpoint: the way combinations of voices in polyphonic music interact (e.g. the soprano and bass voices in a hymn, or the viola and cello in a string quartet, as well as combinations of more than two voices), and in particular which vertical intervals (notes from two voices sounding at the same time) are permissible for a particular period, genre, or style. These vertical intervals, connected by melodic motions in individual voices, constitute Vertical Interval Successions. In more modern terms, these could be described as harmonic progressions of chords, but what made ELVIS particularly flexible was its ability to bridge the gap to earlier, contrapuntally conceived music by using the dyad (a two-note combination) rather than the triad (a combination of three notes in particular arrangements) as a basis, since triads and larger chords may be expressed as sums of dyads. Existing data, while plentiful, were somewhat messy, with many duplications, errors, and gaps in certain areas of music history, so one task was to consolidate and clean up the data, both by hand and with newly developed error-correction software.
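To make the idea of a vertical interval succession concrete, here is a minimal sketch in Python. It is not the ELVIS software: the function names, the representation of each voice as a list of MIDI note numbers aligned to the same time steps, and the two-voice fragment are all assumptions made for illustration. Each succession pairs one vertical interval with the melodic motion of the lower voice and the vertical interval that follows.

```python
from collections import Counter


def vertical_intervals(upper, lower):
    """Return the distance in semitones between two voices at each time step."""
    if len(upper) != len(lower):
        raise ValueError("voices must be aligned to the same time steps")
    return [u - l for u, l in zip(upper, lower)]


def interval_successions(upper, lower):
    """Return (vertical interval, lower-voice motion, next vertical interval) triples."""
    vertical = vertical_intervals(upper, lower)
    # Melodic motion of the lower voice between consecutive time steps.
    motion = [b - a for a, b in zip(lower, lower[1:])]
    return [(vertical[i], motion[i], vertical[i + 1]) for i in range(len(motion))]


# Hypothetical two-voice fragment as MIDI note numbers (60 = middle C).
soprano = [72, 71, 72, 74, 72]
bass = [60, 67, 64, 65, 60]

successions = interval_successions(soprano, bass)

# Counting successions across a corpus is one simple way to compare styles.
counts = Counter(successions)
```

Measuring intervals in semitones keeps the sketch short. A real analysis would more likely use diatonic interval names and would need to handle rests, rhythm and more than two voices.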
Altogether, the ELVIS project has enabled not only the consolidation of data and toolsets but also the creation of concrete research output at a scale that was previously difficult to achieve. The resulting databank and tools, available through the main website at McGill University, will prove an invaluable resource to musicologists in this field in the years to come.
Imagery Lenses for Visualising Text Corpora
The team of computer scientists, a linguist, and poet/scholars from the University of Oxford and the University of Utah has been working to create, through computation and visualisation, a richer understanding of how poems work: one that relies on computational tools yet embraces both qualitative and quantitative components, and that explicitly engages human readers and the perspectives and research needs specific to the humanities in general and to literature, especially poetry, in particular. The resulting tool, Poem Viewer, approaches poems as complex dynamic systems and represents a significant step toward giving literary scholars the freedom to explore individual poems, bodies of poetry, and other texts of their choosing in ways that traditional scholarship and other text analysis software cannot. In addition to displaying familiar poetic features, such as texts, word frequencies, grammatical classes, and sentiment, Poem Viewer provides a unique capability for visualising poetic sound, including various sonic relationships and changes as they occur in a poem over time.
Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
Research on integrating digital library content with computational tools and services has been concerned with examining, analysing, and finding patterns within a data set. Scholars, on the other hand, associate the people, places and events mentioned in texts with other descriptions elsewhere. Thus, while most computational analysis looks inward to the contexts of a particular set of data, scholars tend to look outward, seeking the context for the texts they are studying.
This project went beyond that inward-looking analysis by developing a prototype system that gives scholars expert-system support in their work. The system integrated large-scale collections, including JSTOR and the book collections of the Internet Archive, stored and managed in a distributed preservation environment. It also incorporated text mining and natural language processing software capable of generating dynamic links to related resources that discuss the same persons, places, and events.
ISHER
The purpose of the ISHER (Integrated Social History Environment for Research) project has been to apply automated text mining methods to large historical data sources and to demonstrate how these methods can transform the working methods of the researcher by providing an accurate and efficient means of locating and exploring information of interest with minimal effort. The project has had a particular focus on detecting information relating to social unrest, although some of the systems and methods described are more widely applicable to other information of interest to social historians. The project partners have applied sophisticated text mining methods to digitised news collections, namely the New York Times (NYT) archive and the National Library of the Netherlands (KB) archive of daily Dutch newspapers, together with the news reports and related discussions that make up the Automatic Content Extraction (ACE) 2005 Evaluation corpus.
A concrete demonstration of the overall success of the project in achieving its goals comes in the form of two fully functional web-based interfaces providing access to the above archives. Each interface provides users with sophisticated features for searching and browsing the collections, based on the output of text mining analyses. The interfaces are an NYT Search interface and an interface that visualises and links strikes. Work was also carried out on the text mining frameworks already used by the partners to increase the interoperability of their components, which enabled, for example, UIUC natural language processing tools to be combined with NaCTeM tools in workflows to process NYT data.
Mining Microdata
This project investigates levels of social mobility in Canada, Great Britain and the United States from 1850 to 1911.
- It uses census records from the 1850s, 1880s, and 1910s to create two panels of men observed in childhood living with their father, and then thirty years later in adulthood.
- It measures social mobility by comparing fathers’ and sons’ occupations at similar points in their lives.
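The comparison of fathers’ and sons’ occupations can be sketched as a simple mobility table. The following Python example is illustrative only and is not the project’s method: the function names, the occupational categories and the five linked father-son pairs are hypothetical, and the "mobility rate" here is simply the share of sons whose occupation differs from their father’s.

```python
from collections import Counter


def mobility_table(pairs):
    """Count how often each (father's occupation, son's occupation) pair occurs."""
    return Counter(pairs)


def mobility_rate(pairs):
    """Return the share of sons whose occupation differs from their father's."""
    if not pairs:
        raise ValueError("at least one linked father-son pair is required")
    moved = sum(1 for father, son in pairs if father != son)
    return moved / len(pairs)


# Hypothetical linked records: (father's occupation, son's occupation 30 years later).
linked_pairs = [
    ("farmer", "farmer"),
    ("farmer", "labourer"),
    ("labourer", "clerk"),
    ("clerk", "clerk"),
    ("labourer", "labourer"),
]

table = mobility_table(linked_pairs)
rate = mobility_rate(linked_pairs)
```

In practice, occupations would first be grouped into a small number of classes, and the hard part is linking the same individuals across censuses, which this sketch takes as given.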
Further information is available from their project website – http://www.miningmicrodata.org/. The final report will be available once the project completes in January 2015.
Trading Consequences
Scholars interested in nineteenth-century global economic history face a voluminous historical record. Conventional approaches to primary source research on the economic and environmental implications of globalised commodity flows typically restrict researchers to specific locations or a small handful of commodities. By taking advantage of cutting-edge computational tools, the project was able to address much larger data sets for historical research, and thereby provided historians with the means to develop new data-driven research questions. In particular, this project has demonstrated that text mining techniques applied to tens of thousands of documents about nineteenth-century commodity trading can yield a novel understanding of how economic forces connected distant places all over the globe and how efforts to generate wealth from natural resources impacted on local environments.
The large-scale findings that result from applying these new methodologies would barely be feasible using conventional research methods. Moreover, the project vividly demonstrates how the digital humanities can benefit from trans-disciplinary collaboration between humanists, computational linguists and information visualisation experts.
Further information on all three phases of the Digging into Data Challenge can be found on the programme page on the Jisc website and the main Digging into Data website.