The third Digging into Data challenge funded 9 UK projects, each with an international partner, as stipulated in the funding conditions. This challenge built on the successes of the previous two, and Jisc has been involved in all three. In the UK, Digging into Data Challenge 3 projects were managed by Jisc and sponsored by the Arts and Humanities Research Council (AHRC) and the Economic and Social Research Council (ESRC).
Each of the 9 projects was funded to develop new insights, tools and skills in innovative humanities and social research using large-scale data analysis. In particular, they investigated how computational techniques can be applied to “big data” to change the nature of humanities and social sciences research.
This post summarises the final reports from each of the UK projects.
Automating Data Extraction from Chinese Texts
Lead institution: University of Birmingham
This project was designed as an international and interdisciplinary collaboration (between the UK, US and the Netherlands) to facilitate and promote research techniques for large-scale structured datasets derived from unstructured corpora of Chinese texts. It aimed to reduce ambiguity within and across entity types such as place and personal names and to improve recall and accuracy by developing machine learning approaches.
The team has developed the MARKUS tool into a reading and text analysis platform with a wide range of functionality. MARKUS has sped up data collection, enabling the analysis of large datasets; it has proved relevant to researchers and students beyond the original team and has seen steady growth in its number of users. Because of the range of its functionality and the close modelling of research workflows in its design, MARKUS has also received a warm welcome from students and researchers in Chinese history, literature, art, science and technology, religious studies, and philosophy.
Commonplace Cultures: Mining Shared Passages in the 18th Century using Sequence Alignment and Visual Analytics
Lead institution: University of Oxford
Commonplacing denotes the thematic organisation of quotations and other passages for later recall and reuse. In other words, two similar sequences in texts are potentially commonplaces. Historically, the 18th century can be seen as one of the last in a long line of commonplace cultures extending from Antiquity through the Renaissance and Early Modern periods. Detecting similarity between texts is a frequently encountered text mining task. Because the measurement of similarity is typically composed of a number of metrics, and some measures are sensitive to subjective interpretation, a generic detector obtained using machine learning often has difficulties balancing the roles of different metrics according to the semantic context exhibited in a specific collection of texts.
The aim of the project was to devise a visual interface to enable users to construct and experiment with different detectors using primitive metrics, in a way similar to constructing an image processing pipeline. The research in this project resulted in a software prototype, ViTA. This provides users with over 40 tools, which are grouped into four categories: Word Matching, Language Processing, Visual Processing, and Operator. A typical pipeline is composed of 3-6 tools. Together with the pixelmap visualisation, the pipeline performs text analysis similar to image processing, which is an important novel contribution resulting from this project.
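As a rough illustration of what combining primitive metrics into a detector can mean, the sketch below scores two passages with two simple measures (vocabulary overlap and shared word trigrams) and blends them with user-chosen weights. It is not ViTA's implementation: the function names, the choice of metrics and the weights are all assumptions made for this example.

```python
import re


def tokens(text):
    """Lowercase a passage and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def word_overlap(p, q):
    """Primitive metric 1: overlap of the two vocabularies."""
    return jaccard(set(tokens(p)), set(tokens(q)))


def ngram_overlap(p, q, n=3):
    """Primitive metric 2: overlap of word n-grams, which rewards shared word order."""
    def grams(text):
        t = tokens(text)
        return {tuple(t[i:i + n]) for i in range(len(t) - n + 1)}
    return jaccard(grams(p), grams(q))


def detector(p, q, weights=(0.5, 0.5)):
    """Hypothetical detector: a weighted blend of the two primitive metrics."""
    w_words, w_grams = weights
    return w_words * word_overlap(p, q) + w_grams * ngram_overlap(p, q)


# Invented example passages.
a = "To be, or not to be, that is the question"
b = "to be or not to be: that is the question"
c = "The question of taxation was debated at length"

print(detector(a, b))  # same words in the same order once punctuation is dropped
print(detector(a, c))  # a couple of shared words, no shared trigrams
```

Changing the weights shifts the balance between the two metrics, which is the kind of experimentation the visual interface is meant to make easy for a specific collection of texts.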
Digging Archaeology Data: Image Search and Markup (DADAISM)
Lead institution: University of York
The DADAISM project brought together researchers from the diverse fields of archaeology, human-computer interaction, image processing, image search and retrieval, and text mining to create a rich interactive system that addresses the problem researchers face in finding images relevant to their research. In the age of digital photography, thousands of images are taken of archaeological artefacts. These images could help archaeologists enormously in their tasks of classification and identification if they could be related to one another effectively. However, these images are currently greatly underutilised for two key reasons. Firstly, the current paradigm for interaction with image collections is basic keyword search or, at best, simple faceted search, so it is difficult to translate the artefact seen by the researcher into a form the system can search. Secondly, even where such interactions are possible, the metadata for the majority of images of archaeological artefacts says little about the content of the image or the nature of the artefact, and is time-intensive to enter manually.
DADAISM has delivered a foundation of research, based on interaction design, image processing and text mining, that transforms the way in which archaeologists interact with online image collections. Their prototype system goes well beyond current systems for working with images, and will support archaeologists’ tasks of finding, organising, relating and labelling images as well as other relevant sources of information such as grey literature documents. This prototype will inform future practice with their archaeology partners. The outcomes of this project will provide substantial benefit to the digital cultural heritage community and a platform for further research in the area.
Digging into Linked Parliamentary Data (DiLiPaD)
Lead institution: School of Advanced Study (UoL)
The goals of the DiLiPaD project were to 1) enhance the existing corpora of parliamentary data from the UK (1803 to the present) and Canada (1867 to the present) to the same standards as the comparable Netherlands corpus covering 1814 to the present; 2) develop new tools for the study of this data; 3) explore research questions, both in conjunction with and following on from the enhancement of the data and the provision of these tools. Specifically the project team investigated gender and politics, focusing on the role of women in parliamentary debate, the framing of same-sex marriage and the measuring of emotion in parliamentary debates.
The project has delivered a robust proof of concept for the methods used, and the enhancement of other countries’ parliamentary proceedings in this way can now be shown to be viable and worthwhile. It has also contributed significantly to the development of the Institute of Historical Research (IHR)’s digital research agenda in two main areas. First, it builds on and advances a long track record of research into parliamentary history; and second, it complements other big data projects at the institute, particularly in its focus on linguistic analysis and the secure identification of individuals and concepts within large corpora. Big data research is now at the heart of the IHR’s research strategy for the next five years, and is written into all strategic planning documents. The historians working on the DiLiPaD project set an important precedent by publishing a data-mining paper in a mainstream historical journal, Twentieth Century British History.
Digging into signs: Developing standard annotation practices for cross-linguistic quantitative analysis of sign language data
Lead institution: University College London
For sign languages used by deaf communities, linguistic corpora have until recently been unavailable, due to the lack of a writing system and a written culture in these communities, and the very recent advent of digital video. Recent improvements in video and computer technology have now made larger sign language datasets possible; however, large sign language datasets that are fully machine-readable are still elusive. As sign language corpus building progresses, the potential for some standards in annotation is beginning to emerge. But before this project, there were no attempts to standardise these practices across corpora, which is required to be able to compare data cross-linguistically. This project aimed to 1) develop annotation standards for glosses (lexical/word level); 2) test their reliability and validity; 3) improve current software tools that facilitate a reliable workflow. Overall the project aimed not only to set a standard for the whole field of sign language studies throughout the world but also to make significant advances toward two of the world’s largest machine-readable datasets for sign languages – specifically the BSL Corpus (British Sign Language, http://bslcorpusproject.org) and the Corpus NGT (Sign Language of the Netherlands, http://www.ru.nl/corpusngt).
This project has helped raise awareness of sign language corpora, and of sign languages and deaf communities generally, amongst academics outside this field via the Digging into Data Challenge cross-project meetings and associated meetings (e.g. AHRC Digital Transformations). The project met its objectives of taking the initial steps towards annotation standards for sign language data at the lexical/word level, testing their reliability and validity, and improving the multimedia annotation tool ELAN.
Mining Biodiversity
Lead institution: University of Manchester
The overarching goal of this project was to transform the Biodiversity Heritage Library (BHL), a digital library of over 40 million pages of taxonomic literature, into a next generation social digital resource to facilitate the collaborative study and discussion of legacy biodiversity documents by a worldwide community. In this project, methods for text mining, visualisation and social media analysis were developed to effectively serve BHL users with semantically enriched content.
The resulting digital resource provides access to the full content of BHL library documents via semantically enhanced, interactive browsing and searching capabilities, allowing users to more efficiently locate information of interest to them. This project has instilled a deeper appreciation for text mining in taxonomists, biodiversity informaticians and digital librarians. All of the tools and resources developed as part of the Mining Biodiversity project will be continuously made available to the community via the NaCTeM website.
MIning Relationships Among variables in large datasets from CompLEx systems (MIRACLE)
Lead institution: University of Dundee
Social scientists have used agent-based models (ABMs) to explore the interaction and feedbacks among social agents and their environments. Agent-based models are dynamic computer simulations of human societies and behaviours in which individuals and their interactions are explicitly represented. This bottom-up structure of ABMs enables simulation and investigation of complex systems and their emergent behaviour with a high level of detail. This detail means that such models have a very large number of variables, creating highly multidimensional “big data” that are difficult to analyse using traditional statistical methods, in part because many of the relationships among the variables are nonlinear.
The project addressed this challenge by developing methods and web-based analysis and visualisation tools that provide automated means of discovering complex relationships among variables. The tools enable modellers to easily manage, analyse, visualise, and compare their output data, and provide stakeholders, policy makers and the general public with intuitive web interfaces for exploring and interacting with otherwise difficult-to-understand models and for gaining insight into the real-world case studies they represent.
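To make the agent-based modelling idea above concrete, here is a minimal sketch of a toy "voter model" in which agents on a ring copy a neighbour's opinion. It is not one of the MIRACLE case-study models; the model, the function name `run_model` and its parameters are assumptions chosen only to show how even a tiny simulation records many variables per time step.

```python
import random


def run_model(n_agents=20, n_steps=50, seed=0):
    """Toy voter model: at each step one agent copies a neighbour's opinion."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    opinions = [rng.choice([0, 1]) for _ in range(n_agents)]
    history = []
    for step in range(n_steps):
        i = rng.randrange(n_agents)
        # Agents sit on a ring, so the neighbour index wraps around.
        neighbour = (i + rng.choice([-1, 1])) % n_agents
        opinions[i] = opinions[neighbour]
        history.append({
            "step": step,
            "share_of_ones": sum(opinions) / n_agents,  # aggregate variable
            "opinions": list(opinions),                 # one variable per agent
        })
    return history


history = run_model()
print(len(history), "time steps recorded")
print(len(history[0]["opinions"]), "agent-level variables per step")
```

Repeating such runs across many seeds and parameter settings multiplies the output further, which is where automated analysis and visualisation tools become useful.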
Resurrecting Early Christian Lives: Digging in Papyri in a Digital Age
Lead institution: University of Oxford
This project investigated ancient papyri relevant to the rise of early Christianity within the multi-cultural context of Greco-Roman Egypt. Specifically it examined in detail the complex networks of identity and authority and how Christians saw their new religion as part of their other identities (Greek, Egyptian, Roman, merchant, monk). The rich resource of ancient data came from papyrus documents unearthed from the garbage dumps around the outskirts of Bahnasa in Egypt, known in antiquity as Oxyrhynchus. Building on data from the crowd-sourced transcriptions of the Ancient Lives project, they data mined papyri relevant to early Christianity. To increase the range of their dataset, they developed a transcription tool for Coptic, the final stage of the indigenous language of Egypt, notably used by Christians. They implemented a Coptic language version of Ancient Lives, allowing for the crowd-sourced transcription of these poorly known and unpublished texts, and, for Greek and Coptic texts, developed a unique mining tool for the Ancient Lives database.
The Ancient Lives platform, designed to engage a worldwide user base far beyond the confines of academia, has been recognised as a unique and noteworthy project at the University of Oxford; it was included in Oxford’s ‘Impact Series’, a showcase of significant research projects, and mentioned in the last REF report. Ancient Lives is now an open platform and other institutions with collections will be able to use the site. The data gathered will be made freely available on the project’s GitHub page, which will allow third-party developers to interact with the data and possibly generate ideas that the project did not consider.
Trees and Tweets: Mining Billions to Understand Human Migration and Regional Linguistic Variation
Lead institution: Aston University
This project analysed regional lexical variation and change in Modern American and British English using multi-billion word corpora of geocoded Twitter data collected between 2013 and 2015. Because these are the largest regional corpora ever compiled, their analysis has led to several significant findings. Most notably, by taking advantage of the massive amounts of data available, the team has studied the emergence of new words in more detail than has ever been possible before. In particular, they have developed and applied methods for identifying and mapping new word forms and common sources of lexical innovation in large time-stamped and geocoded corpora. More generally, their research has shown that the relative frequencies of almost all words show clear regional patterns when mapped. This is a surprising result to most people, including linguists, and it challenges standard assumptions about the nature of language variation and change.
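The idea of a word's relative frequency per region can be sketched as follows: count how often the word occurs in each region's posts and divide by the total number of tokens from that region. This is a simplified illustration rather than the project's method; the function name, the whitespace tokenisation, the per-million scaling and the toy data are all assumptions.

```python
from collections import Counter


def relative_frequencies(posts_by_region, word):
    """Return occurrences of `word` per million tokens for each region."""
    result = {}
    for region, posts in posts_by_region.items():
        # Naive tokenisation: lowercase and split on whitespace.
        counts = Counter(tok for post in posts for tok in post.lower().split())
        total = sum(counts.values())
        result[region] = 1_000_000 * counts[word] / total if total else 0.0
    return result


# Invented toy data; the region names and posts are placeholders.
posts_by_region = {
    "region_a": ["y'all coming over later", "see y'all soon"],
    "region_b": ["are you guys coming over", "see you later"],
}

freqs = relative_frequencies(posts_by_region, "y'all")
print(freqs)
```

Normalising by each region's total token count is what makes regions with very different volumes of posts comparable on a map.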
To allow both researchers and the general public to access the results and better understand regional lexical variation, they developed a free web application called Word Mapper that allows anyone to map the 10,000 most common words in American English. In addition to providing tools and data, they successfully disseminated their research through numerous journal articles and conference presentations, and this project will continue to generate research for the foreseeable future. Notably, this research has also led to productive interdisciplinary collaborations with geographers, anthropologists, computer scientists, economists, and physicists. It has also proven to be of considerable interest to the general public, with several aspects of the research having been covered by hundreds of news outlets worldwide. For example, the research on the identification of new words, regional patterns in swearing, and alternation in the use of hesitation markers has been covered by some of the most important international news sources, including The Guardian, The Telegraph, The New York Times, The Washington Post, Time and Popular Science.
This blog was set up for the duration of the Digging into Data 3 challenge. By summarising the final reports from each of the UK projects it brings the third round of Digging into Data to a close. As manager for the programme, I would like to thank all the people involved with each project for their contribution to a successful challenge and wish them all luck in taking forward the outputs of these projects. I would also like to thank ESRC and AHRC for their work with Jisc on Digging into Data.