A summary of DiD3 final reports from the UK projects

The third Digging into Data challenge funded 9 UK projects, each with an international partner, as stipulated in the funding conditions. This challenge built on the successes of the previous two, and Jisc has been involved in all three. In the UK, Digging into Data Challenge 3 projects were managed by Jisc and sponsored by the Arts and Humanities Research Council (AHRC) and the Economic and Social Research Council (ESRC).

Each of the 9 projects was funded to develop new insights, tools and skills in innovative humanities and social research using large-scale data analysis. In particular, they investigated how computational techniques can be applied to “big data” to change the nature of humanities and social sciences research.

This post summarises the final reports from each of the UK projects.

Automating Data Extraction from Chinese Texts

Lead institution: University of Birmingham
Final report

This project was designed as an international and interdisciplinary collaboration (between the UK, US and the Netherlands) to facilitate and promote research techniques for large-scale structured datasets derived from unstructured corpora of Chinese texts. It aimed to reduce ambiguity within and across entity types such as place and personal names and to improve recall and accuracy by developing machine learning approaches.

The project team developed the MARKUS tool into a reading and text analysis platform with a wide range of functionality. MARKUS has sped up the data collection process and so enables the analysis of large datasets; it has shown its relevance to researchers and students beyond the membership of the original team, and has seen steady growth in the number of users. Owing to the range of its functionality and the close modelling of research flows in its design, MARKUS has also received a warm welcome from students and researchers in Chinese history, literature, art, science and technology, religious studies, and philosophy.

Commonplace Cultures: Mining Shared Passages in the 18th Century using Sequence Alignment and Visual Analytics

Lead institution: University of Oxford
Final report

Commonplacing denotes the thematic organisation of quotations and other passages for later recall and reuse. In other words, two similar sequences in texts are potentially commonplaces. Historically, the 18th century can be seen as one of the last in a long line of commonplace cultures extending from Antiquity through the Renaissance and Early Modern periods. Detecting similarity between texts is a frequently encountered text mining task. Because the measurement of similarity is typically composed of a number of metrics, and some measures are sensitive to subjective interpretation, a generic detector obtained using machine learning often has difficulties balancing the roles of different metrics according to the semantic context exhibited in a specific collection of texts.

The aim of the project was to devise a visual interface to enable users to construct and experiment with different detectors using primitive metrics, in a way similar to constructing an image processing pipeline. The research in this project resulted in a software prototype, ViTA. This provides users with over 40 tools, which are grouped into four categories: Word Matching, Language Processing, Visual Processing, and Operator. A typical pipeline is composed of 3-6 tools. Together with the pixelmap visualisation, the pipeline performs text analysis similar to image processing, which is an important novel contribution resulting from this project.
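To make the idea of building a detector from primitive metrics concrete, here is a minimal sketch in Python. It is not ViTA's code: the two metrics (`jaccard`, `order_similarity`), the weights and the threshold are all assumptions chosen for illustration, and the example passages are arbitrary.

```python
from difflib import SequenceMatcher


def tokens(text):
    """Lowercase a passage and split it into alphabetic word tokens."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return cleaned.split()


def jaccard(a, b):
    """Primitive metric 1 (assumed): overlap of the two vocabularies."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def order_similarity(a, b):
    """Primitive metric 2 (assumed): word-level sequence similarity."""
    return SequenceMatcher(None, tokens(a), tokens(b)).ratio()


def make_detector(weighted_metrics, threshold):
    """Combine (metric, weight) pairs into one detector function.

    The detector returns the weighted mean score and whether that
    score reaches the threshold.
    """
    total = sum(weight for _, weight in weighted_metrics)

    def detect(a, b):
        score = sum(weight * metric(a, b) for metric, weight in weighted_metrics) / total
        return score, score >= threshold

    return detect


# Hypothetical configuration: equal weights, threshold of 0.5.
detector = make_detector([(jaccard, 1.0), (order_similarity, 1.0)], threshold=0.5)
same_score, same_flag = detector(
    "The quality of mercy is not strained",
    "the quality of mercy is not strained.",
)
diff_score, diff_flag = detector(
    "The quality of mercy is not strained",
    "Now is the winter of our discontent",
)
```

Changing the weights or swapping in a different metric changes which pairs of passages are flagged, which is the kind of experimentation the visual interface is meant to support.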

Digging Archaeology Data: Image Search and Markup (DADAISM)

Lead institution: University of York
Final report

The DADAISM project brought together researchers from the diverse fields of archaeology, human-computer interaction, image processing, image search and retrieval, and text mining to create a rich interactive system to address the problems of researchers finding images relevant to their research. In the age of digital photography, thousands of images are taken of archaeological artefacts. These images could help archaeologists enormously in their tasks of classification and identification if they could be related to one another effectively. However, these images are currently greatly underutilised for two key reasons. Firstly, the current paradigm for interaction with image collections is basic keyword search or, at best, simple faceted search, so it is difficult to translate the artefact seen by the researcher into a query for the system. Secondly, even if these interactions are possible, the metadata for the majority of images of archaeological artefacts contains little information about the content of the image and the nature of the artefact, and is time-intensive to enter manually.

DADAISM has delivered a foundation of research, based on interaction design, image processing and text mining, that transforms the way in which archaeologists interact with online image collections. Their prototype system goes well beyond current systems for working with images, and will support archaeologists’ tasks of finding, organising, relating and labelling images as well as other relevant sources of information such as grey literature documents. This prototype will inform future practice with their archaeology partners. The outcomes of this project will provide substantial benefit to the digital cultural heritage community, and provide a platform for further research in the area.

Digging into Linked Parliamentary Data (DiLiPaD)

Lead institution: School of Advanced Study (UoL)
Final report

The goals of the DiLiPaD project were to 1) enhance the existing corpora of parliamentary data from the UK (1803 to the present) and Canada (1867 to the present) to the same standards as the comparable Netherlands corpus covering 1814 to the present; 2) develop new tools for the study of this data; 3) explore research questions, both in conjunction with and following on from the enhancement of the data and the provision of these tools. Specifically the project team investigated gender and politics, focusing on the role of women in parliamentary debate, the framing of same-sex marriage and the measuring of emotion in parliamentary debates.

The project has enabled a robust proof of concept for the methods used, and the enhancement of other countries’ parliamentary proceedings in this way can now be shown to be viable and worthwhile. It has also contributed significantly to the development of the Institute of Historical Research (IHR)’s digital research agenda in two main areas. First, it builds on and advances a long track record of research into parliamentary history; and second, it complements other big data projects at the institute, particularly in its focus on linguistic analysis and the secure identification of individuals and concepts within large corpora. Big data research is now at the heart of the IHR’s research strategy for the next five years, and is written into all strategic planning documents. The historians working on the DiLiPaD project set an important precedent by publishing a data-mining paper in a mainstream historical journal, Twentieth Century British History.

Digging into signs: Developing standard annotation practices for cross-linguistic quantitative analysis of sign language data

Lead institution: University College London
Final report

For sign languages used by deaf communities, linguistic corpora have until recently been unavailable, due to the lack of a writing system and a written culture in these communities, and the very recent advent of digital video. Recent improvements in video and computer technology have now made larger sign language datasets possible; however, large sign language datasets that are fully machine-readable are still elusive. As sign language corpus building progresses, the potential for some standards in annotation is beginning to emerge. But before this project, there were no attempts to standardise these practices across corpora, which is required to be able to compare data cross-linguistically. This project aimed to 1) develop annotation standards for glosses (lexical/word level); 2) test their reliability and validity; 3) improve current software tools that facilitate a reliable workflow. Overall the project aimed not only to set a standard for the whole field of sign language studies throughout the world but also to make significant advances toward two of the world’s largest machine-readable datasets for sign languages – specifically the BSL Corpus (British Sign Language, http://bslcorpusproject.org) and the Corpus NGT (Sign Language of the Netherlands, http://www.ru.nl/corpusngt).

This project has helped raise awareness of sign language corpora, and of sign languages and deaf communities generally, amongst academics outside this field via the Digging into Data Challenge cross-project meetings and associated meetings (e.g. AHRC Digital Transformations). The project met its objectives of creating the initial steps towards annotation standards for sign language data at the lexical/word level, testing their reliability and validity, and improving the multimedia annotation tool ELAN.

Mining Biodiversity

Lead institution: University of Manchester
Final report

The overarching goal of this project was to transform the Biodiversity Heritage Library (BHL), a digital library of over 40 million pages of taxonomic literature, into a next generation social digital resource to facilitate the collaborative study and discussion of legacy biodiversity documents by a worldwide community. In this project, methods for text mining, visualisation and social media analysis were developed to effectively serve BHL users with semantically enriched content.

The resulting digital resource provides access to the full content of BHL library documents via semantically enhanced, interactive browsing and searching capabilities, allowing users to more efficiently locate information of interest to them. This project has instilled a deeper appreciation for text mining in taxonomists, biodiversity informaticians and digital librarians. All of the tools and resources developed as part of the Mining Biodiversity project will be continuously made available to the community via the NaCTeM website.

MIning Relationships Among variables in large datasets from CompLEx systems (MIRACLE)

Lead institution: University of Dundee
Final report

Social scientists have used agent-based models (ABMs) to explore the interaction and feedbacks among social agents and their environments. Agent-based models are dynamic computer simulations of human societies and behaviours in which individuals and their interactions are explicitly represented. This bottom-up structure of ABMs enables simulation and investigation of complex systems and their emergent behaviour with a high level of detail. This detail means that such models have a very large number of variables, creating highly multidimensional “big data” that are difficult to analyse using traditional statistical methods, in part because many of the relationships among the variables are nonlinear.
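A small worked example shows why nonlinear relationships defeat a traditional linear measure. The sketch below is only an illustration with made-up numbers, not anything taken from the MIRACLE tools: it computes Pearson's correlation by hand for a variable that depends perfectly, but quadratically, on another.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Illustrative data: y is fully determined by x in both cases.
xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]   # straight-line dependence
quadratic = [x * x for x in xs]    # symmetric, nonlinear dependence

r_linear = pearson(xs, linear)        # close to 1
r_quadratic = pearson(xs, quadratic)  # 0: the linear measure sees nothing
```

With many output variables, relationships like the quadratic one can easily go unnoticed if only linear summaries are inspected.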

The project addressed this challenge by developing methods and web-based analysis and visualisation tools that provide automated means of discovering complex relationships among variables. The tools enable modellers to easily manage, analyse, visualise, and compare their output data. They also provide stakeholders, policy makers and the general public with intuitive web interfaces to explore and interact with otherwise difficult-to-understand models, and to gain insights into the real-world case studies they represent.

Resurrecting Early Christian Lives: Digging in Papyri in a Digital Age

Lead institution: University of Oxford
Final report

This project investigated ancient papyri relevant to the rise of early Christianity within the multi-cultural context of Greco-Roman Egypt. Specifically it examined in detail the complex networks of identity and authority and how Christians saw their new religion as part of their other identities (Greek, Egyptian, Roman, merchant, monk). The rich resource of ancient data came from papyrus documents unearthed from the garbage dumps around the outskirts of Bahnasa in Egypt, known in antiquity as Oxyrhynchus. Building on data from the crowd-sourced transcriptions of the Ancient Lives project, they data mined papyri relevant to early Christianity. To increase the range of their dataset, they developed a transcription tool for Coptic, the final stage of the indigenous language of Egypt, notably used by Christians. They implemented a Coptic language version of Ancient Lives, allowing for the crowd-sourced transcription of these poorly known and unpublished texts, and, for Greek and Coptic texts, developed a unique mining tool for the Ancient Lives database.

The Ancient Lives platform, since it is designed to engage a worldwide user base far beyond the confines of academia, has been recognized as a unique and noteworthy project at the University of Oxford; it was even included in Oxford’s ‘Impact Series’, a showcase of significant research projects, and mentioned in the last REF report. Ancient Lives is now an open platform and other institutions with collections will be able to use the site. The data gathered will be made freely available on the project’s GitHub page, which will allow third party developers to interact with the data and possibly generate ideas that the project did not consider.

Trees and Tweets: Mining Billions to Understand Human Migration and Regional Linguistic Variation

Lead institution: Aston University
Final report

This project focused on analysing regional lexical variation and change in Modern American and British English through the analysis of multi-billion word corpora of geocoded Twitter data collected between 2013 and 2015. Because these are the largest regional corpora ever compiled, their analysis has led to several significant findings. Most notably, by taking advantage of the massive amounts of data available, they have studied the emergence of new words in more detail than has ever been possible before. In particular, they have developed and applied methods for identifying and mapping new word forms and common sources of lexical innovation in large time-stamped and geo-coded corpora. More generally, their research has shown that the relative frequency of almost all words shows clear regional patterns when mapped. This is a surprising result to most people, including linguists, and it challenges standard assumptions about the nature of language variation and change.
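As a rough illustration of what "relative frequency by region" means, the sketch below counts how often a word occurs per million tokens in each region. It is an assumption-laden toy, not the project's method: the `(region, text)` input format, the whitespace tokenisation and the sample posts are all invented for the example.

```python
from collections import Counter


def relative_frequencies(posts, word):
    """Return {region: occurrences of `word` per million tokens}.

    `posts` is an iterable of (region, text) pairs; text is split on
    whitespace after lowercasing, which is a deliberate simplification.
    """
    totals = Counter()  # tokens seen per region
    hits = Counter()    # occurrences of the target word per region
    for region, text in posts:
        toks = text.lower().split()
        totals[region] += len(toks)
        hits[region] += toks.count(word)
    return {
        region: 1_000_000 * hits[region] / totals[region]
        for region in totals
        if totals[region] > 0
    }


# Invented sample data for two hypothetical regions.
sample_posts = [
    ("Region A", "um I think so"),
    ("Region A", "Um yes"),
    ("Region B", "uh no way"),
]
um_rates = relative_frequencies(sample_posts, "um")
```

Normalising by the number of tokens per region is what makes regions with very different amounts of data comparable on a map.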

To allow both researchers and the general public to access the results and better understand regional lexical variation, they developed a free web application called Word Mapper that allows anyone to map the 10,000 most common words in American English. In addition to providing tools and data, they successfully disseminated their research through numerous journal articles and conference presentations, and this project will continue to generate research for the foreseeable future. Notably, this research has also led to productive interdisciplinary collaborations with geographers, anthropologists, computer scientists, economists, and physicists. It has also proven to be of considerable interest to the general public, with several aspects of the research having been covered by hundreds of news outlets worldwide. For example, the research on the identification of new words, regional patterns in swearing, and alternation in the use of hesitation markers has been covered by some of the most important international news sources, including The Guardian, The Telegraph, The New York Times, The Washington Post, Time and Popular Science.


This blog was set up for the duration of the Digging into Data 3 challenge. By summarising the final reports from each of the UK projects it brings the third round of Digging into Data to a close. As manager for the programme, I would like to thank all the people involved with each project for their contribution to a successful challenge and wish them all luck in taking forward the outputs of these projects. I would also like to thank ESRC and AHRC for their work with Jisc on Digging into Data.

To read about further challenges, funded through the Trans-Atlantic Platform, please visit the main Digging into Data website.

Social Media Data in Research

An ESRC-convened group looking at Big Data, in particular new forms of data such as social media data, is studying the use of social media for social research. The group is led by Dave De Roure (Oxford e-Research Centre). A survey has recently been launched to help the group learn more about how the UK Social Science research community experiences and responds to the challenges of working with social media data. This gathering of evidence about social media data will inform decision-making and build best practice in the research community.

The survey is now live (https://www.isurvey.soton.ac.uk/18266) and the group seeks responses from anyone conducting research with social media data. The survey closes mid-December and the group will be reporting in the New Year.

This study is relevant to the current third round of the Jisc-ESRC-AHRC funded Digging into Data Challenge (http://diggingintodata.org/), in particular to projects (http://did3.jiscinvolve.org/wp/) that are using social media data for their research. Trees and Tweets is one such project. It is a joint effort between Aston University and the University of South Carolina. The team at Aston University has focussed on the analysis of dialect variation based on a corpus of billions of tweets, while the team at the University of South Carolina is looking at the analysis of migration patterns based on a dataset consisting of millions of family trees. The analysis of large Twitter datasets has produced some interesting results that weren’t anticipated when the project team initially submitted its proposal. These results have caught the attention of the media; one example is the use of “um” and “uh” across the US (http://www.theatlantic.com/magazine/archive/2014/12/things-that-make-you-go-um/382243/). More information about this analysis is on the project’s blog (https://sites.google.com/site/jackgrieveaston/treesandtweets), as well as information on the aggregation of swearing data and visual representations of how the use of new words spreads geographically.

The Collaborative Online Social Media Observatory (COSMOS) has been analysing social media and data mining for a number of years. Originally funded under Jisc’s Virtual Research Environment (VRE) programme, the project has grown and received further funding from the ESRC to see if Big Social Data can predict offline social phenomena. The project has brought together social, computer, political, health and mathematical scientists to study the methodological, theoretical, and empirical dimensions of Big Data in technical, social and policy contexts. Much of the analysis of social media data has been in the contexts of Societal Safety and Security e.g. social tension, hate speech, crime reporting and fear of crime, and suicidal ideation. The COSMOS system has been used to provide the BBC’s Radio 5 Live with a chart based on the biggest impact stories across social media and online. Using its specially developed unique algorithm it analyses key words and hashtags in Twitter to evaluate and rank the impact of each.

The above examples show how the analysis of social media is producing valuable research. If you are a researcher working with social media, please complete the survey so that your views can be represented in the report.

This post also appears on the Research Data Management blog.

Progress Meeting – 17 June 2015

On Wednesday 17th June, the UK projects funded under round 3 of the Digging into Data challenge gathered at Paddington for the mid-term progress meeting. This workshop gave projects the opportunity to present not just their progress but also their highlights, issues and challenges, and to share this information with the funders and other projects.

Rather than have a day of listening to presentations, the workshop was split into two parts. In the first part, after my introduction to the day, projects gave 10 minute presentations followed by 5 minutes of questions. The second part was more workshop focussed: we discussed generic issues and challenges, and heard about the projects’ future plans, both for the second half of the project and post-funding.

You can read my notes for each project’s presentation, followed by a summary of the later discussion, on the Progress Meeting page. As there are 9 project presentations there’s a fair amount to read through. The slides are available in the Jisc Repository under the event ‘Digging into Data 3 Progress Meeting’.


An AHRC Perspective on the Big Humanities Data Workshop

In the UK, Digging into Data phase 3 is funded by AHRC, ESRC and Jisc. Over the next few months each funder will be writing a blog post relevant to Digging into Data. Last October, Christie Walker from AHRC attended the Big Humanities Data Workshop in the USA and she has written the following post about the workshop.

The second Big Humanities Data Workshop took place on 27 October 2014 at the IEEE International Conference on Big Data in Washington D.C. The workshop was attended by a number of academics and funders, including AHRC from the UK, the National Endowment for the Humanities and the Institute of Museum and Library Services from the US, and the Social Sciences and Humanities Research Council from Canada.

The workshop began with an interesting keynote from Michael Levy (Director of Digital Collections) and Michael Haley Goldman (Director of Global Classroom and Evaluation) from the United States Holocaust Museum. Levy and Haley Goldman spoke about the opportunities that big humanities data, new techniques and tools can provide in Holocaust research and education.

The workshop papers covered several themes:

  • Complexity / Scale / Historical Analysis
  • News / Film
  • Frameworks / Infrastructure
  • Geospatial / Mobile
  • Digging into Data

A total of 16 papers were presented at the workshop, and Digging into Data had a strong presence with 7 papers selected. The Digging into Data presentations represented a variety of methods, data types and challenges for the arts, humanities and social sciences:

  • Mining Microdata: Economic Opportunity and Spatial Mobility in Britain and the United States, 1850-1881 (DiD round 2), presented by Evan Roberts – University of Minnesota
  • ‘Understanding the Role of Medical Experts during a Public Health Crisis: Digital Tools and Library Resources for Research on the 1918 Spanish Influenza’, presented by Tom Ewing – Virginia Tech (An Epidemiology of Information: Data Mining the 1918 Influenza Pandemic, DiD round 2)
  • ‘Scaled Entity Search: A Method for Media Historiography and Response to Critiques of Big Humanities Data Research’, presented by Kit Hughes – University of Wisconsin (Project Arclight: Analytics for the Study of 20th Century Media, DiD Round 3)
  • ‘A Computational Pipeline for Crowdsourced Transcriptions of Ancient Greek Papyrus Fragments’, presented by James Brusuelas – University of Oxford (Resurrecting Early Christian Lives: Digging in Papyri in a Digital Age, DiD round 3)
  • ‘Scientific Findings as Big Data for Research Synthesis: The metaBUS Project’, presented by Frank Bosco – Virginia Commonwealth University (Field Mapping: An Archival Protocol for Social Science Research Findings, DiD round 3)
  • ‘Metadata Infrastructure for the Analysis of Parliamentary Proceedings’, presented by Richard Gartner – King’s College London (Digging into Linked Parliamentary Data, DiD round 3)
  • Integrating Data Mining and Data Management Technologies for Scholarly Inquiry (DiD round 2), presented by Richard Marciano – University of Maryland

The workshop concluded with a Funders panel and discussion chaired by Professor Andrew Prescott (University of Glasgow). Brett Bobley (NEH), Bob Horton (IMLS), Crystal Sissons (SSHRC) and Christie Walker (AHRC) discussed their organisations’ approach to big data and funding more generally.

The Big Humanities workshop is unique in that it takes place with the backdrop of a very technical big data conference. However, it highlights to both workshop participants and to the wider IEEE Big Data conference that the arts, humanities and social sciences have a great deal to bring to the conversation about big data and that these disciplines bring their own big data challenges to the table. The workshop generated a lot of very interesting discussion, both in the workshop and beyond.


Things that make you go “um”

Image courtesy Post Typography/The Atlantic

The Trees and Tweets project is once again in the news. This time the project features in this article in The Atlantic – Things that make you go “um”.

The article discusses how the linguists on the Trees and Tweets project team analysed Twitter data to learn about how both men and women, from different US regions, use words like “um” and “uh.”

You can find out more about the project at the Trees and Tweets Dialect Project Blog and other media interest in this analysis in this previous post.

Digging into Data Phase 2 Projects – Summary and Reports

The Digging into Data Challenge aims to address how “big data” changes the research landscape for the humanities and social sciences. In particular, the four goals of the initiative are:

  • to promote the development and deployment of innovative research techniques in large-scale data analysis that focus on applications for the humanities and social sciences;
  • to foster interdisciplinary collaboration among researchers in the humanities, social sciences, computer sciences, library, archive, information sciences, and other fields, around questions of text and data analysis;
  • to promote international collaboration among both researchers and funders;
  • to ensure efficient access to and sharing of the materials for research by working with data repositories that hold large digital collections.

The Challenge is currently in its third round but reports from the projects funded in round two are available for download from the Jisc repository. This post is a summary of those projects, with a UK partner institution, extracted from their final reports. A link to each final report, in the Repository, is provided after each project’s summary.

Cascades, Islands, or Streams?
The objective of this project was to create and examine large-scale heterogeneous datasets to increase understanding of the scholarly communication system, and to identify and analyse various scholarly activities for creating and disseminating new knowledge. It also aimed to further develop the innovative computer software developed at the University of Wolverhampton to collect, filter and analyse data from the web and social media to discover trends in science and in scholarly communication. The results from the project present an argument that transformations in the scholarly communication system affect not only how scholars interact, but also the very substance of these communications, at least in some cases, as the audience for the communications is no longer just other researchers but the general public.
Final Report

ChartEx research focussed on the extraction of information from charters using a combination of natural language processing (NLP) and data mining (DM) to establish entities such as locations and related actors, events and dates. The third crucial component of the ChartEx Project was the use of novel instrumental interaction techniques to design a virtual workbench (VWB) that will allow researchers to both refine the processing of the NLP and DM, and to directly manipulate (visualise, confirm, correct, refine, augment, hypothesise) relationships extracted from the set of charters to gain new insights about the entities contained within them.
Final Report

According to the project team, “Working on the DiggiCORE project was a truly amazing experience.” Its goal was to aggregate, at the level of both metadata and content, a vast set of research publications, from institutional repositories, archives (green OA route) and journals (gold OA route) worldwide, and provide novel tools for automatic enrichment of this content with relationships (relatedness, citations). The project provided the following outputs:

  • A software infrastructure delivered to users as a free web service and as a downloadable dataset that enables the analysis of the behaviour of research communities in the Open Access domain;
  • New knowledge and understanding resulting from the data analysis.

Final Report

Digging by Debating
The Digging by Debating project aimed to extract, model, map and visualise argument from a Big Data repository, such as the Hathi Trust Digital Library. It tackled a very ambitious and complex problem, of linking macro visual views of science-philosophy data and state-of-the-art topic modelling and searching to semantically rich analysis and processing (based on argument) of the data. It made significant steps forward in these areas and their interconnection, and produced a constellation of loosely integrated tools and methodologies in this respect. Ultimately their efforts show how computational humanities and linguistics can bridge the gulf between the “big data” perspective of first-generation digital humanities and the close readings and critical interpretations of text that are the “bread and butter” of more traditional scholarship.
Final Report

Digging into Metadata
This project aimed to closely examine the metadata associated with the chosen datasets and enhance that metadata through a variety of automatic, scalable techniques which built on previous collaborative work. Through this enhanced metadata the intention was to enable improved search capability over disparate digital libraries which had hugely varying levels and standards of subject metadata and which would previously have been difficult to search in a consistent way. Through this work they aimed to show firstly that their techniques could enhance poor or inconsistent metadata in a meaningful and consistent way and secondly that this enhanced metadata could lead to improved search functionality which would add value for end users.
Final Report

ELVIS stands for Electronic Locator of Vertical Interval Successions and is a large data-driven research project on musical style. The central unifying concept of the ELVIS project was to study counterpoint: the way combinations of voices in polyphonic music interact (e.g. the soprano and bass voices in a hymn, or the viola and cello in a string quartet, as well as combinations of more than two voices). In other words, it asked which vertical intervals (notes from two voices sounding at the same time) are permissible for a particular period, genre, or style. These vertical intervals, connected by melodic motions in individual voices, constitute Vertical Interval Successions. In more modern terms, this could be described as harmonic progressions of chords, but what made ELVIS particularly flexible was its ability to bridge the gap to earlier, contrapuntally-conceived music by using the dyad (a two-note combination) rather than the triad (a combination of three notes in particular arrangements) as a basis, since triads and beyond may be expressed as sums of dyads. Existing data, while plentiful, were somewhat messy, with many duplications, errors, and gaps in certain areas of music history, so one task was the consolidation and cleaning-up of the data both by hand and with newly developed error-correction software.
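The notion of a vertical interval succession can be sketched in a few lines of Python. This is an illustration only, not the ELVIS software: it assumes each voice is a list of MIDI pitch numbers aligned note against note, and it measures intervals in semitones, whereas the project's own tools may well use diatonic interval names.

```python
def vertical_intervals(upper, lower):
    """Semitone distance between two voices at each shared time step."""
    if len(upper) != len(lower):
        raise ValueError("voices must be aligned note against note")
    return [u - l for u, l in zip(upper, lower)]


def interval_successions(upper, lower):
    """Return (interval, lower-voice motion, next interval) triples.

    Each triple links two consecutive vertical intervals by the melodic
    motion, in semitones, of the lower voice between them.
    """
    intervals = vertical_intervals(upper, lower)
    return [
        (intervals[i], lower[i + 1] - lower[i], intervals[i + 1])
        for i in range(len(intervals) - 1)
    ]


# Invented two-voice fragment: C5-D5-E5 over C4-B3-C4 (MIDI numbers).
soprano = [72, 74, 76]
bass = [60, 59, 60]

intervals = vertical_intervals(soprano, bass)       # [12, 15, 16]
successions = interval_successions(soprano, bass)   # [(12, -1, 15), (15, 1, 16)]
```

Counting how often each triple occurs across a corpus is one simple way such successions could be compared between periods or styles.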

Altogether, the ELVIS project has enabled not only the consolidation of data and toolsets, but also the creation of concrete research output at a level that was previously difficult to reach. The resulting databank and tools available through the main website at McGill University will prove an invaluable resource to musicologists in this field in the years to come.
Final Report

Imagery Lenses for Visualising Text Corpora
The team of computer scientists, a linguist, and poet/scholars from the University of Oxford and the University of Utah has been working to create, through computation and visualisation, a richer understanding of how poems work. This understanding relies on computational tools yet embraces both qualitative and quantitative components, and it explicitly engages human readers and the perspectives and research needs specific to the humanities in general and to literature, especially poetry, in particular. Their new tool, Poem Viewer, approaches poems as complex dynamic systems and represents a significant step toward giving literary scholars the freedom to explore individual poems, bodies of poetry, and other texts of their choosing in ways that traditional scholarship and other text analysis software cannot. In addition to displaying familiar poetic features, such as texts, word frequencies, grammatical classes, and sentiment, Poem Viewer provides a unique capability for visualising poetic sound, including various sonic relationships and changes as they occur in a poem over time.
Final Report

Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
Research on integrating digital library content with computational tools and services has been concerned with examining, analysing, and finding patterns within a data set. Scholars, on the other hand, associate the people, places and events mentioned in texts with other descriptions elsewhere. Thus, while most computational analysis looks inward to the contexts of a particular set of data, scholars tend to look outward, seeking the context for the texts they are studying.

This project went beyond this basic analysis by developing a prototype system that gives scholars expert system support in their work. The system integrated large-scale collections, including JSTOR and the book collections of the Internet Archive, stored and managed in a distributed preservation environment. It also incorporated text mining and Natural Language Processing software capable of generating dynamic links to related resources discussing the same persons, places, and events.
Final Report

The purpose of the ISHER (Integrated Social History Environment for Research) project has been to apply automated text mining methods to large historical data sources, to demonstrate how these can result in a transformation of the working methods of the researcher, providing an accurate and efficient means of locating and exploring information of interest, with the minimum effort. The project has had a particular focus on the detection of information relating to social unrest, although some of the systems and methods described are more widely applicable to other information of interest to social historians. The project partners have applied sophisticated text mining methods to digitised collections relating to news, i.e., the New York Times (NYT) archive and the National Library of the Netherlands (KB) daily Dutch newspapers archive, together with news reports and related discussions comprising the Automatic Content Extraction (ACE) 2005 Evaluation corpus.

A concrete demonstration of the overall success of the project in achieving its goals comes in the form of two fully functional Web-based interfaces, providing access to the above archives. Each of these interfaces provides users with sophisticated features for searching and browsing the collections, based on the output of text mining analyses. The interfaces are a NYT Search interface and an interface that visualises and links strikes. Work was also carried out in relation to existing text mining frameworks being used by the partners to increase interoperability of components, which enabled, for example, UIUC natural language processing tools to be composed with NaCTeM tools in workflows to process NYT data.
Final Report

Mining Microdata
This project investigates levels of social mobility in Canada, Great Britain and the United States from 1850 to 1911.

  • It uses census records from the 1850s, 1880s, and 1910s to create two panels of men observed in childhood living with their father, and then thirty years later in adulthood.
  • It measures social mobility by comparing fathers’ and sons’ occupations at similar points in their lives.
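As a rough illustration of the comparison described in these two points, the sketch below tabulates linked father and son occupations and computes the share of sons who ended up in a different occupational class from their fathers. It is a minimal sketch under stated assumptions: the class labels, the function names, and the four linked pairs are invented for illustration and are not taken from the project's data or methods.

```python
from collections import Counter


def mobility_table(pairs):
    """Count how many sons fall in each (father's class, son's class) cell."""
    return Counter(pairs)


def mobility_rate(pairs):
    """Share of sons whose occupational class differs from their father's."""
    if not pairs:
        raise ValueError("need at least one linked father-son pair")
    moved = sum(1 for father, son in pairs if father != son)
    return moved / len(pairs)


# Invented example: each tuple is (father's class, son's class thirty years later).
linked_pairs = [
    ("farmer", "farmer"),
    ("farmer", "skilled"),
    ("unskilled", "skilled"),
    ("skilled", "skilled"),
]

table = mobility_table(linked_pairs)
rate = mobility_rate(linked_pairs)  # 2 of the 4 sons changed class
```

Comparing such rates between countries or between the two panels would be one simple way to read the resulting tables.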

Further information is available from their project website – http://www.miningmicrodata.org/. The final report will be available once the project completes in January 2015.

Trading Consequences
Scholars interested in nineteenth-century global economic history face a voluminous historical record. Conventional approaches to primary source research on the economic and environmental implications of globalised commodity flows typically restrict researchers to specific locations or a small handful of commodities. By taking advantage of cutting-edge computational tools, the project was able to address much larger data sets for historical research, and thereby gave historians the means to develop new data-driven research questions. In particular, this project has demonstrated that text mining techniques applied to tens of thousands of documents about nineteenth-century commodity trading can yield a novel understanding of how economic forces connected distant places all over the globe and how efforts to generate wealth from natural resources affected local environments.

The large-scale findings that result from the application of these new methodologies would be barely feasible using conventional research methods. Moreover, the project vividly demonstrates how the digital humanities can benefit from trans-disciplinary collaboration between humanists, computational linguists and information visualisation experts.
Final Report

Further information on all three phases of the Digging into Data Challenge can be found on the programme page on the Jisc website and the main Digging into Data website.

New Digging into Data website

The new Digging into Data website (http://www.diggingintodata.org) is up and running. It’s been rebuilt from scratch and I am reliably informed that it’s mobile friendly.

This is a timely release as Monday (27 October) sees the start of the IEEE International Conference on Big Data. PIs from seven Digging into Data projects will be presenting papers at the Big Humanities Workshop. AHRC, one of the UK funders, will be in attendance and has volunteered to write about the workshop for this blog.

Trees and Tweets in the media

Although phase 3 projects have only recently started, some have already received interest in the media. The Trees and Tweets project is conducting an analysis of dialect variation based on a corpus of billions of tweets and an analysis of migration patterns based on a dataset consisting of millions of family trees.

At the Methods in Dialectology XV conference in Groningen, the Netherlands, the Project Manager for Trees and Tweets, Jack Grieve, presented some of the first results of their study. He used the data to illustrate the application of some advanced spatial methods for dialectology and produced some quick maps for the popular linguistics blog Language Log. The maps show significant geographical variation in the use of “um” and “uh” across the USA, a good example of something the team can only now do with this type of data.

Following on from this blog post, qz.com produced the following article on the results: Um, here’s an, uh, map that shows where Americans use “um” vs. “uh”.

You can find out more about the project at the Trees and Tweets Dialect Project Blog.

Digging into Data phase two evaluation ITT

Jisc is seeking to commission a robust and independent formative evaluation report to help guide the future direction of the Digging into Data Challenge.

Jisc (on behalf of Jisc, ESRC and AHRC) invites tenders for an evaluation of the Digging into Data Challenge focussing on phase 2 and emergent lessons from phase 3.

The aim of the evaluation is to produce a report which will:

  1. Evaluate the objectives of DiD2 and explore whether they have been met through the DiD2 projects;
  2. Assess whether the recommendations from the CLIR report have been delivered through the DiD2 projects and are likely to be met through DiD3 and its current cohort of projects;
  3. Capture the advantages gained and lessons learned from projects to date (DiD2 and DiD3) including examples of the benefits of international collaboration;
  4. Look forward to phase four of the Digging into Data Challenge and how this can fit in with the strategy and requirements of ESRC, AHRC and Jisc, including analysing how a fourth phase could relate to other investments in the big data and analytics area that are underway or planned to support education and research.

The deadline for tenders is 12 noon UK time on 6 October 2014.

The work under this contract should commence on or around 27 October 2014 and should be completed by 23 January 2015. It is expected that this work will require up to approximately 50 days' effort.

To download the ITT (in PDF format) visit the DiD2 Evaluation page. For more information about the challenge and projects, see the Digging into Data programme page.



Welcome to the Digging into Data Challenge Phase 3 blog.

On January 15, 2014 ten international research funders from four countries jointly announced the winners of the third Digging into Data Challenge, a competition to develop new insights, tools and skills in innovative humanities and social science research using large-scale data analysis.

Fourteen teams representing Canada, the Netherlands, the United Kingdom, and the United States received grants to investigate how computational techniques can be applied to “big data” to change the nature of humanities and social sciences research. Each team represents a collaboration among scholars, scientists, and information professionals from leading universities and libraries in Europe and North America.

On the main Digging into Data Challenge website you will find information about all three phases and the projects that have been funded in each phase. This includes details on all 14 international projects funded under phase 3. One of the requirements for these projects is that they involve international collaboration. Of these 14 projects, 9 are led by UK institutions.

The purpose of this blog is to provide details, news and updates pertaining to these UK projects. Details of all 9 UK-led projects are available on the Projects menu. For information about programme meetings, such as the programme start-up meeting, see the Meetings menu.

The funding for the UK institutions comes from the Arts and Humanities Research Council (AHRC) and the Economic and Social Research Council (ESRC). Jisc is contributing in this phase by providing programme management and support.

If you would like to find out more about phases 1 and 2, please see the main Digging into Data Challenge website and the Jisc programme page.