Social Media Data in Research

An ESRC-convened group on Big Data, led by Dave De Roure (Oxford e-Research Centre), is studying the use of new forms of data, in particular social media data, for social research. A survey has recently been launched to help the group learn more about how the UK social science research community experiences and responds to the challenges of working with social media data. This gathering of evidence about social media data will inform decision-making and help build best practice in the research community.

The survey is now live (https://www.isurvey.soton.ac.uk/18266) and the group seeks responses from anyone conducting research with social media data. The survey closes in mid-December, and the group will report in the New Year.

This study is relevant to the current third round of the Jisc-ESRC-AHRC funded Digging into Data Challenge (http://diggingintodata.org/), in particular to the projects (http://did3.jiscinvolve.org/wp/) using social media data in their research. Trees and Tweets is one such project, a joint effort between Aston University and the University of South Carolina. The team at Aston University has focussed on the analysis of dialect variation based on a corpus of billions of tweets, while the team at the University of South Carolina is analysing migration patterns based on a dataset of millions of family trees. The analysis of large Twitter datasets has produced some interesting results that weren’t anticipated when the project submitted its proposal, and these results have caught the attention of the media; for example, a study of the use of “um” and “uh” across the US (http://www.theatlantic.com/magazine/archive/2014/12/things-that-make-you-go-um/382243/). More information about this analysis, along with aggregated swearing data and visual representations of how new words spread geographically, is on the project’s blog (https://sites.google.com/site/jackgrieveaston/treesandtweets).
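
To give a flavour of this kind of analysis, here is a minimal Python sketch (emphatically not the project’s actual pipeline) of the aggregation behind the “um”/“uh” maps: for each region, count the two fillers and compute the share that is “um”. All data here is invented for illustration.

```python
# A minimal sketch of regional filler-word aggregation; not the project's
# actual code. A real corpus would hold billions of geotagged tweets.
from collections import Counter, defaultdict

# Hypothetical (state, tweet_text) pairs.
tweets = [
    ("TX", "uh I guess so"),
    ("TX", "um maybe not"),
    ("MN", "um that's interesting"),
    ("MN", "um well um okay"),
]

counts = defaultdict(Counter)
for state, text in tweets:
    for token in text.lower().split():
        if token in ("um", "uh"):
            counts[state][token] += 1

for state, c in sorted(counts.items()):
    total = c["um"] + c["uh"]
    print(f"{state}: um share = {c['um'] / total:.2f} ({total} filler tokens)")
```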

The Collaborative Online Social Media Observatory (COSMOS) has been analysing and mining social media data for a number of years. Originally funded under Jisc’s Virtual Research Environment (VRE) programme, the project has grown and received further funding from the ESRC to investigate whether Big Social Data can predict offline social phenomena. The project has brought together social, computer, political, health and mathematical scientists to study the methodological, theoretical and empirical dimensions of Big Data in technical, social and policy contexts. Much of the analysis of social media data has been in the context of Societal Safety and Security, e.g. social tension, hate speech, crime reporting and fear of crime, and suicidal ideation. The COSMOS system has been used to provide BBC Radio 5 Live with a chart of the biggest-impact stories across social media and online: a specially developed algorithm analyses key words and hashtags on Twitter to evaluate and rank the impact of each story.
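
The COSMOS ranking algorithm itself is not described in this post, so the following Python sketch is only a toy stand-in for the idea: score each hashtag by mention volume weighted by the number of distinct users mentioning it, then rank. All names and numbers are invented.

```python
# Toy impact ranking; not the COSMOS algorithm, which is not public here.
from collections import defaultdict

# Hypothetical (user, hashtags) records extracted from a Twitter sample.
tweets = [
    ("alice", ["#budget2015"]),
    ("bob",   ["#budget2015", "#election"]),
    ("carol", ["#election"]),
    ("alice", ["#election"]),
]

volume = defaultdict(int)
users = defaultdict(set)
for user, tags in tweets:
    for tag in tags:
        volume[tag] += 1
        users[tag].add(user)

# Impact = mentions * distinct users; a crude proxy for reach.
chart = sorted(volume, key=lambda t: volume[t] * len(users[t]), reverse=True)
for rank, tag in enumerate(chart, start=1):
    print(rank, tag, "mentions:", volume[tag], "users:", len(users[tag]))
```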

The above examples show how the analysis of social media is producing valuable research. If you are a researcher working with social media, please complete the survey so that your views can be represented in the report.

This post also appears on the Research Data Management blog.


Progress Meeting – 17 June 2015

On Wednesday 17 June, the UK projects funded under round 3 of the Digging into Data Challenge gathered at Paddington for the mid-term progress meeting. The workshop gave projects the opportunity to present not just their progress but also their highlights, issues and challenges, and to share this information with the funders and the other projects.

Rather than a full day of listening to presentations, the workshop was split into two parts. First, after my introduction to the day, each project gave a 10-minute presentation followed by 5 minutes of questions. The second part was more workshop focussed: discussing generic issues and challenges, and hearing about the projects’ future plans, both for the second half of the project and post-funding.

You can read my notes on each project’s presentation, followed by a summary of the later discussion, on the Progress Meeting page. With 9 project presentations there’s a fair amount to read through. The slides are available in the Jisc Repository under the event ‘Digging into Data 3 Progress Meeting’.

 


An AHRC Perspective on the Big Humanities Data Workshop

In the UK, Digging into Data phase 3 is funded by AHRC, ESRC and Jisc. Over the next few months each funder will be writing a blog post relevant to Digging into Data. Last October, Christie Walker from AHRC attended the Big Humanities Data Workshop in the USA and she has written the following post about the workshop.

The second Big Humanities Data Workshop took place on 27 October 2014 at the IEEE International Conference on Big Data in Washington D.C. The workshop was attended by a number of academics and funders, including AHRC from the UK, the National Endowment for the Humanities and the Institute of Museum and Library Services from the US, and the Social Sciences and Humanities Research Council from Canada.

The workshop began with an interesting keynote from Michael Levy (Director of Digital Collections) and Michael Haley Goldman (Director of Global Classroom and Evaluation) of the United States Holocaust Memorial Museum. Levy and Haley Goldman spoke about the opportunities that big humanities data, together with new techniques and tools, can provide in Holocaust research and education.

The workshop papers covered several themes:

  • Complexity / Scale / Historical Analysis
  • News / Film
  • Frameworks / Infrastructure
  • Geospatial / Mobile
  • Digging into Data

A total of 16 papers were presented at the workshop, and Digging into Data had a strong presence with 7 papers selected. The Digging into Data presentations represented a variety of methods, data types and challenges for the arts, humanities and social sciences:

  • ‘Mining Microdata: Economic Opportunity and Spatial Mobility in Britain and the United States, 1850-1881’ (DiD round 2), presented by Evan Roberts – University of Minnesota
  • ‘Understanding the Role of Medical Experts during a Public Health Crisis: Digital Tools and Library Resources for Research on the 1918 Spanish Influenza’, presented by Tom Ewing – Virginia Tech (An Epidemiology of Information: Data Mining the 1918 Influenza Pandemic, DiD round 2)
  • ‘Scaled Entity Search: A Method for Media Historiography and Response to Critiques of Big Humanities Data Research’, presented by Kit Hughes – University of Wisconsin (Project Arclight: Analytics for the Study of 20th Century Media, DiD round 3)
  • ‘A Computational Pipeline for Crowdsourced Transcriptions of Ancient Greek Papyrus Fragments’, presented by James Brusuelas – University of Oxford (Resurrecting Early Christian Lives: Digging in Papyri in a Digital Age, DiD round 3)
  • ‘Scientific Findings as Big Data for Research Synthesis: The metaBUS Project’, presented by Frank Bosco – Virginia Commonwealth University (Field Mapping: An Archival Protocol for Social Science Research Findings, DiD round 3)
  • ‘Metadata Infrastructure for the Analysis of Parliamentary Proceedings’, presented by Richard Gartner – King’s College London (Digging into Linked Parliamentary Data, DiD round 3)
  • ‘Integrating Data Mining and Data Management Technologies for Scholarly Inquiry’ (DiD round 2), presented by Richard Marciano – University of Maryland

The workshop concluded with a Funders panel and discussion chaired by Professor Andrew Prescott (University of Glasgow). Brett Bobley (NEH), Bob Horton (IMLS), Crystal Sissons (SSHRC) and Christie Walker (AHRC) discussed their organisations’ approach to big data and funding more generally.

The Big Humanities workshop is unique in that it takes place against the backdrop of a very technical big data conference. It highlights to both workshop participants and the wider IEEE Big Data conference that the arts, humanities and social sciences have a great deal to bring to the conversation about big data, and that these disciplines bring their own big data challenges to the table. The workshop generated a great deal of interesting discussion, both in the session and beyond.

 


Things that make you go “um”

Image courtesy Post Typography/The Atlantic

The Trees and Tweets project is once again in the news. This time the project features in this article in The Atlantic – Things that make you go “um”.

The article discusses how the linguists on the Trees and Tweets project team analysed Twitter data to learn how men and women from different US regions use filler words like “um” and “uh”.

You can find out more about the project at the Trees and Tweets Dialect Project Blog, and read about other media interest in this analysis in this previous post.


Digging into Data Phase 2 Projects – Summary and Reports

The Digging into Data Challenge aims to address how “big data” changes the research landscape for the humanities and social sciences. In particular, the four goals of the initiative are:

  • to promote the development and deployment of innovative research techniques in large-scale data analysis that focus on applications for the humanities and social sciences;
  • to foster interdisciplinary collaboration among researchers in the humanities, social sciences, computer sciences, library, archive, information sciences, and other fields, around questions of text and data analysis;
  • to promote international collaboration among both researchers and funders;
  • to ensure efficient access to and sharing of the materials for research by working with data repositories that hold large digital collections.

The Challenge is currently in its third round, but reports from the projects funded in round two are now available for download from the Jisc repository. This post summarises those projects that have a UK partner institution, drawing on their final reports. A link to each final report in the repository follows each project’s summary.

Cascades, Islands, or Streams?
The objective of this project was to create and examine large-scale heterogeneous datasets in order to increase understanding of the scholarly communication system, to identify and analyse the scholarly activities involved in creating and disseminating new knowledge, and to further develop the innovative software built at the University of Wolverhampton to collect, filter and analyse data from the web and social media, discovering trends in science and in scholarly communication. The project’s results argue that transformations in the scholarly communication system affect not only how scholars interact, but also the very substance of these communications, at least in some cases, as the audience is no longer just other researchers but the general public.
Final Report

ChartEx
ChartEx research focussed on extracting information from charters using a combination of natural language processing (NLP) and data mining (DM) to establish entities such as locations and the related actors, events and dates. The third crucial component of the ChartEx project was the use of novel instrumental interaction techniques to design a virtual workbench (VWB) that allows researchers both to refine the NLP and DM processing and to directly manipulate (visualise, confirm, correct, refine, augment, hypothesise) the relationships extracted from a set of charters, gaining new insights about the entities they contain.
Final Report
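
As an illustration of the entity-extraction step described above, the sketch below runs an off-the-shelf named entity recogniser over a charter-like sentence. Real charters are in Latin or Middle English and ChartEx used its own processing; spaCy’s stock English model is used here purely as a stand-in (install with `pip install spacy`, then `python -m spacy download en_core_web_sm`).

```python
# Illustrative NER over a charter-like sentence; not the ChartEx pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")
charter = ("William son of Robert grants to the church of St Mary in York "
           "one toft in Micklegate, in the year 1230.")

# Entity labels vary by model; expect e.g. PERSON, GPE, DATE.
for ent in nlp(charter).ents:
    print(ent.text, "->", ent.label_)
```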

DiggiCORE
According to the project team, “Working on the DiggiCORE project was a truly amazing experience.” Its goal was to aggregate, at the level of both metadata and content, a vast set of research publications from institutional repositories and archives (green OA route) and journals (gold OA route) worldwide, and to provide novel tools for automatically enriching this content with relationships (relatedness, citations). The project provided the following outputs:

  • A software infrastructure delivered to users as a free web service and as a downloadable dataset that enables the analysis of the behaviour of research communities in the Open Access domain;
  • New knowledge and understanding resulting from the data analysis.

Final Report
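
DiggiCORE’s relatedness enrichment is its own system; as a simple stand-in for the idea of scoring “relatedness” between publications, this sketch compares invented abstracts with TF-IDF cosine similarity (`pip install scikit-learn`).

```python
# Toy relatedness scoring between paper abstracts; not DiggiCORE's method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "Open access repositories and citation behaviour in physics.",
    "Citation patterns across open access and subscription journals.",
    "Counterpoint in sixteenth-century vocal polyphony.",
]

# Pairwise similarity matrix; the two citation-themed abstracts score highest.
matrix = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
print(cosine_similarity(matrix).round(2))
```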

Digging by Debating
The Digging by Debating project aimed to extract, model, map and visualise argument from a Big Data repository such as the Hathi Trust Digital Library. It tackled the very ambitious and complex problem of linking macro visual views of science-philosophy data, and state-of-the-art topic modelling and searching, to semantically rich, argument-based analysis and processing of the data. It made significant steps forward in these areas and their interconnection, producing a constellation of loosely integrated tools and methodologies. Ultimately the project’s efforts show how computational humanities and linguistics can bridge the gulf between the “big data” perspective of first-generation digital humanities and the close readings and critical interpretations of text that are the “bread and butter” of more traditional scholarship.
Final Report

Digging into Metadata
This project aimed to closely examine the metadata associated with the chosen datasets and to enhance it through a variety of automatic, scalable techniques built on previous collaborative work. The enhanced metadata was intended to enable improved search across disparate digital libraries whose subject metadata varied hugely in level and standard, and which would previously have been difficult to search in a consistent way. The team aimed to show, firstly, that their techniques could enhance poor or inconsistent metadata in a meaningful and consistent way and, secondly, that this enhanced metadata could lead to improved search functionality that adds value for end users.
Final Report

ELVIS
ELVIS stands for Electronic Locator of Vertical Interval Successions and is a large data-driven research project on musical style. The central unifying concept of the ELVIS project was the study of counterpoint: how combinations of voices in polyphonic music interact, e.g. the soprano and bass voices in a hymn, or the viola and cello in a string quartet, as well as combinations of more than two voices. In other words, what are the permissible vertical intervals (notes from two voices sounding at the same time) for a particular period, genre or style? These vertical intervals, connected by melodic motions in individual voices, constitute vertical interval successions. In more modern terms this could be described as harmonic progressions of chords, but what made ELVIS particularly flexible was its ability to bridge the gap to earlier, contrapuntally conceived music by using the diad (a two-note combination) rather than the triad (a combination of three notes in particular arrangements) as its basis, since triads and beyond may be expressed as sums of diads. Existing data, while numerous, were somewhat messy, with many duplications, errors and gaps in certain areas of music history, so one task was consolidating and cleaning the data both by hand and with newly developed error-correction software.

Altogether, the ELVIS project has enabled not only the consolidation of data and toolsets, but the creation of concrete research output on a previously difficult level. The resulting databank and tools available through the main website at McGill University will prove an invaluable resource to musicologists in this field in the years to come.
Final Report
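
The core data structure is easy to sketch. Assuming two aligned voices represented as MIDI pitch numbers (an assumption for illustration; ELVIS works from symbolic scores), each vertical interval is the distance in semitones between simultaneous notes, and the succession is the ordered list of those diads:

```python
# A minimal sketch of a vertical interval succession; illustrative only.
soprano = [72, 74, 76, 77]  # C5 D5 E5 F5 as MIDI numbers
bass    = [60, 59, 57, 53]  # C4 B3 A3 F3

# Vertical interval (in semitones) at each time step: the diad.
succession = [s - b for s, b in zip(soprano, bass)]
print(succession)            # [12, 15, 19, 24]

# Melodic motion within one voice is what connects consecutive diads.
soprano_motion = [b - a for a, b in zip(soprano, soprano[1:])]
print(soprano_motion)        # [2, 2, 1]
```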

Imagery Lenses for Visualising Text Corpora
A team of computer scientists, a linguist and poet-scholars from the University of Oxford and the University of Utah has been working to create, through computation and visualisation, a richer understanding of how poems work: one that relies on computational tools yet embraces both qualitative and quantitative components, and explicitly engages human readers, perspectives and research needs specific to the humanities in general and to literature, especially poetry, in particular. The resulting tool, Poem Viewer, approaches poems as complex dynamic systems and represents a significant step toward giving literary scholars the freedom to explore individual poems, bodies of poetry and other texts of their choosing in ways that traditional scholarship and other text-analysis software cannot. In addition to displaying familiar poetic features, such as texts, word frequencies, grammatical classes and sentiment, Poem Viewer provides a unique capability for visualising poetic sound, including various sonic relationships and changes as they occur in a poem over time.
Final Report
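
Poem Viewer’s sound analysis works at the phonetic level; the crude letter-level sketch below only gestures at the idea, extracting each line’s vowel sequence so that recurring patterns (a rough proxy for assonance) become visible.

```python
# Crude letter-level proxy for sonic patterning; not Poem Viewer's method,
# which uses proper phonetic representations.
import re

poem = [
    "So much depends",
    "upon a red wheel",
    "barrow",
]

for line in poem:
    vowels = "".join(re.findall(r"[aeiou]", line.lower()))
    print(f"{line!r:<20} -> {vowels}")
```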

Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
Research on integrating digital library content with computational tools and services has been concerned with examining, analysing, and finding patterns within a data set. Scholars, on the other hand, associate the people, places and events mentioned in texts with other descriptions elsewhere. Thus, while most computational analysis looks inward to the contexts of a particular set of data, scholars tend to look outward, seeking the context for the texts they are studying.

This project went beyond that basic analysis with a prototype system developed to give expert support to scholars in their work. The system integrated large-scale collections, including JSTOR and the Internet Archive’s book collections, stored and managed in a distributed preservation environment. It also incorporated text mining and natural language processing software capable of generating dynamic links to related resources discussing the same persons, places and events.
Final Report
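
A minimal sketch of this “looking outward” idea: index documents by the entities they mention, then link any document to others discussing the same persons or places. The entity sets and identifiers below are hand-written for illustration; the project derived them with NLP.

```python
# Toy entity-based cross-linking; illustrative stand-in for the prototype.
from collections import defaultdict

# Hypothetical documents and the entities mentioned in each.
docs = {
    "jstor:101":  {"Abraham Lincoln", "Gettysburg"},
    "archive:42": {"Gettysburg", "Pennsylvania"},
    "jstor:205":  {"New Orleans"},
}

index = defaultdict(set)
for doc_id, entities in docs.items():
    for ent in entities:
        index[ent].add(doc_id)

def related(doc_id):
    """Other documents sharing at least one entity with doc_id."""
    hits = set()
    for ent in docs[doc_id]:
        hits |= index[ent]
    return hits - {doc_id}

print(related("jstor:101"))  # {'archive:42'}
```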

ISHER
The purpose of the ISHER (Integrated Social History Environment for Research) project has been to apply automated text mining methods to large historical data sources, demonstrating how these can transform the working methods of researchers by providing an accurate and efficient means of locating and exploring information of interest with minimum effort. The project has had a particular focus on detecting information relating to social unrest, although some of the systems and methods described are more widely applicable to other information of interest to social historians. The project partners applied sophisticated text mining methods to digitised news collections, i.e. the New York Times (NYT) archive and the National Library of the Netherlands (KB) daily Dutch newspapers archive, together with news reports and related discussions comprising the Automatic Content Extraction (ACE) 2005 evaluation corpus.

A concrete demonstration of the project’s overall success comes in the form of two fully functional web-based interfaces providing access to the above archives. Each offers sophisticated features for searching and browsing the collections based on the output of text mining analyses: one is an NYT search interface, the other visualises and links strikes. Work was also carried out on the partners’ existing text mining frameworks to increase the interoperability of components, which enabled, for example, UIUC natural language processing tools to be composed with NaCTeM tools in workflows to process NYT data.
Final Report
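
ISHER’s text mining is far more sophisticated, but the baseline it improves on can be sketched in a few lines: flag newspaper snippets mentioning unrest-related terms, keeping dates so the hits could feed a strike timeline. The articles and term list below are invented.

```python
# Naive keyword-based unrest detection; a baseline, not ISHER's method.
import re

UNREST = re.compile(r"\b(strike|walkout|picket|lockout|riot)s?\b", re.IGNORECASE)

articles = [
    ("1926-05-04", "General strike paralyses London transport."),
    ("1926-05-05", "Markets steady despite uncertainty."),
    ("1926-05-06", "Pickets gather outside the docks."),
]

for date, text in articles:
    match = UNREST.search(text)
    if match:
        print(date, "->", match.group(0).lower(), "|", text)
```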

Mining Microdata
This project investigates levels of social mobility in Canada, Great Britain and the United States from 1850 to 1911.

  • It uses census records from the 1850s, 1880s, and 1910s to create two panels of men observed in childhood living with their father, and then thirty years later in adulthood.
  • It measures social mobility by comparing fathers’ and sons’ occupations at similar points in their lives.

Further information is available from their project website – http://www.miningmicrodata.org/. The final report will be available once the project completes in January 2015.
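
A toy version of the panel construction described above: link a boy in one census to the same man thirty years later by name and birth year, then compare his occupation with his father’s. Real historical record linkage must cope with name variants and transcription errors; the exact matching and the records here are purely illustrative.

```python
# Toy census record linkage across a thirty-year gap; illustrative only.
boys_1880 = [
    {"name": "John Smith",  "birth_year": 1870, "father_occ": "farmer"},
    {"name": "Thomas Hart", "birth_year": 1869, "father_occ": "labourer"},
]
men_1910 = [
    {"name": "John Smith",  "birth_year": 1870, "occ": "clerk"},
    {"name": "Thomas Hart", "birth_year": 1869, "occ": "labourer"},
]

# Link on (name, birth year); real projects use fuzzier matching.
lookup = {(m["name"], m["birth_year"]): m for m in men_1910}
for boy in boys_1880:
    man = lookup.get((boy["name"], boy["birth_year"]))
    if man:
        mobile = man["occ"] != boy["father_occ"]
        print(boy["name"], boy["father_occ"], "->", man["occ"],
              "(mobile)" if mobile else "(immobile)")
```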

Trading Consequences
Scholars interested in nineteenth-century global economic history face a voluminous historical record. Conventional approaches to primary source research on the economic and environmental implications of globalised commodity flows typically restrict researchers to specific locations or a small handful of commodities. By taking advantage of cutting-edge computational tools, the project was able to address much larger data sets for historical research, thereby providing historians with the means to develop new data-driven research questions. In particular, this project has demonstrated that text mining techniques applied to tens of thousands of documents about nineteenth-century commodity trading can yield a novel understanding of how economic forces connected distant places all over the globe and how efforts to generate wealth from natural resources impacted on local environments.

The large-scale findings that result from the application of these new methodologies would be barely feasible using conventional research methods. Moreover, the project vividly demonstrates how the digital humanities can benefit from trans-disciplinary collaboration between humanists, computational linguists and information visualisation experts.
Final Report

Further information on all three phases of the Digging into Data Challenge can be found on the programme page on the Jisc website and the main Digging into Data website.


New Digging into Data website

The new Digging into Data website (http://www.diggingintodata.org) is up and running. It’s been rebuilt from scratch and I am reliably informed that it’s mobile friendly.

This is a timely release, as Monday (27 October) sees the start of the IEEE International Conference on Big Data. The PIs of seven Digging into Data projects will be presenting papers at the Big Humanities Workshop. AHRC, one of the UK funders, will be in attendance and has volunteered to write about the workshop for this blog.


Trees and Tweets in the media

Although the phase 3 projects have only recently started, some have already attracted interest from the media. The Trees and Tweets project is conducting an analysis of dialect variation based on a corpus of billions of tweets, and an analysis of migration patterns based on a dataset of millions of family trees.

At the Methods in Dialectology XV conference in Groningen, the Netherlands, the Project Manager for Trees and Tweets, Jack Grieve, presented some of the first results of the study. He used the data to illustrate the application of some advanced spatial methods for dialectology and produced some quick maps for the popular linguistics blog Language Log. The maps show significant geographical variation in the use of “um” and “uh” across the USA – a good example of analysis that is only possible with this type of data.

Following on from this blog post, qz.com produced the following article on the results: Um, here’s an, uh, map that shows where Americans use “um” vs. “uh”.

You can find out more about the project at the Trees and Tweets Dialect Project Blog.


Digging into Data phase two evaluation ITT

Jisc is seeking to commission a robust and independent formative evaluation report to help guide the future direction of the Digging into Data Challenge.

Jisc (on behalf of Jisc, ESRC and AHRC) invites tenders for an evaluation of the Digging into Data Challenge focussing on phase 2 and emergent lessons from phase 3.

The aim of the evaluation is to produce a report which will:

  1. Evaluate the objectives of DiD2 and explore whether they have been met through the DiD2 projects;
  2. Assess whether the recommendations from the CLIR report have been delivered through the DiD2 projects and are likely to be met through DiD3 and its current cohort of projects;
  3. Capture the advantages gained and lessons learned from projects to date (DiD2 and DiD3) including examples of the benefits of international collaboration;
  4. Look forward to phase four of the Digging into Data Challenge and how this can fit in with the strategy and requirements of ESRC, AHRC and Jisc, including analysing how a fourth phase could relate to other investments in the big data and analytics area that are underway or planned to support education and research.

The deadline for tenders is 12 noon UK time on 6 October 2014.

The work under this contract should commence on or around 27 October 2014 and should be completed by 23 January 2015. It is expected that this work will require up to 50 days’ effort.

To download the ITT (in PDF format) visit the DiD2 Evaluation page. For more information about the challenge and projects, see the Digging into Data programme page.

 


Welcome

Welcome to the Digging into Data Challenge Phase 3 blog.

On January 15, 2014 ten international research funders from four countries jointly announced the winners of the third Digging into Data Challenge, a competition to develop new insights, tools and skills in innovative humanities and social science research using large-scale data analysis.

Fourteen teams representing Canada, the Netherlands, the United Kingdom and the United States received grants to investigate how computational techniques can be applied to “big data”, changing the nature of humanities and social sciences research. Each team represents a collaboration among scholars, scientists and information professionals from leading universities and libraries in Europe and North America.

On the main Digging into Data Challenge website you will find information about all three phases and the projects funded in each. This includes details of all 14 international projects funded under phase 3, one requirement of which is that they involve international collaboration. Of these 14 projects, 9 are led by UK institutions.

The purpose of this blog is to provide details, news and updates pertaining to these UK projects. Details of all 9 UK led projects are available on the Projects menu. For information about programme meetings, such as the programme start-up meeting, see the Meetings menu.

The funding for the UK institutions comes from the Arts and Humanities Research Council (AHRC) and the Economic and Social Research Council (ESRC). Jisc contributes to this phase by providing programme management and support.

If you would like to find out more about phases 1 and 2, please see the main Digging into Data Challenge website and the Jisc programme page.

 
