Project Title | Trees & Tweets: Mining Billions to Understand Human Migration and Regional Linguistic Variation |
Project Website | https://sites.google.com/site/jackgrieveaston/treesandtweets |
Start Date | May 2014 |
End Date | April 2016 |
UK Project Manager | Jack Grieve, Aston University, j.grieve1@aston.ac.uk |
Project Team | Jack Grieve, Andrea Nini |
Lead Institution | Aston University (www.aston.ac.uk) |
Project Partners | Diansheng Guo, University of South Carolina (www.sc.edu) |
Project Plan | http://repository.jisc.ac.uk/5655/ |
Progress Report | http://repository.jisc.ac.uk/6050/ |
Summary
This project is a joint effort between Aston University in the United Kingdom and the University of South Carolina in the United States. The overall goal of the project is to map dialect variation and human migration patterns in the United Kingdom and the United States and to understand the extent to which migration patterns explain regional linguistic variation. The team at Aston University will focus on the analysis of dialect variation based on a corpus of billions of tweets, while the team at the University of South Carolina will focus on the analysis of migration patterns based on a dataset consisting of millions of family trees. Both halves of the project are based on very large datasets whose size and complexity necessitate the application and development of advanced techniques for data mining, data visualisation, natural language processing, and spatial analysis.
The team at Aston University will analyse regional dialect variation in a corpus of billions of geocoded and time-stamped tweets from the UK and the US, posted over the course of 2013. Specifically, the analysis will focus on regional variation and change in vocabulary in modern British and American English. Two related lines of research will be pursued. First, a wide variety of standard lexical alternations (e.g. among/amongst, pail/bucket, lift/elevator) will be measured and mapped across both nations to produce modern lexical dialect atlases for both varieties of English. Second, the geographic spread of new vocabulary items will be analysed by identifying and mapping how new words spread across and between the two nations.
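As a rough illustration of what measuring a lexical alternation can involve, the sketch below computes, for each region, the share of one variant among all uses of either variant. It is a minimal example under stated assumptions and not the project's actual method: the function name `variant_proportions`, the region labels, and the sample tweets are all hypothetical, and the tokenisation is deliberately naive.

```python
import re
from collections import Counter, defaultdict

# Naive tokeniser: lowercase alphabetic strings (an assumption for illustration).
TOKEN_RE = re.compile(r"[a-z']+")


def variant_proportions(tweets, variant_a, variant_b):
    """Return, per region, the share of variant_a among all uses of either variant.

    `tweets` is an iterable of (region, text) pairs. Regions in which neither
    variant occurs are left out of the result.
    """
    counts = defaultdict(Counter)
    for region, text in tweets:
        for token in TOKEN_RE.findall(text.lower()):
            if token in (variant_a, variant_b):
                counts[region][token] += 1

    proportions = {}
    for region, c in counts.items():
        total = c[variant_a] + c[variant_b]
        proportions[region] = c[variant_a] / total
    return proportions


# Hypothetical sample data, invented purely for this example.
sample_tweets = [
    ("Region A", "Stuck in the lift again"),
    ("Region A", "The lift is broken so I took the stairs"),
    ("Region A", "Why is the elevator so slow"),
    ("Region B", "Elevator music is underrated"),
    ("Region B", "Waiting for the elevator"),
]

proportions = variant_proportions(sample_tweets, "lift", "elevator")
```

Mapping such per-region proportions would give one layer of a lexical atlas. A real analysis would also have to handle words with several senses (for example, "lift" used as a verb), which this sketch ignores.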
Finally, the migration patterns identified by the team in South Carolina will be compared to the patterns of regional linguistic variation and change identified by the team at Aston. Dialect regions are generally assumed to be sensitive to historic and contemporary migration patterns; by combining these two sources of big data, the project will be able to examine this assumed relationship directly.
Objectives
The main objective of the project is to gain a better understanding of regional linguistic variation and human migration patterns, and of the relationship between them, in both the United Kingdom and the United States through the quantitative analysis of big data.
Anticipated Outputs and Outcomes
The anticipated outputs and outcomes are a series of presentations at leading international linguistics and geography conferences and articles in leading journals in both fields. In addition, we plan to create an interactive website that will allow both researchers and the public to access vocabulary dialect atlases for the United Kingdom and the United States.