Trees and Tweets

Project Title: Trees & Tweets: Mining Billions to Understand Human Migration and Regional Linguistic Variation
Project Website
Start Date: May 2014
End Date: April 2016
UK Project Manager: Jack Grieve, Aston University
Project Team: Jack Grieve, Andrea Nini
Lead Institution: Aston University
Project Partners: Diansheng Guo, University of South Carolina
Project Plan
Progress Report


This project is a joint effort between Aston University in the United Kingdom and the University of South Carolina in the United States. Its overall goal is to map dialect variation and human migration patterns in the United Kingdom and the United States, and to understand the extent to which migration patterns explain regional linguistic variation. The team at Aston University will focus on the analysis of dialect variation based on a corpus of billions of tweets, while the team at the University of South Carolina will focus on the analysis of migration patterns based on a dataset consisting of millions of family trees. Both halves of the project are based on very large datasets whose size and complexity necessitate the application and development of advanced techniques for data mining, data visualisation, natural language processing, and spatial analysis.

The team at Aston University will analyse regional dialect variation in a corpus of billions of geocoded and time-stamped tweets from the UK and the US, tweeted over the course of 2013. Specifically, the analysis will focus on regional variation and change in vocabulary in modern British and American English. Two related lines of research will be pursued. First, a wide variety of standard lexical alternations (e.g. among/amongst, pail/bucket, lift/elevator) will be measured and mapped across both nations to produce modern lexical dialect atlases for both varieties of English. Second, the geographic spread of new vocabulary items will be analysed by identifying new words and mapping how they spread across and between the two nations.
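As an illustration only, measuring a lexical alternation can be thought of as computing, for each region, the share of one variant out of all occurrences of either variant. The sketch below is not the project's actual pipeline: the function name `variant_proportions`, the `(region, text)` input format, and the simple regular-expression tokeniser are all assumptions made for this example, and it ignores word-sense ambiguity (for instance, "lift" used as a verb).

```python
import re
from collections import Counter, defaultdict

# Alternation pairs taken from the examples in the text above.
ALTERNATIONS = [("among", "amongst"), ("pail", "bucket"), ("lift", "elevator")]


def variant_proportions(tweets, alternations=ALTERNATIONS):
    """Hypothetical helper: for each region, return the proportion of the
    first variant in each pair, out of all occurrences of either variant.

    `tweets` is an iterable of (region, text) pairs (an assumed format).
    Pairs with no occurrences in a region are omitted for that region.
    """
    counts = defaultdict(Counter)
    for region, text in tweets:
        # Naive tokenisation: lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts[region].update(tokens)

    result = {}
    for region, region_counts in counts.items():
        result[region] = {}
        for first, second in alternations:
            total = region_counts[first] + region_counts[second]
            if total:
                result[region][(first, second)] = region_counts[first] / total
    return result


sample = [
    ("UK", "Amongst friends in the lift"),
    ("UK", "among us"),
    ("US", "took the elevator among friends"),
]
proportions = variant_proportions(sample)
```

Mapping these per-region proportions, rather than raw counts, keeps regions with very different tweet volumes comparable.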

Finally, the migration patterns identified by the team in South Carolina will be compared to the patterns of regional linguistic variation and change identified by the team at Aston. Dialect regions are generally assumed to be sensitive to historic and contemporary migration patterns; by combining these two sources of big data, the project will be able to examine this relationship directly.


The main objectives of the project are to gain a better understanding of regional linguistic variation and human migration patterns and their relationship in both the United Kingdom and the United States through the quantitative analysis of big data.

Anticipated Outputs and Outcomes

The anticipated outputs and outcomes are a series of presentations at leading international linguistics and geography conferences and articles in leading journals in both fields. In addition, we plan to create an interactive website that will allow both researchers and the public to access vocabulary dialect atlases for the United Kingdom and the United States.
