On Wednesday 17th June, the UK projects funded under round 3 of the Digging into Data Challenge gathered at Paddington for the mid-term progress meeting. This workshop gave projects the opportunity to present not just their progress but also highlights, issues and challenges, and to share this information with the funders and the other projects.
Rather than have a day of listening to presentations, the workshop was split into two parts. First, after my introduction to the day, projects gave 10-minute presentations followed by 5 minutes of questions. The second part was more workshop focussed, discussing generic issues and challenges and hearing about the projects’ future plans, both for the second half of the project and post-funding.
The following are my notes on each project’s presentation, followed by a summary of the later discussion. As there are 9 project presentations there’s a fair amount to read through. The slides are available in the Jisc Repository under the event ‘Digging into Data 3 Progress Meeting’.
Presentations
DiLiPad – Jonathan Blaney
Working with partners in the Netherlands and Canada on parliamentary data, they are producing tools to find patterns (linguistic features) in the data. One of the main issues has been the difficulty of communicating with international collaborators. They use Google Hangouts, but this has not proved ideal and communication was a problem, particularly in the early stages of the project. Eight years of UK data has already been marked up and they are now looking at the more difficult data. It’s been stimulating for everyone on the project and they are ahead of plan. One highlight has been two researchers (one in the UK and one in the Netherlands) researching women in parliament, something that wasn’t in the plan but the sort of research they can now perform because gender has been marked up for the first time. This has resulted in a draft paper about to be submitted to a humanities journal. Another highlight has been researching how individual MPs contributed to debates, for example Tom Driberg and whether he contributed to any debates on homosexuality. The work they are doing puts new forms of data in the hands of historians. A recurring issue across many projects is sustainability: how the tools developed and the data can remain available once funding is over. They are currently looking at how this can be resolved.
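To give a flavour of what this mark-up makes possible, here is a minimal sketch that counts speeches by gender in a debate, assuming a hypothetical XML format with speaker and gender attributes (DiLiPad’s actual schema may well differ):

```python
# Count speeches by gender in marked-up parliamentary debates.
# The XML format here is hypothetical; DiLiPad's real schema may differ.
import xml.etree.ElementTree as ET
from collections import Counter

sample = """
<debate date="1979-05-09">
  <speech speaker="Margaret Thatcher" gender="female">...</speech>
  <speech speaker="James Callaghan" gender="male">...</speech>
  <speech speaker="Shirley Williams" gender="female">...</speech>
</debate>
"""

root = ET.fromstring(sample)
counts = Counter(s.get("gender") for s in root.iter("speech"))
print(counts)  # Counter({'female': 2, 'male': 1})
```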
Commonplace Cultures – Min Chen
They have a cross-disciplinary and international team based at Oxford and Chicago which includes experts such as Ian Foster. Issues have centred on recruiting an RA and on part-time working. Funding has not been well synchronised, resulting in partners working to different timeframes. They have communicated via a mix of virtual and face-to-face meetings, with brainstorming sessions early on followed by design meetings to look at requirements and software. A three-day meeting to test the software and make quick fixes proved useful. The software was evaluated at a meeting in Oxford and released in March. The work involves using machine learning techniques with French text. The tool looks at similarities between two pieces of text and is very context sensitive. They have created a programming interface for humanities scholars – for word, language and visual processing.
This simple interface allows them to drop items into boxes and click buttons to run the model used for detection. After an hour, humanities scholars don’t realise they are actually writing programs. The software is online and in beta at www.ovii.org/vita.
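The underlying task – spotting shared passages between two texts – can be illustrated with a much simpler sequence-matching sketch. This is only an illustration of the problem, not the project’s machine learning method:

```python
# Find passages shared between two texts using simple sequence matching.
# Illustrative only: the Commonplace Cultures tool uses machine learning
# and is far more context sensitive than this.
from difflib import SequenceMatcher

text_a = "il faut cultiver notre jardin disait candide a la fin du conte"
text_b = "la morale du conte est qu il faut cultiver notre jardin"

matcher = SequenceMatcher(None, text_a.split(), text_b.split())
for block in matcher.get_matching_blocks():
    if block.size >= 3:  # report runs of three or more shared words
        print(" ".join(text_a.split()[block.a:block.a + block.size]))
```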
DID-ACTE (Chinese Texts) – Brent Ho
A collaboration between Harvard, Leiden and Birmingham to develop a system that takes 2,000 years of data and text from Chinese local gazetteers and transforms it into structured data. In the first phase they have focussed on system development; in the second they will use, test and improve it. Users can upload text to the system, or copy it in, and then mark up the text. The open-source platform (MARKUS) has been designed to be simple to use and allows users to apply sophisticated text-mining techniques to a wide variety of historical and literary texts. It provides tools to help researchers search using expressions, and the marked-up data can be exported for further processing. In the next year there will be additional activities: developing research tools, further data extraction and workshops. To help users, lots of tutorial videos have been developed; these are short and each focusses on one particular action. The project has been promoted via social media and the website. Successes have included bringing in more collaborators and a broader audience, as well as sharing expertise with other projects. Again, sustainability is an issue. How can the project be maintained post-funding when the number of potential users is small and the datasets are in classical Chinese, which isn’t widely used? If they don’t find a way to keep it going they will put it up on a free website.
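A minimal sketch of the kind of mark-up involved, using a regular expression built from a hypothetical reference list of place names (MARKUS itself is far richer than this):

```python
# Tag known entity names in a text with XML-style markup.
# A toy illustration of the kind of mark-up MARKUS automates;
# the platform itself offers far richer reference lists and tooling.
import re

place_names = ["Hangzhou", "Suzhou", "Nanjing"]  # hypothetical reference list
pattern = re.compile(
    "|".join(map(re.escape, sorted(place_names, key=len, reverse=True)))
)

text = "The magistrate travelled from Hangzhou to Suzhou in the spring."
marked_up = pattern.sub(lambda m: f"<placeName>{m.group(0)}</placeName>", text)
print(marked_up)
```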
MIRACLE – Lorenzo Milazzo
A collaboration between the UK, the Netherlands, the USA and Canada, this project is using agent-based modelling to explore dynamic feedbacks between human resource use and the biophysical environment. The objective is to share knowledge (creating a shared repository) and to process and analyse the data. In the UK they have, so far, built a data model for the simulation – a metadata standard for social simulation. They will propose the standard to the community later on. The Social Simulation Repository Interface library and API have been developed. Having defined the data model, they will now process the data and populate the database. They are using the FEARLUS-SPOMM (FS) model to explore land managers’ responses to agri-environmental incentive policies and their impact on landscape-scale species diversity, with an FS simulation pipeline to run the simulation, process the data and output it to a complex file. In the next phase they will further extend and upgrade the metadata standard for social simulations, investigate the possibility of using an agent-based approach to implement a (semi-)automated system for processing and analysis, and promote the work via reports, publications and presentations.
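To give a flavour of agent-based modelling in this setting, here is a toy sketch (not FEARLUS-SPOMM) in which land-manager agents adopt an environmentally friendly practice once an incentive payment exceeds their individual threshold:

```python
# A toy agent-based model of land managers responding to an incentive.
# Purely illustrative; FEARLUS-SPOMM is far richer than this sketch.
import random

random.seed(42)

class LandManager:
    def __init__(self):
        # Each agent needs a different minimum payment before it will
        # switch to the environmentally friendly practice.
        self.threshold = random.uniform(0, 100)
        self.eco_friendly = False

    def step(self, incentive):
        self.eco_friendly = incentive >= self.threshold

agents = [LandManager() for _ in range(500)]
for year, incentive in enumerate([10, 30, 50, 70, 90], start=1):
    for agent in agents:
        agent.step(incentive)
    uptake = sum(a.eco_friendly for a in agents) / len(agents)
    print(f"year {year}: incentive {incentive}, uptake {uptake:.0%}")
```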
Early Christian Lives – James Brusuelas (recorded)
Unfortunately, no one was able to attend from the project, but James sent a recording of his presentation.
They have been putting the Coptic alphabet into their Ancient Lives system. Getting it into the Zooniverse system hasn’t been too difficult, but they have had to redesign it as the Juggernaut platform is being retired; they took the opportunity to rebuild and upgrade it to look more like a modern Zooniverse system. The Ancient Lives interface has been tested with users. As part of the redesign they have looked at the consensus algorithm, whose goal is to aggregate millions of user clicks into accurate consensus letter identifications. They have taken two approaches to producing it: kernel-based (Matlab, which takes a week and a half to process – too long) and stepwise (Python, which processes in 10-15 minutes). They have also adopted the BLAST (Basic Local Alignment Search Tool) genetic sequence alignment algorithm from bioinformatics (released in 1990 but still widely used today) to help accelerate papyrus identification. They’ve added the Greek alphabet and now Coptic. The system can take strings of text and test them against known text sequences for automated identification – a nice way of sifting through and identifying hundreds and thousands of fragments automatically. They have produced a Coptic sandbox for testing. Future work, for the remainder of the year, includes further testing of adaptable mining algorithms, new tasks for Ancient Lives users to help the mining process, launching the new Ancient Lives site with Coptic, running the new consensus algorithm and testing the mining tools.
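The consensus task can be illustrated with a small sketch: group clicks that fall close together on the page image and take the majority letter in each group. This is a simple greedy illustration, not the project’s kernel-based or stepwise algorithm:

```python
# Aggregate crowd-sourced letter clicks into consensus identifications.
# A simple greedy sketch of the idea, not the project's actual algorithm.
from collections import Counter

# (x, y, letter) clicks from different users; coordinates are in pixels.
clicks = [(102, 200, "α"), (98, 203, "α"), (100, 199, "λ"),
          (250, 310, "β"), (252, 308, "β")]

RADIUS = 15  # clicks within this distance are assumed to mark the same letter
clusters = []
for x, y, letter in clicks:
    for cluster in clusters:
        cx, cy = cluster["x"], cluster["y"]
        if (x - cx) ** 2 + (y - cy) ** 2 <= RADIUS ** 2:
            cluster["letters"].append(letter)
            break
    else:
        clusters.append({"x": x, "y": y, "letters": [letter]})

for c in clusters:
    letter, votes = Counter(c["letters"]).most_common(1)[0]
    print(f"({c['x']}, {c['y']}): {letter} ({votes}/{len(c['letters'])} votes)")
```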
Digging into Signs – Kearsy Cormier (recorded)
This was the second project with no one available for the workshop; they too produced a recorded presentation. This project is unique as, being a 12-month project, it had recently completed and submitted a final report.
A collaboration between UCL and Radboud University Nijmegen in the Netherlands. There are many sign languages all over the world, and while there are some dictionaries and grammatical descriptions of sign languages, most are based on very little data – yet quantitative linguistic analysis of any kind of language data relies on corpora (the British National Corpus of English, for example). The aim of this project was to create clear standards for addressing problems with sign language annotation, to test their reliability and validity, and to improve software tools that can help with workflow. Overall, these aims have been achieved. Highlights included a workshop hosted at the end of March, the release of the annotation standards at the end of May, and software tools to assist in annotation. The annotation standards were made available on the project website just after the project finished a few weeks ago, and the team are now working on making sure that the actual annotation files produced using these standards will be put online, along with release notes, within the next week or so. One change to the project was in the software tools. Initial plans to use a lexicon look-up tool called LEXUS changed: as the project progressed, they decided that LEXUS was not the right tool for them, largely because its developers decided not to continue supporting and developing it. They therefore moved their effort towards linking up with SignBank, an online dictionary system that was already available for the BSL Corpus and being developed for Corpus NGT anyway. Instead, they now have a look-up tool that links with the SignBank dictionary system to aid in the annotation of some types of signs. Though they did pilot the annotation standards with samples of both BSL and NGT data, as promised in the deliverables, they were not able to fully implement the standards across the full BSL and NGT video datasets, so this is work planned for the future.
Mining Biodiversity – Riza Batista-Navarro
This project includes several partners, including NaCTeM and US/Canada partners, working to transform the Biodiversity Heritage Library (BHL) into a next-generation social digital library. They are bringing together strengths from different disciplines, including semantic metadata, data visualisation, text mining, machine learning and social data. The BHL is a huge resource containing millions of pages from PDF files; most have already been OCR processed and it is open access. It’s an important resource, in particular because you need to know about a species before you can protect it.
It supports searching, but only keyword based at the moment, and provides lots of APIs. In the project they want to enhance the search system. What they have produced so far is a mock-up system. They’ve surveyed users to test the mock-up and, following the survey, have added faceted search, time-sensitive search and automatically generated questions based on searches. The first outcome has been developing algorithms to correct errors in the OCR. They’ve also developed a corpus and tools for semantic metadata generation, with annotations showing links between the semantic metadata.
It’s not been easy to develop these tools and they had to create their own corpus. They employed annotators to manually annotate text and used a crowd-sourcing approach to ask whether statements were valid or not. Partners have produced visualisations of terms. They still have to work on an annotated corpus and domain-adapted text mining tools. Challenges include finding expert curators (which is hard), realising only after bidding for the project that there was a need for high performance computing, and the time it takes to process the data. They are handling 60GB of data and need money for time on the cloud rather than on a server.
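The talk didn’t describe the OCR-correction algorithms themselves, but a common baseline approach can be sketched: compare each out-of-vocabulary token against a lexicon and substitute the closest match if it is close enough. A minimal illustration, assuming a tiny hypothetical lexicon:

```python
# Dictionary-based OCR post-correction: replace tokens that are not in
# the lexicon with their closest lexicon entry, if one is close enough.
# An illustrative sketch only, not Mining Biodiversity's actual method.
from difflib import get_close_matches

lexicon = {"the", "specimen", "was", "collected", "in", "borneo", "by"}

def correct(token):
    if token.lower() in lexicon:
        return token
    matches = get_close_matches(token.lower(), lexicon, n=1, cutoff=0.8)
    return matches[0] if matches else token

ocr_line = "The spec1men was c0llected in Borneo"
print(" ".join(correct(t) for t in ocr_line.split()))
# -> "The specimen was collected in Borneo"
```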
Trees and Tweets – Jack Grieve
A collaboration between Aston and South Carolina, this project has three main goals:
- Analyse and map regional lexical variation in modern British and American English based on a multi-billion word corpus of geo-coded Tweets.
- Analyse and map settlement patterns in US and UK based on a collection of geo-coded family trees.
- Compare dialect and migration patterns.
During the first year the first two goals have proved the more interesting. All the data comes from their US partner; they don’t gather data in the UK, and they aren’t involved in the migration analysis, which is moving more slowly than the other work. Within this project they have looked at a number of areas, including analysing tweets for the use of hesitation markers (“um” and “uh”) and words such as “dudes”. This work has generated quite a bit of press coverage, including with qz.com.
The main research they’ve been working on for the past year is developing methods for identifying newly emerging words in very large time-stamped corpora and using those methods to analyse how the usage of new words changes over time (a minimal sketch of the idea follows the list below). They submit their first paper this month. Graphs showed the frequency of words such as “and”, “strawberry” and “unbothered” over time. They also look at slang words and how they spread, which led to building a quiz for the NY Times on trendy terms you should know. The US team has also been analysing dialect variation in the Twitter data, including applying new methods for spatial analysis and replication. Ongoing research includes:
- mapping emergence: how words spread across the US.
- mapping lexical variation: 67K words. They can make the analysis available, but not the Twitter data itself. Nothing like this has existed before. They have maps for common words and are creating aggregated maps.
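As a minimal illustration of the emerging-word idea (not their published method): bin a time-stamped corpus by period, compute each word’s relative frequency per bin, and flag words whose frequency rises sharply. The counts below are hypothetical:

```python
# Flag candidate emerging words: words whose relative frequency rises
# monotonically across time bins. Illustrative only; the project's
# published method is more sophisticated.
from collections import Counter

# Hypothetical per-year token counts from a time-stamped corpus.
bins = {
    2012: Counter({"and": 9000, "unbothered": 2, "strawberry": 40}),
    2013: Counter({"and": 9100, "unbothered": 15, "strawberry": 38}),
    2014: Counter({"and": 8900, "unbothered": 90, "strawberry": 41}),
}

def rel_freq(counter, word):
    return counter[word] / sum(counter.values())

vocab = set().union(*bins.values())
for word in sorted(vocab):
    series = [rel_freq(bins[year], word) for year in sorted(bins)]
    rising = all(b > a for a, b in zip(series, series[1:]))
    if rising and series[-1] > 2 * series[0]:
        print(word, [f"{f:.2e}" for f in series])  # prints "unbothered"
```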
Challenges include:
- only now getting into the UK data
- family tree data is harder to get and a slower process
- they didn’t put enough money aside for processing power; they shifted money around, but a cloud solution would’ve been good.
MacMillan is interested in their work, and some advertisers are too, though in sentiment analysis, which isn’t what this project is about.
DADAISM – Christopher Power
Partners include the University of York, the University of Saskatchewan and the University of Amsterdam, working with the Archaeology Data Service to produce the next generation of tools for archaeologists. Current tools aren’t good: there are big problems with incomplete metadata, and archaeologists tend to dig, write the report and then no one can find the information. Their current focus has been on flint tools and Anglo-Scandinavian brooches, looking at human-computer interaction. They’ve completed 10 contextual enquiry interviews with a range of archaeologists and identified two distinct phases of work – identification and analysis. The responses were so consistent that after 7-8 interviews they had enough information. Search is necessary, but there’s a workflow to the process – maintaining personal collections and comparing items to items in larger datasets. They identified four different types of search to support: exact, group, reverse and non-artefact.
The interactive system design work produced 5 different personas to use in design. They’ve done some prototyping for basic search and re-orientated the traditional search interface. Archaeologists want to compare their archive to other archives across the world. In the contextual interviews a lot of individuals were trying to work from a reference image to find similar images, and finding via keywords can be difficult due to incomplete metadata. The dream system would let them put in their image and get back everything related to that image. They are getting high matches in comparisons – 90% accuracy (a toy sketch of image matching follows the challenges list below). Text mining has been a challenge given the large amount of grey literature and other documents. They are extracting a lot of information from search and should be able to test the system against terms returned from the grey literature. Temporal terms are giving the most issues – definitions of time vary greatly. Current challenges include:
- expert usability evaluations in the next couple of weeks
- problems where archaeologists’ tasks are not supported by existing data/standards
- brooch work is progressing, but brooches are substantially more varied than the flint tools, requiring rotation and fragment matching
- linking out to other repositories is more challenging than expected due to non-standard or absent APIs, and image processing cannot happen in real time
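The image-matching idea can be sketched with a simple perceptual hash, assuming the Pillow library: shrink each image, record which adjacent pixels get brighter, and count how many of those bits two images share. DADAISM’s actual matching, which handles rotation and fragments, is far more sophisticated:

```python
# Compare images by perceptual "difference hash": shrink to greyscale,
# compare adjacent pixels, and count how many bits two hashes share.
# A toy sketch assuming Pillow; DADAISM's matching (rotation, fragments)
# is far more sophisticated.
from PIL import Image

def dhash(path, size=8):
    img = Image.open(path).convert("L").resize((size + 1, size))
    pixels = list(img.getdata())
    bits = []
    for row in range(size):
        for col in range(size):
            left = pixels[row * (size + 1) + col]
            right = pixels[row * (size + 1) + col + 1]
            bits.append(left < right)
    return bits

def similarity(path_a, path_b):
    a, b = dhash(path_a), dhash(path_b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

# e.g. similarity("query_brooch.jpg", "archive_brooch.jpg") -> 0.0..1.0
```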
Discussion
The afternoon session was facilitated by AHRC and ESRC and focussed on two main areas:
- Issues and Challenges (faced by the projects and how these might be resolved)
- Future Plans (for the rest of the project and beyond. What plans are there to build on successes and show impact)
DiD3 Issues and Challenges
The following list of issues and challenges came out of the morning’s presentations; attendees from each project discussed these and added to them based on their experience.
Initial list:
- International communication/working
- Interdisciplinary working
- Co-authorship
- Accreditation (of Co-Is)
- Sustainability
- Storage
- How to ensure re-use
- Computing power
- Sharing of large datasets
- Re-using / re-purposing tools
- Recruitment, both sufficient time to recruit and challenges in recruiting subject experts
- FEC models vs Overseas funding models
Others added through discussion:
- Recruitment: it is often difficult to recruit the right people at the right time.
- No money for cloud storage. Do we need to look at other countries’ funding models, or put more money into these areas? There were discussions after the last round as to whether the level of funding was enough.
- Projects need capital to purchase cloud storage or processing power; working with big data requires processing power. Funders should fund development time, storage capacity, cloud processing time and the hosting of web services – this is the message projects are trying to get through to funders. Is there a role for cloud services provided by Jisc?
- Harmonising start dates. For Trees and Tweets the US team started before funding arrived. There are some advantages to spreading the length of time, but grant letters go out at different times based on funders’ processes, so some projects started earlier and some knew about funding before others.
- Harmonised reporting and project management requirements are needed across funders, including harmonising what’s on the grant letters.
- There were complaints about the burden of Jisc’s project management requirements, although milestones and deliverables were clearly defined on the signed grant letters.
A couple of projects highlighted solutions to some of the issues. These included:
- Mining Biodiversity – have regular meetings with partners. Use a Wiki to keep track of work.
- DADAISM – using Basecamp for project management. Consortium agreements helpful.
Projects with just two partners have fewer communication problems; larger teams seem to have the most issues. Projects were uncertain how many partners they should have in a bid: they felt they needed more partners for a greater chance of success, yet fewer partners helps to minimise dependencies. It’s a dilemma because, although it’s easier to have fewer partners, the funders want to encourage collaboration. Funders want to see added value from these international relationships, but if collaboration causes too many issues then it’s not adding value. There needs to be additional value from the programme above what you might get from just funding UK research. There are also often cultural issues between international partners.
Future
Knowing what you do now, what would you tell yourself 18 months ago?
- Loosely couple project deliverables to reduce idle time.
- Put more money into computing power and think about a cloud solution. Having had no experience of working with 10K words, they didn’t realise these requirements.
- Better guidance from funders, possibly.
- Extend the potential partner base to countries that might have less experience of these areas, pairing less experienced countries with more experienced ones so that projects contain different levels of experience.
- With big data you don’t know how much space and capacity you need until you start working on the data.
- DADAISM is using a research hub – similar to a research data hub, rather like a toolkit – to provide information and guidance. This has proved useful.
Conference
AHRC and ESRC presented a draft agenda for the end-of-programme conference, planned for two days in January/February 2016 in Glasgow, although this hasn’t been confirmed. It’s currently in the early stages of planning, but they have approval to host it in the UK. The funders value input from the projects on what would be useful for future projects.
The agenda includes 12 minutes for project presentations and parallel sessions. Overall the projects weren’t positive about the short time allowed to present or the clashing of sessions. The funders want to add value to the event rather than just have a lot of presentations to sit through, but projects felt that international partners would not travel to the UK for such a short presentation. It was suggested that each project produce a poster and have a stand, which would be good for networking; videos could also be produced for the projects. Projects also felt the event should open with Digging presentations, not general ones – people won’t know who to talk to unless they have seen presentations or videos first.
Based on the feedback provided the funders will revise the agenda.
Summary
The day closed with a brief summary of the meeting. The presentations were all interesting and there had been a lot of useful points raised in the discussion. It had been an enjoyable event and it was appreciated that projects had made the effort to attend. Projects were reminded that we are here to help and promote their projects as needed.