Project Title | Automating Data Extraction from Chinese Texts |
Project Website | Website: did-acte.org/; Blog: did-acte.org/blog/; Twitter: @DID_ACTE |
Start Date | 7 July 2014 |
End Date | 6 April 2016 |
UK Project Manager | Hilde De Weerdt, Leiden University, Leiden Institute for Area Studies. Tel: +31 (0)71 527 6505; Email: h.g.d.g.de.weerdt@hum.leidenuniv.nl |
Project Team | University of Birmingham, Department of History: Naomi Standen (+44 (0)121 414 6881; n.standen@bham.ac.uk). Harvard University, Department of East Asian Languages: Peter K. Bol (pkbol@fas.harvard.edu). China Biographical Database Project: Hongsu (Henry) Wang (hongsuwang@fas.harvard.edu); Hui Cheng (huc869@mail.harvard.edu). Leiden University, Leiden Institute for Area Studies: Hilde De Weerdt (h.g.d.g.de.weerdt@hum.leidenuniv.nl); Hou Ieong Ho (h.i.ho@hum.leidenuniv.nl); Ming-Kin Chu (m.k.chu@hum.leidenuniv.nl) |
Lead Institution | Harvard University |
Project Partners | Data: Institute of History and Philology, Academia Sinica (www2.ihp.sinica.edu.tw/en/). Platform development: Research Center for Digital Humanities, National Taiwan University (www.digital.ntu.edu.tw/en/); Jieh Hsiang: jhsiang@ntu.edu.tw |
Project Plan | http://repository.jisc.ac.uk/5651/ |
Progress Report | http://repository.jisc.ac.uk/6049/ |
Summary
The Automating Data Extraction from Chinese Texts project aims to provide humanists and social scientists with a means of transforming 2,200 years of Chinese texts into structured data. The project will develop an open-source platform (MARKUS) that allows users to apply sophisticated text-mining techniques to a wide variety of historical and literary texts. Users will be able to tag and extract personal names, dates, place names, official titles and postings, kinship ties, other social relationships, and other user-defined content. The platform will be tested against 2,000 local histories spanning an 800-year period, and against roughly 20,000 letters and 500 notebooks dating from the seventh through the thirteenth centuries. Data extracted from the sample repositories will be used to enrich text-mining applications and will also be made available for research through open-access online databases and data archives.
Objectives
“Automating Data Extraction from Chinese Texts” is designed as an international and interdisciplinary collaboration that will promote research techniques for large-scale structured datasets derived from Chinese texts. Users will be able to upload texts into the project platform, query a desired type of information, tag it, and extract the resultant data into a spreadsheet or other structured format. We aim to capture data in its original context. The platform will allow scholars not only to discover how certain terms are used within a given corpus and in what context, but also to analyze the related data using geospatial, statistical, and network analysis.
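As a rough illustration of the upload, query, tag, and extract workflow described above, the following Python sketch matches a small lexicon against a text and writes each hit, with its surrounding context, as a spreadsheet-style CSV row. It is not the MARKUS implementation: the lexicon, the sample sentence, and the function names `tag_text` and `to_csv` are all hypothetical and chosen only for this example.

```python
import csv
import io
import re

# Hypothetical lexicon mapping a term to its tag type.
# These entries are illustrative only and are not taken from MARKUS.
LEXICON = {
    "蘇軾": "personal_name",
    "杭州": "place_name",
    "知州": "official_title",
}


def tag_text(text, lexicon, window=2):
    """Return one record per lexicon match, keeping nearby context."""
    # Try longer terms first so they win over shorter overlapping ones.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in terms))
    records = []
    for match in pattern.finditer(text):
        start, end = match.span()
        records.append({
            "term": match.group(),
            "type": lexicon[match.group()],
            "offset": start,
            # A few characters on either side preserve the original context.
            "context": text[max(0, start - window):end + window],
        })
    return records


def to_csv(records):
    """Serialize the tagged records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["term", "type", "offset", "context"]
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Sample sentence written for this sketch, not quoted from the project corpus.
SAMPLE = "元祐四年蘇軾知杭州"
rows = tag_text(SAMPLE, LEXICON)
print(to_csv(rows))
```

Storing the character offset and a context window next to each tagged term is one simple way to keep extracted data tied to its place in the source text, so that a spreadsheet row can always be traced back to the passage it came from.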
Anticipated Outputs and Outcomes
The project aims to produce three major outputs: (1) an open platform for the tagging and extraction of data from Chinese historical documents (MARKUS), (2) data extracted from local gazetteers, and (3) workshops and supporting documentation.