Mining Local Gazetteers of Literary Chinese with CRF and Pattern based Methods for Biographical Information in Chinese History

Chao-Lin Liu; Chih-Kai Huang; Hongsu Wang; Peter K. Bol

arXiv:1511.01556·cs.CL·November 17, 2016

TL;DR

This paper develops CRF-based and pattern-based methods for automatically extracting person names and location names from local gazetteers (Difangzhi) written in literary Chinese, in support of historical research and database compilation.

Contribution

The paper introduces algorithmic approaches to recognizing named entities in literary Chinese and extends the work to mining document structures in historical texts.

Findings

01

High accuracy in extracting names and addresses from gazetteers

02

Thousands of biographical entities identified, many matching existing databases

03

Potential to expand the China Biographical Database with new verified data

Abstract

Person names and location names are essential building blocks for identifying events and social networks in historical documents written in literary Chinese. We take the lead in exploring the algorithmic recognition of named entities in literary Chinese for historical studies, using language-model-based and conditional-random-field-based methods, and extend our work to mining the document structures of historical documents. Practical evaluations were conducted with texts extracted from more than 220 volumes of local gazetteers (Difangzhi). Difangzhi is a huge collection, and the single most important one, containing information about officers who served in local government in Chinese history. Our methods performed very well on these realistic tests. Thousands of names and addresses were identified from the texts. A good portion of the extracted names match the…
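
To make the pattern-based side of the approach concrete, here is a minimal Python sketch of how a fixed textual pattern could pull a name and a place out of unpunctuated literary Chinese. The pattern itself (name, then 字, then a two-character courtesy name, then a place followed by 人, "native of") is an illustrative assumption, as are the sample strings and the names `RECORD` and `extract_records`; none of them is taken from the paper, whose actual patterns and CRF features are not described in this summary.

```python
import re

# Hypothetical pattern, NOT taken from the paper:
#   NAME 字 COURTESY-NAME PLACE 人
# e.g. "陳敏字叔達晉江人" reads as
#   "Chen Min, courtesy name Shuda, native of Jinjiang".
HAN = r"[\u4e00-\u9fff]"  # one CJK Unified Ideograph
RECORD = re.compile(
    rf"(?P<name>{HAN}{{2,3}})字(?P<style>{HAN}{{2}})(?P<place>{HAN}{{2,3}})人"
)

def extract_records(text):
    """Return a list of dicts (name, style, place), one per pattern match."""
    return [m.groupdict() for m in RECORD.finditer(text)]

if __name__ == "__main__":
    # Made-up sample text with two records and no punctuation.
    sample = "陳敏字叔達晉江人李文字子華南安人"
    for record in extract_records(sample):
        print(record)
```

A rigid pattern like this is brittle: it misses single-character courtesy names and misfires when 字 or 人 appears inside a name or place, which is one reason to pair patterns with a statistical sequence labeler such as a CRF.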
