Transfer Learning across Several Centuries: Machine and Historian Integrated Method to Decipher Royal Secretary's Diary
Sojung Lucia Kim, Taehong Jang, Joonmo Ahn, Hyungil Lee, and Jaehyuk Lee

TL;DR
This study develops a specialized Korean historical corpus with annotated entities, fine-tunes language models for NER, and demonstrates that phrase markers significantly enhance the detection of unseen entities in centuries-old documents.
Contribution
Introduces a Korean historical corpus with annotated entities, and shows that phrase markers improve NER performance on historical texts.
Findings
Phrase markers improve NER accuracy on unseen entities.
Corpus-specific models alone do not outperform pretrained models.
Combining time and annotation info benefits historical text analysis.
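The first finding depends on separating entities the model saw during training from those it did not. A minimal sketch of such a split is given here, assuming entities are represented as `(surface_text, label)` pairs and that "unseen" means the surface form never appears in the training annotations. The function names, the toy data, and this definition of "unseen" are illustrative assumptions, not the authors' evaluation protocol.

```python
from typing import List, Tuple

Entity = Tuple[str, str]  # (surface text, entity label); hypothetical representation


def split_seen_unseen(train: List[Entity], gold_test: List[Entity]):
    """Partition gold test entities by whether their surface form occurred in training."""
    train_surfaces = {text for text, _label in train}
    seen = [e for e in gold_test if e[0] in train_surfaces]
    unseen = [e for e in gold_test if e[0] not in train_surfaces]
    return seen, unseen


def recall(gold: List[Entity], predicted: List[Entity]) -> float:
    """Fraction of gold entities that the model predicted with the same text and label."""
    if not gold:
        return 0.0
    predicted_set = set(predicted)
    hits = sum(1 for e in gold if e in predicted_set)
    return hits / len(gold)


# Toy data for illustration only; not taken from the corpus.
train = [("Hanyang", "LOC"), ("Kim", "PER")]
gold_test = [("Hanyang", "LOC"), ("Pak", "PER"), ("Jeonju", "LOC")]
predicted = [("Hanyang", "LOC"), ("Pak", "PER")]

seen, unseen = split_seen_unseen(train, gold_test)
seen_recall = recall(seen, predicted)
unseen_recall = recall(unseen, predicted)
print(f"seen recall: {seen_recall:.2f}, unseen recall: {unseen_recall:.2f}")
```

Reporting the two recall values separately makes it possible to check whether an input feature such as phrase markers helps specifically on entities absent from training, rather than only on memorized ones.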
Abstract
Named entity recognition and classification (NER) plays a foremost role in capturing semantics in data and in anchoring translation, as well as in downstream historical study. However, NER in historical text faces challenges such as the scarcity of annotated corpora, multilingual variety, various kinds of noise, and conventions far different from those of the contemporary language that language models are built on. This paper introduces a Korean historical corpus (the Diary of the Royal Secretary, named SeungJeongWon) recorded over several centuries and recently augmented with named entity information as well as phrase markers carefully annotated by historians. We fine-tuned a language model on the historical corpus and conducted extensive comparative experiments using our language model and pretrained multilingual models. We set up a hypothesis on the combination of time and annotation information and tested it based…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
