HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea

Haneul Yoo; Jiho Jin; Juhee Son; JinYeong Bak; Kyunghyun Cho; Alice Oh

arXiv:2210.05112·cs.CL·October 12, 2022

TL;DR

This paper introduces the Hanja Understanding Evaluation (HUE) dataset and BERT-based models for understanding Hanja documents of ancient Korea, with the aim of speeding up historians' analysis of these historical texts.

Contribution

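The paper releases the Hanja Understanding Evaluation (HUE) dataset, which covers chronological attribution, topic classification, named entity recognition, and summary retrieval. It also continues training BERT models on two historical Korean corpora, the Annals of the Joseon Dynasty and the Diaries of the Royal Secretariats, and compares them with several baselines on all tasks.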

Findings

01. Models trained on historical corpora outperform baselines.

02. Significant improvements in classification and retrieval tasks.

03. Zero-shot experiments show potential for analyzing lesser-studied texts.

Abstract

Historical records in Korea before the 20th century were primarily written in Hanja, an extinct language based on Chinese characters that is not understood by modern Korean or Chinese speakers. Historians with expertise in this period have been analyzing the documents, but the process is difficult and time-consuming, and language models could speed it up significantly. Toward building and evaluating language models for Hanja, we release the Hanja Understanding Evaluation dataset, consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks. We also present BERT-based models obtained by continued training on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and the Diaries of the Royal Secretariats. We compare the models with several baselines on all tasks and show there are significant…

Code & Models

Repositories

haneul-yoo/hue (Official)
