Shared Heritage, Distinct Writing: Rethinking Resource Selection for East Asian Historical Documents

Seyoung Song; Haneul Yoo; Jiho Jin; Kyunghyun Cho; Alice Oh

arXiv:2411.04822 · cs.CL · March 24, 2026

TL;DR

This paper critically examines the effectiveness of using Classical Chinese resources for processing Korean and Japanese historical documents, finding minimal transfer benefits in most scenarios and highlighting the importance of empirical validation.

Contribution

The paper challenges the assumption that Classical Chinese datasets substantially improve cross-lingual transfer to Korean and Japanese historical texts, backing the challenge with experiments on machine translation, named entity recognition, and punctuation restoration.

Findings

01

Classical Chinese resources have limited impact on tasks over Korean documents written in Hanja.

02

Performance gains from Classical Chinese data diminish as local language data increases.

03

Substantial improvements are only observed in extremely low-resource settings.

Abstract

Historical documents in the Sinosphere are known to share common formats and practices, particularly in veritable records compiled by court historians. This shared linguistic heritage has led researchers to use Classical Chinese resources for cross-lingual transfer when processing historical documents from Korea and Japan, which remain relatively low-resource. In this paper, we question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun, the ancient written languages of Korea and Japan, respectively. Our experiments across machine translation, named entity recognition, and punctuation restoration tasks show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja, with performance differences within $\pm 0.0068$ F1-score for sequence labeling tasks and up to $+0.84$ BLEU score for…
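
To make the kind of comparison described in the abstract concrete, the following is a minimal sketch of how an F1 difference between a baseline and a model trained with additional Classical Chinese data could be computed for a sequence labeling task such as punctuation restoration. Everything here is an assumption for illustration: the function names `micro_f1` and `f1_delta`, the label set, and the toy sequences are hypothetical, and token-level micro-averaged F1 is not necessarily the metric or evaluation script the authors used.

```python
from typing import Sequence


def micro_f1(gold: Sequence[str], pred: Sequence[str], ignore: str = "O") -> float:
    """Token-level micro-averaged F1 over all labels except `ignore`.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # A true positive is a token whose non-ignored gold label is predicted exactly.
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != ignore)
    pred_pos = sum(1 for p in pred if p != ignore)
    gold_pos = sum(1 for g in gold if g != ignore)

    if tp == 0:
        return 0.0
    precision = tp / pred_pos
    recall = tp / gold_pos
    return 2 * precision * recall / (precision + recall)


def f1_delta(gold: Sequence[str], baseline: Sequence[str], augmented: Sequence[str]) -> float:
    """F1 of the augmented system minus F1 of the baseline on the same gold labels."""
    return micro_f1(gold, augmented) - micro_f1(gold, baseline)


# Toy labels (made up): "O" means no punctuation follows the character.
gold      = ["O", "COMMA", "O", "O",     "PERIOD", "O", "COMMA", "O", "PERIOD"]
baseline  = ["O", "COMMA", "O", "O",     "PERIOD", "O", "O",     "O", "PERIOD"]
augmented = ["O", "COMMA", "O", "COMMA", "PERIOD", "O", "COMMA", "O", "PERIOD"]

delta = f1_delta(gold, baseline, augmented)
print(f"F1 delta (augmented - baseline): {delta:+.4f}")
```

A delta close to zero on a held-out set, as in the $\pm 0.0068$ range quoted above, would indicate that the additional data made little practical difference for that task.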

Code & Models

Repositories

seyoungsong/classical-chinese-transfer (official)
