TL;DR
This paper critically examines whether Classical Chinese resources help in processing Korean and Japanese historical documents. It finds minimal transfer benefits in most scenarios and argues that such transfer should be validated empirically rather than assumed.
Contribution
It challenges the assumption that Classical Chinese datasets significantly aid cross-lingual transfer to Korean and Japanese historical texts, backing the challenge with experiments on machine translation, named entity recognition, and punctuation restoration.
Findings
Classical Chinese resources have limited impact on model performance for tasks on Korean documents written in Hanja.
Performance gains from Classical Chinese data diminish as local language data increases.
Substantial improvements are observed only in extremely low-resource settings.
Abstract
Historical documents in the Sinosphere are known to share common formats and practices, particularly in veritable records compiled by court historians. This shared linguistic heritage has led researchers to use Classical Chinese resources for cross-lingual transfer when processing historical documents from Korea and Japan, which remain relatively low-resource. In this paper, we question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun, the ancient written languages of Korea and Japan, respectively. Our experiments across machine translation, named entity recognition, and punctuation restoration tasks show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja, with performance differences within F1-score for sequence labeling tasks and up to BLEU score for…
