Ancient-Modern Chinese Translation with a Large Training Dataset

Dayiheng Liu; Jiancheng Lv; Kexin Yang; Qian Qu

arXiv:1808.03738·cs.CL·November 20, 2019

TL;DR

This paper introduces a large-scale parallel corpus for Ancient-Modern Chinese translation and a novel clause alignment method used to build it, and reports baseline comparisons of SMT and NMT models on the new data.

Contribution

It presents the first large high-quality Ancient-Modern Chinese dataset and a new clause alignment approach combining lexical and statistical information.

Findings

01. Achieved an F1-score of 94.2 on clause alignment.

02. Created a parallel corpus of 1.24 million bilingual pairs.

03. Provided baseline performance comparisons of SMT and NMT models.

Abstract

Ancient Chinese carries the wisdom and spiritual culture of the Chinese nation. Automatic translation from ancient Chinese to modern Chinese helps to inherit and carry forward the quintessence of the ancients. However, the lack of a large-scale parallel corpus limits the study of Ancient-Modern Chinese machine translation. In this paper, we propose an Ancient-Modern Chinese clause alignment approach based on the characteristics of these two languages. This method combines lexical and statistical information and achieves an F1-score of 94.2 on our manually annotated test set. We use this method to create a new large-scale Ancient-Modern Chinese parallel corpus that contains 1.24M bilingual pairs. To the best of our knowledge, this is the first large high-quality Ancient-Modern Chinese dataset. Furthermore, we analyzed and compared the performance of the SMT and various…
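
The abstract describes an alignment method that combines lexical and statistical information and is scored by F1, but it does not give the details here. The following is only a minimal sketch of that general idea, not the authors' method. Everything in it is an assumption made for illustration: the shared-character overlap as the lexical signal, the Gaussian length-ratio score (with made-up `mean_ratio` and `std` values) as the statistical signal, the equal weighting, and the restriction to monotonic one-to-one alignments.

```python
import math

def lexical_score(ancient, modern):
    # Assumed lexical signal: share of distinct ancient characters reused in the modern clause.
    a_chars = set(ancient)
    if not a_chars:
        return 0.0
    return len(a_chars & set(modern)) / len(a_chars)

def length_score(ancient, modern, mean_ratio=1.5, std=0.5):
    # Assumed statistical signal: Gaussian-shaped score of the modern/ancient length ratio.
    # mean_ratio and std are illustrative values, not figures from the paper.
    if not ancient:
        return 0.0
    ratio = len(modern) / len(ancient)
    return math.exp(-((ratio - mean_ratio) ** 2) / (2 * std ** 2))

def pair_score(ancient, modern, weight=0.5):
    # Linear mix of the two signals; the weight is a hypothetical hyperparameter.
    return weight * lexical_score(ancient, modern) + (1 - weight) * length_score(ancient, modern)

def align(ancient_clauses, modern_clauses, weight=0.5):
    # Monotonic one-to-one alignment by dynamic programming; clauses may be skipped.
    n, m = len(ancient_clauses), len(modern_clauses)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = pair_score(ancient_clauses[i - 1], modern_clauses[j - 1], weight)
            best[i][j], move[i][j] = best[i - 1][j - 1] + score, "match"
            if best[i - 1][j] > best[i][j]:
                best[i][j], move[i][j] = best[i - 1][j], "skip_ancient"
            if best[i][j - 1] > best[i][j]:
                best[i][j], move[i][j] = best[i][j - 1], "skip_modern"
    # Walk the back-pointers from the end to recover the aligned index pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if move[i][j] == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "skip_ancient":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def f1(predicted, gold):
    # Standard F1 over sets of aligned (ancient_index, modern_index) pairs.
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

ancient = ["学而时习之", "不亦说乎"]
modern = ["学习并且时常温习", "不也是很愉快吗"]
predicted = align(ancient, modern)
print(predicted, f1(predicted, [(0, 0), (1, 1)]))
```

Operating on characters avoids the need for a word segmenter, which is convenient for a toy example; a realistic aligner would also have to handle one-to-many and many-to-one clause pairs, which this sketch deliberately leaves out.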
