Corpus Augmentation by Sentence Segmentation for Low-Resource Neural Machine Translation
Jinyi Zhang, Tadahiro Matsumoto

TL;DR
This paper introduces a corpus augmentation technique for low-resource neural machine translation that segments sentences and generates pseudo-parallel pairs, improving translation quality for Japanese-Chinese pairs.
Contribution
The paper proposes a method that combines sentence segmentation with back-translation to augment low-resource parallel corpora for NMT.
Findings
Improved translation performance on Japanese-Chinese datasets
Effective sentence segmentation enhances pseudo-parallel data quality
Method benefits low-resource language pair translation
Abstract
Neural machine translation (NMT) has achieved impressive results, but its translation quality depends strongly on the size and quality of the parallel corpora used for training. For many language pairs, however, no large parallel corpora exist. We propose a corpus augmentation method that segments long sentences in a corpus using back-translation and generates pseudo-parallel sentence pairs. Experiments on Japanese-Chinese and Chinese-Japanese translation with the Japanese-Chinese scientific paper excerpt corpus (ASPEC-JC) show that the method improves translation performance.
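The abstract names the ingredients (segmenting long sentences, back-translation, pseudo-parallel pairs) but this summary does not give the exact procedure. The following is only a rough sketch of how such an augmentation loop could look, not the authors' implementation. The delimiter set, the `min_len` threshold, the function names, and the `back_translate` callback (a stand-in for a trained target-to-source NMT model) are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Clause-level delimiters used as candidate split points (an assumption,
# not taken from the paper).
DELIMITERS = "、，；;"


def split_long_sentence(sentence: str, min_len: int = 20) -> List[str]:
    """Split a sentence at clause delimiters if it has at least min_len characters."""
    if len(sentence) < min_len:
        return [sentence]
    segments: List[str] = []
    current = ""
    for ch in sentence:
        current += ch
        if ch in DELIMITERS:
            segments.append(current.strip())
            current = ""
    if current.strip():
        segments.append(current.strip())
    return [seg for seg in segments if seg]


def augment(
    pairs: List[Tuple[str, str]],
    back_translate: Callable[[str], str],
    min_len: int = 20,
) -> List[Tuple[str, str]]:
    """Return the original pairs plus pseudo-parallel pairs built from target segments.

    back_translate is a hypothetical callback that maps a target-language
    segment to a synthetic source-language segment.
    """
    augmented = list(pairs)
    for _src, tgt in pairs:
        segments = split_long_sentence(tgt, min_len)
        if len(segments) < 2:
            continue  # sentence was short or had no split point
        for seg in segments:
            augmented.append((back_translate(seg), seg))
    return augmented
```

In practice `back_translate` would wrap a model trained on the original corpus in the reverse direction, and the augmented list would be used to retrain the forward model. Whether short segments help or hurt likely depends on the threshold chosen, so `min_len` is worth tuning.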
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Biomedical Text Mining and Ontologies
