Corpus Augmentation by Sentence Segmentation for Low-Resource Neural Machine Translation

Jinyi Zhang; Tadahiro Matsumoto

arXiv:1905.08945 · cs.CL · May 23, 2019 · 5 cites

TL;DR

This paper introduces a corpus augmentation technique for low-resource neural machine translation that segments long sentences and uses back-translation to generate pseudo-parallel sentence pairs, improving Japanese-Chinese and Chinese-Japanese translation on the ASPEC-JC corpus.

Contribution

It proposes a method that combines sentence segmentation with back-translation to augment low-resource parallel corpora for NMT.

Findings

01

Improved translation performance in both the Japanese-Chinese and Chinese-Japanese directions on ASPEC-JC

02

Effective sentence segmentation enhances pseudo-parallel data quality

03

Method benefits low-resource language pair translation

Abstract

Neural Machine Translation (NMT) has been proven to achieve impressive results. The translation quality of an NMT system depends strongly on the size and quality of its parallel corpora. Nevertheless, for many language pairs, no rich-resource parallel corpora exist. In this paper, we propose a corpus augmentation method that segments long sentences in a corpus using back-translation and generates pseudo-parallel sentence pairs. Experimental results for Japanese-Chinese and Chinese-Japanese translation on the Japanese-Chinese scientific paper excerpt corpus (ASPEC-JC) show that the method improves translation performance.
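
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how sentences are segmented or how segments are paired, so the punctuation-based splitting rule, the `min_chars` threshold, and the function names `segment` and `augment` are all assumptions made for illustration. The `back_translate` argument stands in for a trained target-to-source NMT model and is supplied by the caller.

```python
import re
from typing import Callable, List, Tuple

# Assumption: clause boundaries are approximated by commas and semicolons
# (ASCII and full-width). The paper's actual segmentation rule may differ.
_CLAUSE = re.compile(r"[^,;，、；]+[,;，、；]?")


def segment(sentence: str) -> List[str]:
    """Split a sentence into clause-like segments, keeping each delimiter."""
    return [m.strip() for m in _CLAUSE.findall(sentence) if m.strip()]


def augment(
    pairs: List[Tuple[str, str]],
    back_translate: Callable[[str], str],
    min_chars: int = 20,  # hypothetical threshold for a "long" sentence
) -> List[Tuple[str, str]]:
    """Return the original pairs plus pseudo-parallel pairs built from segments.

    Each long target sentence is split into segments, and every segment is
    back-translated into the source language to form a new (source, target) pair.
    """
    augmented = list(pairs)
    for _source, target in pairs:
        if len(target) < min_chars:
            continue  # short sentences are left as they are
        segments = segment(target)
        if len(segments) < 2:
            continue  # no boundary found, so there is nothing to add
        for seg in segments:
            augmented.append((back_translate(seg), seg))
    return augmented
```

In practice `back_translate` would wrap a model trained on the original corpus in the reverse direction; passing it in as a callable keeps the sketch independent of any particular NMT toolkit.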

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Biomedical Text Mining and Ontologies