Building a Neural Machine Translation System Using Only Synthetic Parallel Data

Jaehong Park; Jongyoon Song; Sungroh Yoon

arXiv:1704.00253 · cs.CL · September 19, 2017 · 20 cites

Open Access

TL;DR

This paper demonstrates that neural machine translation systems can be effectively built using only synthetic parallel data, introducing a novel pseudo parallel corpus that mixes ground truth and synthetic examples for improved translation quality.

Contribution

The paper introduces a pseudo parallel corpus in which ground truth and synthetic examples are mixed on both sides of sentence pairs, so that NMT systems can be trained without any real parallel data.

Findings

01

The pseudo parallel corpus improves results for bidirectional translation tasks

02

Synthetic data alone can train effective NMT systems

03

Adding a real parallel corpus to the pseudo parallel corpus yields substantial further improvement

Abstract

Recent works have shown that synthetic parallel data automatically generated by translation models can be effective for various neural machine translation (NMT) issues. In this study, we build NMT systems using only synthetic parallel data. As an efficient alternative to real parallel data, we also present a new type of synthetic parallel corpus. The proposed pseudo parallel data are distinct from previous works in that ground truth and synthetic examples are mixed on both sides of sentence pairs. Experiments on Czech-German and French-German translations demonstrate the efficacy of the proposed pseudo parallel corpus, which shows not only enhanced results for bidirectional translation tasks but also substantial improvement with the aid of a ground truth real parallel corpus.
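
The mixing idea in the abstract can be sketched in a few lines of Python. This is only one plausible reading, not the authors' pipeline: the function name `build_mixed_corpus` and the two translator callables `src_to_tgt` and `tgt_to_src` are hypothetical stand-ins for whatever translation models generate the synthetic sentences. The sketch pairs ground truth source sentences with synthetic targets, pairs synthetic sources with ground truth targets, and shuffles the two sets together so that each side of the corpus contains both kinds of text.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_mixed_corpus(
    src_sentences: List[str],
    tgt_sentences: List[str],
    src_to_tgt: Callable[[str], str],  # hypothetical translator, assumed for illustration
    tgt_to_src: Callable[[str], str],  # hypothetical translator, assumed for illustration
    seed: int = 0,
) -> List[Pair]:
    """Mix two kinds of synthetic pairs into one pseudo parallel corpus."""
    # Ground truth on the source side, synthetic text on the target side.
    source_originated = [(s, src_to_tgt(s)) for s in src_sentences]
    # Synthetic text on the source side, ground truth on the target side.
    target_originated = [(tgt_to_src(t), t) for t in tgt_sentences]
    # Concatenate and shuffle so both sides hold a blend of real and synthetic text.
    mixed = source_originated + target_originated
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy usage with placeholder "translators" that only tag their input.
corpus = build_mixed_corpus(
    ["bonjour", "merci"],
    ["guten Tag"],
    src_to_tgt=lambda s: "<synthetic-de> " + s,
    tgt_to_src=lambda t: "<synthetic-fr> " + t,
)
```

In practice the two callables would wrap trained translation models, and the ratio between the two kinds of pairs is a design choice this sketch leaves open.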

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies