Building a Neural Machine Translation System Using Only Synthetic Parallel Data
Jaehong Park, Jongyoon Song, Sungroh Yoon

TL;DR
This paper shows that neural machine translation (NMT) systems can be built using only synthetic parallel data, and introduces a pseudo parallel corpus in which ground truth and synthetic examples are mixed on both sides of the sentence pairs to improve translation quality.
Contribution
It introduces a pseudo parallel corpus in which ground truth and synthetic examples are mixed on both sides of the sentence pairs, so that an NMT system can be trained without any real parallel data.
Findings
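The proposed pseudo parallel corpus improves bidirectional translation on Czech-German and French-German
NMT systems can be built using only synthetic parallel data
Adding a real parallel corpus to the pseudo parallel corpus gives a substantial further improvement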
Abstract
Recent works have shown that synthetic parallel data automatically generated by translation models can be effective for various neural machine translation (NMT) issues. In this study, we build NMT systems using only synthetic parallel data. As an efficient alternative to real parallel data, we also present a new type of synthetic parallel corpus. The proposed pseudo parallel data are distinct from previous works in that ground truth and synthetic examples are mixed on both sides of sentence pairs. Experiments on Czech-German and French-German translations demonstrate the efficacy of the proposed pseudo parallel corpus, which shows not only enhanced results for bidirectional translation tasks but also substantial improvement with the aid of a ground truth real parallel corpus.
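The mixing idea in the abstract can be illustrated with a minimal sketch. It assumes two monolingual sentence lists and two already-trained translation functions, one per direction. The names `build_mixed_pseudo_corpus`, `toy_cs_to_de`, and `toy_de_to_cs` are hypothetical and not taken from the paper, and the toy translators only tag their input so the example stays self-contained. The result contains pairs whose source side is real and target side synthetic, alongside pairs whose source side is synthetic and target side real.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_mixed_pseudo_corpus(
    src_mono: List[str],
    tgt_mono: List[str],
    translate_src_to_tgt: Callable[[str], str],
    translate_tgt_to_src: Callable[[str], str],
    seed: int = 0,
) -> List[Pair]:
    """Mix source-originated and target-originated synthetic pairs.

    Assumption: the translation callables stand in for trained models;
    the paper's actual generation procedure is not reproduced here.
    """
    # Real source sentence paired with a synthetic target sentence.
    src_originated = [(s, translate_src_to_tgt(s)) for s in src_mono]
    # Synthetic source sentence paired with a real target sentence.
    tgt_originated = [(translate_tgt_to_src(t), t) for t in tgt_mono]
    mixed = src_originated + tgt_originated
    # Shuffle with a fixed seed so both kinds of pairs are interleaved
    # reproducibly.
    random.Random(seed).shuffle(mixed)
    return mixed


# Hypothetical stand-ins for real translation models.
def toy_cs_to_de(sentence: str) -> str:
    return "de(" + sentence + ")"


def toy_de_to_cs(sentence: str) -> str:
    return "cs(" + sentence + ")"


corpus = build_mixed_pseudo_corpus(
    ["ahoj", "dobrý den"],
    ["hallo", "guten Tag"],
    toy_cs_to_de,
    toy_de_to_cs,
)
```

In this sketch each side of the corpus ends up holding both real and synthetic sentences, which is the property the abstract highlights. How the two portions are balanced or filtered in the actual experiments is not stated in this summary.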
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
