Investigating Backtranslation in Neural Machine Translation
Alberto Poncelas, Dimitar Shterionov, Andy Way, Gideon Maillette de Buy Wenniger, Peyman Passban

TL;DR
This paper investigates how back-translated monolingual data affects neural machine translation performance, analyzing how different amounts of synthetic data influence translation quality in German-to-English translation.
Contribution
It provides a detailed analysis of how back-translated data impacts NMT performance, including effects of data size and combination with authentic data.
Findings
Back-translated data improves NMT performance, especially in resource-scarce scenarios.
The impact varies with the amount of back-translated data used.
Combining back-translated with authentic data yields better results than using either alone.
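The combination of back-translated and authentic data described in these findings can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `back_translate`, `build_training_set`, and the word-lookup `toy_en_to_de` are hypothetical names, and the toy lexicon stands in for a trained English-to-German NMT model. For a German-to-English system, monolingual English sentences are translated back into German, and each synthetic German source is paired with its authentic English target.

```python
from typing import Callable, List, Tuple

# (German source, English target); hypothetical alias for illustration.
SentencePair = Tuple[str, str]


def back_translate(
    target_monolingual: List[str],
    reverse_translate: Callable[[str], str],
) -> List[SentencePair]:
    """Pair each authentic English sentence with a machine-generated German source."""
    return [(reverse_translate(en), en) for en in target_monolingual]


def build_training_set(
    authentic: List[SentencePair],
    synthetic: List[SentencePair],
    max_synthetic: int,
) -> List[SentencePair]:
    """Combine authentic pairs with at most `max_synthetic` back-translated pairs."""
    return list(authentic) + list(synthetic[:max_synthetic])


def toy_en_to_de(sentence: str) -> str:
    """Stand-in for a real English-to-German model (assumption): word lookup only."""
    lexicon = {"the": "das", "house": "Haus", "is": "ist", "small": "klein"}
    return " ".join(lexicon.get(word, word) for word in sentence.split())


authentic = [("das Buch ist gut", "the book is good")]
monolingual_en = ["the house is small", "the house is old"]

synthetic = back_translate(monolingual_en, toy_en_to_de)
hybrid = build_training_set(authentic, synthetic, max_synthetic=1)

for source, target in hybrid:
    print(f"{source}\t{target}")
```

Varying `max_synthetic` is one simple way to study how the amount of back-translated data changes the training mix; setting it to 0 gives an authentic-only baseline.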
Abstract
A prerequisite for training corpus-based machine translation (MT) systems -- either Statistical MT (SMT) or Neural MT (NMT) -- is the availability of high-quality parallel data. This is arguably more important today than ever before, as NMT has been shown in many studies to outperform SMT, but mostly when large parallel corpora are available; in cases where data is limited, SMT can still outperform NMT. Recently researchers have shown that back-translating monolingual data can be used to create synthetic parallel corpora, which in turn can be used in combination with authentic parallel data to train a high-quality NMT system. Given that large collections of new parallel text become available only quite rarely, backtranslation has become the norm when building state-of-the-art NMT systems, especially in resource-poor scenarios. However, we assert that there are many unknown factors…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
