Domain, Translationese and Noise in Synthetic Data for Neural Machine Translation
Nikolay Bogoychev, Rico Sennrich

TL;DR
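This paper investigates how domain, translationese, and noise affect forward and back-translation in neural machine translation. It finds that forward translation helps most on sentences originally written in the source language, that back-translation helps most on sentences originally written in the target language, and that forward translation is more sensitive to initial translation quality and low-resource conditions.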
Contribution
It provides a detailed analysis of when and why forward and back-translation are effective, highlighting the influence of translationese, domain differences, and resource constraints.
Findings
Forward translation improves BLEU on source-language original sentences.
Back-translation yields larger gains on target-language original sentences.
Forward translation is more sensitive to initial translation quality and low-resource conditions.
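The findings above depend on evaluating sentences separately according to the language they were originally written in. A minimal sketch of that split follows, assuming each test segment carries an original-language tag. The `Segment` type, its field names, and the toy sentences are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    """One test-set entry (hypothetical layout, for illustration only)."""
    source: str
    reference: str
    orig_lang: str  # language the sentence was originally written in


def split_by_original_language(test_set: List[Segment]) -> Dict[str, List[Segment]]:
    """Group test segments by the language they were originally written in."""
    subsets: Dict[str, List[Segment]] = defaultdict(list)
    for segment in test_set:
        subsets[segment.orig_lang].append(segment)
    return dict(subsets)


# Toy French-to-English test set with mixed original languages.
test_set = [
    Segment("Bonjour.", "Hello.", "fr"),         # originally French (source side)
    Segment("Merci.", "Thank you.", "en"),       # originally English (target side)
    Segment("Il pleut.", "It is raining.", "fr"),
]

subsets = split_by_original_language(test_set)
for lang, segments in subsets.items():
    print(lang, len(segments))
```

Each subset can then be scored on its own, so that gains on source-original and target-original sentences are reported separately instead of being averaged together.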
Abstract
The quality of neural machine translation can be improved by leveraging additional monolingual resources to create synthetic training data. Source-side monolingual data can be (forward-)translated into the target language for self-training; target-side monolingual data can be back-translated. It has been widely reported that back-translation delivers superior results, but could this be due to artefacts in the test sets? We perform a case study using a French-English news translation task and separate test sets based on their original languages. We show that forward translation delivers superior gains in terms of BLEU on sentences that were originally in the source language, complementing previous studies which show large improvements with back-translation on sentences that were originally in the target language. To better understand when and why forward and back-translation are effective,…
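The two ways of creating synthetic training data described in the abstract can be sketched as follows. This is an illustrative outline only, not the authors' pipeline: the `Translator` type and the toy translator functions are hypothetical stand-ins for trained NMT systems.

```python
from typing import Callable, List, Tuple

Translator = Callable[[str], str]   # hypothetical: any sentence-level MT system
Pair = Tuple[str, str]              # (source sentence, target sentence)


def forward_translate(src_mono: List[str], src_to_tgt: Translator) -> List[Pair]:
    """Self-training: real source sentences paired with synthetic targets."""
    return [(src, src_to_tgt(src)) for src in src_mono]


def back_translate(tgt_mono: List[str], tgt_to_src: Translator) -> List[Pair]:
    """Back-translation: synthetic sources paired with real target sentences."""
    return [(tgt_to_src(tgt), tgt) for tgt in tgt_mono]


# Toy stand-ins for trained models; they only tag their input.
def toy_fr_to_en(sentence: str) -> str:
    return f"<en:{sentence}>"


def toy_en_to_fr(sentence: str) -> str:
    return f"<fr:{sentence}>"


parallel = [("le chat dort", "the cat sleeps")]            # genuine parallel data
ft_pairs = forward_translate(["il pleut"], toy_fr_to_en)   # source-side monolingual
bt_pairs = back_translate(["it is late"], toy_en_to_fr)    # target-side monolingual

# Synthetic pairs are added to the genuine parallel data for training.
training_data = parallel + ft_pairs + bt_pairs
```

The asymmetry is visible in the pairs themselves: forward translation puts machine output on the target side, so errors from the initial system end up in what the model learns to generate, whereas back-translation keeps the target side human-written.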
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Test
