Back-Translation Sampling by Targeting Difficult Words in Neural Machine Translation
Marzieh Fadaee, Christof Monz

TL;DR
This paper investigates why back-translation improves neural machine translation and proposes sampling strategies that target difficult-to-predict words and their contexts, improving BLEU by up to 1.7 points for German-English and 1.2 points for English-German.
Contribution
The paper introduces sampling methods that select monolingual sentences for back-translation by prioritizing difficult-to-predict words and their contexts, which improves translation quality over random sampling.
Findings
Up to 1.7 BLEU point improvement for German-English
Up to 1.2 BLEU point improvement for English-German
Targeted sampling outperforms random sampling in back-translation
Abstract
Neural Machine Translation has achieved state-of-the-art performance for several language pairs using a combination of parallel and synthetic data. Synthetic data is often generated by back-translating sentences randomly sampled from monolingual data using a reverse translation model. While back-translation has been shown to be very effective in many cases, it is not entirely clear why. In this work, we explore different aspects of back-translation, and show that words with high prediction loss during training benefit most from the addition of synthetic data. We introduce several variations of sampling strategies targeting difficult-to-predict words using prediction losses and frequencies of words. In addition, we also target the contexts of difficult words and sample sentences that are similar in context. Experimental results for the WMT news translation task show that our method…
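The abstract describes selecting sentences for back-translation by targeting words with high prediction loss. The Python sketch below illustrates that general idea only and is not the authors' implementation. It assumes per-token losses are already available as (word, loss) pairs, and the function names, the mean-loss threshold, and the whitespace tokenization are all hypothetical simplifications. The paper's own variations, which also use word frequencies and context similarity, are not reproduced here.

```python
from collections import defaultdict
import random


def mean_word_losses(token_losses):
    """Average the prediction loss of each word over all its occurrences."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for word, loss in token_losses:
        totals[word] += loss
        counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


def difficult_words(token_losses, loss_threshold):
    """Return words whose mean loss exceeds the (assumed) threshold."""
    means = mean_word_losses(token_losses)
    return {word for word, mean in means.items() if mean > loss_threshold}


def sample_targeted(monolingual, hard_words, budget, seed=0):
    """Pick up to `budget` sentences containing at least one difficult word."""
    rng = random.Random(seed)
    # Whitespace tokenization is an assumption made for brevity.
    candidates = [s for s in monolingual if hard_words & set(s.split())]
    rng.shuffle(candidates)
    return candidates[:budget]


if __name__ == "__main__":
    # Toy data: hypothetical (word, loss) pairs recorded during training.
    losses = [("the", 0.1), ("the", 0.3), ("zebra", 4.0),
              ("quark", 5.0), ("quark", 3.0), ("cat", 1.0)]
    corpus = ["the zebra runs", "the cat sleeps",
              "a quark decays", "the dog barks"]
    hard = difficult_words(losses, loss_threshold=2.0)
    print(sample_targeted(corpus, hard, budget=2))
```

In a full pipeline, the selected sentences would then be passed through the reverse translation model to produce the synthetic side of each pair. Random sampling corresponds to skipping the filter and shuffling the whole monolingual corpus.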
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
