Exploring Diversity in Back Translation for Low-Resource Machine Translation
Laurie Burchell, Alexandra Birch, Kenneth Heafield

TL;DR
This paper introduces a nuanced framework for measuring lexical and syntactic diversity in back translation, demonstrating that higher diversity, especially lexical, improves low-resource neural machine translation.
Contribution
It proposes new metrics for diversity, analyzes their impact on translation quality, and shows nucleus sampling enhances diversity and performance in low-resource settings.
Findings
Generating back translation with nucleus sampling yields higher final translation performance.
Lexical diversity is more crucial than syntactic diversity.
Diversity metrics correlate with improved translation quality.
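For readers unfamiliar with the decoding method named in the findings, here is a minimal sketch of nucleus (top-p) sampling over a single next-token distribution. It is a generic illustration, not the authors' implementation: the function names `top_p_indices` and `nucleus_sample`, the toy distribution, and the threshold `p=0.9` are all assumptions made for this example.

```python
import random


def top_p_indices(probs, p=0.9):
    """Return the smallest set of token indices whose total probability reaches p."""
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return nucleus


def nucleus_sample(probs, p=0.9, rng=random):
    """Draw one token index from the top-p nucleus of `probs`."""
    nucleus = top_p_indices(probs, p)
    # random.choices renormalises the weights, so tokens outside the
    # nucleus get probability zero and the rest keep their relative odds.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
    print(top_p_indices(toy_probs, p=0.9))
    print([nucleus_sample(toy_probs, p=0.9) for _ in range(10)])
```

Because sampling is restricted to the high-probability nucleus but remains random within it, repeated decoding of the same sentence can produce varied outputs, which is the kind of diversity the paper studies.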
Abstract
Back translation is one of the most widely used methods for improving the performance of neural machine translation systems. Recent research has sought to enhance the effectiveness of this method by increasing the 'diversity' of the generated translations. We argue that the definitions and metrics used to quantify 'diversity' in previous work have been insufficient. This work puts forward a more nuanced framework for understanding diversity in training data, splitting it into lexical diversity and syntactic diversity. We present novel metrics for measuring these different aspects of diversity and carry out empirical analysis into the effect of these types of diversity on final neural machine translation model performance for low-resource English↔Turkish and mid-resource English↔Icelandic. Our findings show that generating back translation using nucleus…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
