The Saturation Point of Backtranslation in High Quality Low Resource English Gujarati Machine Translation
Arwa Arif

TL;DR
This study investigates the limits of backtranslation in high-quality low-resource English-Gujarati machine translation, finding that additional synthetic data may not always improve performance and can sometimes reduce it.
Contribution
It provides the first detailed analysis of backtranslation saturation in high-quality low-resource translation, highlighting potential diminishing returns.
Findings
Backtranslation did not improve BLEU scores in the studied setting.
Adding synthetic data sometimes slightly decreased translation quality.
Backtranslation may reach a saturation point in certain low-resource scenarios.
Abstract
Backtranslation (BT) is widely used in low-resource machine translation (MT) to generate additional synthetic training data from monolingual corpora. While this approach has shown strong improvements for many language pairs, its effectiveness in high-quality, low-resource settings remains unclear. In this work, we explore the effectiveness of backtranslation for English-Gujarati translation using the multilingual pretrained MBART50 model. Our baseline system, trained on a high-quality parallel corpus of approximately 50,000 sentence pairs, achieves a BLEU score of 43.8 on a validation set. We augment this data with carefully filtered backtranslated examples generated from monolingual Gujarati text. Surprisingly, adding this synthetic data does not improve translation performance and, in some cases, slightly reduces it. We evaluate our models using multiple metrics, including BLEU, ChrF++, TER,…
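The augmentation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the abstract does not say how the backtranslated examples were filtered or how much synthetic data was mixed in, so the length-ratio filter, the `max_synthetic_ratio` cap, and all function names here are hypothetical. The reverse (Gujarati-to-English) model is passed in as a plain callable so the sketch stays independent of any particular library.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (English source, Gujarati target)


def length_ratio_ok(src: str, tgt: str, max_ratio: float = 2.0) -> bool:
    """Hypothetical filter: drop empty outputs and extreme length mismatches."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


def backtranslate(
    mono_gu: Iterable[str],
    gu_to_en: Callable[[str], str],
    keep: Callable[[str, str], bool] = length_ratio_ok,
) -> List[Pair]:
    """Pair each real Gujarati sentence with a synthetic English source."""
    synthetic = []
    for gu in mono_gu:
        en = gu_to_en(gu)  # output of the reverse model (assumed callable)
        if keep(en, gu):
            synthetic.append((en, gu))
    return synthetic


def augment(
    parallel: List[Pair],
    synthetic: List[Pair],
    max_synthetic_ratio: float = 1.0,
) -> List[Pair]:
    """Append synthetic pairs, capped relative to the real parallel data."""
    limit = int(len(parallel) * max_synthetic_ratio)
    return list(parallel) + synthetic[:limit]
```

Sweeping `max_synthetic_ratio` upward from 0 and retraining at each setting is one way to look for the kind of saturation the paper reports, where validation scores stop improving, or dip, as more synthetic pairs are added.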
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Translation Studies and Practices
