Back Translation Survey for Improving Text Augmentation
Matthew Ciolino, David Noever, Josh Kalin

TL;DR
This paper surveys back translation as a data augmentation technique in NLP, examining the effect of 108 different language back translations on various metrics and text embeddings, with the aim of expanding and diversifying training data.
Contribution
It provides a comprehensive analysis of how various back translation languages affect NLP model performance and embedding quality, offering insights for better data augmentation strategies.
Findings
Different language pairs significantly impact augmentation effectiveness
Back translation improves model generalization across multiple metrics
Certain languages yield more diverse and robust augmented data
Abstract
Natural Language Processing (NLP) relies heavily on training data. Transformers, as they have gotten bigger, have required massive amounts of training data. To satisfy this requirement, text augmentation should be looked at as a way to expand your current dataset and to generalize your models. One text augmentation we will look at is translation augmentation. We take an English sentence and translate it to another language before translating it back to English. In this paper, we look at the effect of 108 different language back translations on various metrics and text embeddings.
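The round trip described in the abstract (English to a pivot language and back to English) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `back_translate`, `augment`, and `toy_translate` are hypothetical, and the word-lookup tables only stand in for a real machine translation system, which the abstract does not name.

```python
from typing import Callable, Dict, List, Tuple

# Assumed interface: translate(text, source_lang, target_lang) -> translated text.
Translator = Callable[[str, str, str], str]


def back_translate(text: str, pivot: str, translate: Translator,
                   source: str = "en") -> str:
    """Translate text into the pivot language, then back into the source language."""
    forward = translate(text, source, pivot)
    return translate(forward, pivot, source)


def augment(sentences: List[str], pivots: List[str],
            translate: Translator) -> List[str]:
    """Collect back translations that differ from the original sentence."""
    augmented: List[str] = []
    for sentence in sentences:
        seen = {sentence}  # skip round trips that return an already-seen string
        for pivot in pivots:
            candidate = back_translate(sentence, pivot, translate)
            if candidate not in seen:
                seen.add(candidate)
                augmented.append(candidate)
    return augmented


# Toy stand-in for a translation system (illustration only, not real MT).
TOY_TABLES: Dict[Tuple[str, str], Dict[str, str]] = {
    ("en", "fr"): {"big": "grand", "house": "maison"},
    ("fr", "en"): {"grand": "large", "maison": "house"},
    ("en", "de"): {"big": "gross", "house": "haus"},
    ("de", "en"): {"gross": "big", "haus": "house"},
}


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Word-by-word lookup; unknown words pass through unchanged."""
    table = TOY_TABLES.get((src, tgt), {})
    return " ".join(table.get(word, word) for word in text.split())


if __name__ == "__main__":
    # The French round trip yields a paraphrase; the German one returns the
    # original sentence and is therefore dropped.
    print(augment(["big house"], ["fr", "de"], toy_translate))  # ['large house']
```

In practice `toy_translate` would be replaced by a call to whatever translation model or service is available. Passing the translator in as a callable keeps the round-trip logic independent of that choice and makes it easy to compare pivot languages.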
Taxonomy
Topics: Natural Language Processing Techniques
