Back Translation Survey for Improving Text Augmentation

Matthew Ciolino; David Noever; Josh Kalin

arXiv:2102.09708·cs.CL·November 17, 2022·1 cites

TL;DR

This paper surveys the use of back translation as a data augmentation technique in NLP, analyzing the impact of 108 language pairs on model performance and text embeddings to improve training data diversity.

Contribution

It provides a comprehensive analysis of how various back translation languages affect NLP model performance and embedding quality, offering insights for better data augmentation strategies.

Findings

1. Different language pairs significantly impact augmentation effectiveness.
2. Back translation improves model generalization across multiple metrics.
3. Certain languages yield more diverse and robust augmented data.

Abstract

Natural Language Processing (NLP) relies heavily on training data. As Transformers have grown larger, they have required massive amounts of training data. To meet this requirement, text augmentation can be used to expand an existing dataset and help models generalize. One such augmentation is translation augmentation: we take an English sentence, translate it to another language, and then translate it back to English. In this paper, we examine the effect of 108 different language back translations on various metrics and text embeddings.
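The round trip described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `back_translate`, `augment`, and `toy_translate` are hypothetical, and the word-level lookup table is only a stand-in for a real machine translation system, which the paper's listing here does not specify.

```python
from typing import Callable, Iterable, List, Tuple

# A translator maps (text, source_lang, target_lang) to translated text.
# Any real MT backend could be wrapped to fit this (assumed) signature.
Translator = Callable[[str, str, str], str]


def back_translate(sentence: str, pivot: str, translate: Translator,
                   source: str = "en") -> str:
    """Translate source -> pivot, then pivot -> source."""
    forward = translate(sentence, source, pivot)
    return translate(forward, pivot, source)


def augment(sentences: Iterable[str], pivots: Iterable[str],
            translate: Translator) -> List[Tuple[str, str, str]]:
    """Return (original, pivot, paraphrase) triples.

    Round trips that reproduce the original sentence, or repeat an
    earlier paraphrase of it, are dropped because they add no new data.
    """
    pivots = list(pivots)
    augmented = []
    for sentence in sentences:
        seen = {sentence}
        for pivot in pivots:
            candidate = back_translate(sentence, pivot, translate)
            if candidate not in seen:
                seen.add(candidate)
                augmented.append((sentence, pivot, candidate))
    return augmented


# Toy stand-in for a translation model (illustrative only): word-by-word
# lookup tables, deliberately asymmetric for French so that the round
# trip yields a paraphrase.
_TOY_TABLES = {
    ("en", "fr"): {"big": "grand", "house": "maison"},
    ("fr", "en"): {"grand": "large", "maison": "house"},
    ("en", "de"): {"big": "gross", "house": "haus"},
    ("de", "en"): {"gross": "big", "haus": "house"},
}


def toy_translate(text: str, src: str, tgt: str) -> str:
    table = _TOY_TABLES[(src, tgt)]
    return " ".join(table.get(word, word) for word in text.split())


if __name__ == "__main__":
    for original, pivot, paraphrase in augment(
        ["the big house"], ["fr", "de"], toy_translate
    ):
        print(f"{pivot}: {original!r} -> {paraphrase!r}")
```

In this toy setup, the French round trip produces a paraphrase while the German round trip returns the input unchanged and is filtered out, which mirrors the idea that different pivot languages contribute different amounts of variation.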

Taxonomy

Topics: Natural Language Processing Techniques