Low Resource Text Classification with ULMFit and Backtranslation

Sam Shleifer

arXiv:1903.09244 · cs.CL · March 27, 2019 · 43 citations

TL;DR

This paper demonstrates that backtranslation significantly enhances low-resource text classification performance with ULMFit, while random token perturbations do not, and explores test-time augmentation and ensembling for further gains.

Contribution

It shows that backtranslation is an effective data augmentation technique for low-resource text classification with ULMFit, outperforming the random token perturbations of Wei and Zou [2019].

Findings

01 Backtranslation improves accuracy in low-resource settings.

02 Random token perturbations do not improve performance.

03 Ensembling and test-time augmentation yield small additional gains.
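
The third finding refers to two prediction-time techniques that both reduce to averaging class probabilities. The following is a minimal sketch of that idea only, not the paper's procedure, which this page does not describe. The names `predict_with_tta`, `ensemble_predict`, and `average_probs` are hypothetical, and `predict` stands in for any trained classifier that maps a text to a probability vector.

```python
from typing import Callable, List, Sequence

Probs = List[float]  # one probability per class


def average_probs(rows: Sequence[Probs]) -> Probs:
    """Element-wise mean of several probability vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def predict_with_tta(
    predict: Callable[[str], Probs],
    text: str,
    augmenters: Sequence[Callable[[str], str]],
) -> Probs:
    """Test-time augmentation: average the model's predictions over the
    original text and one augmented copy per augmenter."""
    variants = [text] + [augment(text) for augment in augmenters]
    return average_probs([predict(v) for v in variants])


def ensemble_predict(
    models: Sequence[Callable[[str], Probs]], text: str
) -> Probs:
    """Ensembling: average the predictions of several trained models."""
    return average_probs([model(text) for model in models])
```

In this sketch an augmenter could be a backtranslation round trip, so the same function that generates training data can also generate the test-time variants.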

Abstract

In computer vision, virtually every state-of-the-art deep learning system is trained with data augmentation. In text classification, however, data augmentation is less widely practiced because it must be performed before training and risks introducing label noise. We augment the IMDB movie reviews dataset with examples generated by two families of techniques: random token perturbations introduced by Wei and Zou [2019] and backtranslation -- translating to a second language then back to English. In low-resource environments, backtranslation generates significant improvement on top of the state-of-the-art ULMFit model. A ULMFit model pretrained on WikiText-103 and then fine-tuned on only 50 IMDB examples and 500 synthetic examples generated by backtranslation achieves 80.6% accuracy, an 8.1% improvement over the augmentation-free baseline with only 9 minutes of additional training time.…
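
As an illustration of the backtranslation step described in the abstract, the sketch below builds an augmented training set by sending each review through a pivot language and back, keeping the original label. It is a sketch under stated assumptions rather than the author's pipeline: `backtranslate`, `augment_dataset`, and `word_map` are hypothetical names, and the word-for-word tables are toy stand-ins for a real machine translation system. Using one synthetic copy per pivot language is also an assumption; the abstract does not say how its 500 synthetic examples were generated.

```python
from typing import Callable, Dict, List, Tuple

Translator = Callable[[str], str]
Example = Tuple[str, int]  # (review text, sentiment label)


def backtranslate(text: str, to_pivot: Translator, to_english: Translator) -> str:
    """Translate to a pivot language and back, yielding a paraphrase."""
    return to_english(to_pivot(text))


def augment_dataset(
    examples: List[Example],
    pivots: Dict[str, Tuple[Translator, Translator]],
) -> List[Example]:
    """Return the originals plus one backtranslated copy per pivot language.

    Each synthetic copy inherits the label of its source example, which is
    where label noise can enter if the round trip changes the meaning.
    Round trips that reproduce the source text exactly are skipped.
    """
    augmented = list(examples)
    for text, label in examples:
        for to_pivot, to_english in pivots.values():
            paraphrase = backtranslate(text, to_pivot, to_english)
            if paraphrase != text:
                augmented.append((paraphrase, label))
    return augmented


def word_map(table: Dict[str, str]) -> Translator:
    """Toy word-for-word 'translator' (assumption: stands in for real MT)."""
    return lambda s: " ".join(table.get(w, w) for w in s.split())


# Toy English -> French -> English round trip that changes one word.
en_fr = word_map({"the": "le", "was": "etait", "great": "super"})
fr_en = word_map({"le": "the", "etait": "was", "super": "superb"})

train = [("the film was great", 1)]
print(augment_dataset(train, {"fr": (en_fr, fr_en)}))
```

Because augmentation happens before training, the augmented list can be passed to any classifier's fine-tuning step unchanged.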

Code & Models

Repositories

oraby8/TextDataAug

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies

Methods: Dropout · Sigmoid Activation · Tanh Activation · Temporal Activation Regularization · DropConnect · Long Short-Term Memory · Activation Regularization · Discriminative Fine-Tuning · Embedding Dropout · Variational Dropout