How Effective is Task-Agnostic Data Augmentation for Pretrained Transformers?

Shayne Longpre; Yu Wang; Christopher DuBois

arXiv:2010.01764 · cs.LG · October 6, 2020

TL;DR

This study systematically evaluates task-agnostic data augmentation methods on pretrained transformers across multiple NLP classification tasks, revealing limited or no consistent performance gains, especially with large pretrained models.

Contribution

It provides a comprehensive empirical analysis of the effectiveness of common data augmentation techniques on pretrained transformers in NLP.

Findings

01

Data augmentation techniques do not consistently improve pretrained transformer performance.

02

Pretrained models such as BERT, XLNet, and RoBERTa show limited gains from augmentation.

03

Effectiveness of augmentation is context-dependent and often minimal.

Abstract

Task-agnostic forms of data augmentation have proven widely effective in computer vision, even on pretrained models. In NLP, similar results are reported most commonly for low data regimes, non-pretrained models, or situationally for pretrained models. In this paper we ask how effective these techniques really are when applied to pretrained transformers. Using two popular varieties of task-agnostic data augmentation (not tailored to any particular task), Easy Data Augmentation (Wei and Zou, 2019) and Back-Translation (Sennrich et al., 2015), we conduct a systematic examination of their effects across 5 classification tasks, 6 datasets, and 3 variants of modern pretrained transformers, including BERT, XLNet, and RoBERTa. We observe a negative result, finding that techniques which previously reported strong improvements for non-pretrained models fail to consistently improve performance for…
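To make "task-agnostic" concrete, here is a minimal sketch of two token-level perturbations in the spirit of Easy Data Augmentation: random swap and random deletion. It is an illustration under stated assumptions, not the authors' implementation. The function names, the whitespace tokenization, and the default parameters are all hypothetical, and EDA's thesaurus-based operations (synonym replacement, random insertion) are left out because they need an external lexical resource.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, repeated n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p; never return an empty list."""
    tokens = list(tokens)
    if len(tokens) <= 1:
        return tokens
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(sentence, n_aug=2, p_delete=0.1, n_swaps=1, seed=0):
    """Return n_aug perturbed copies of a sentence (hypothetical helper).

    Assumes whitespace tokenization and alternates between the two
    operations; the label of the original example would be reused as-is.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    augmented = []
    for k in range(n_aug):
        if k % 2 == 0:
            new_tokens = random_swap(tokens, n_swaps, rng)
        else:
            new_tokens = random_deletion(tokens, p_delete, rng)
        augmented.append(" ".join(new_tokens))
    return augmented
```

Neither operation looks at the label or the task, which is what makes this kind of augmentation task-agnostic: the same functions could be applied to any text classification training set before fine-tuning.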

Taxonomy

Methods: Linear Layer · Dense Connections · Layer Normalization · Byte Pair Encoding · WordPiece · Multi-Head Attention · Dropout · Linear Warmup With Linear Decay · SentencePiece · Attention Dropout