Data Augmentation using Pre-trained Transformer Models
Varun Kumar, Ashutosh Choudhary, Eunah Cho

TL;DR
This paper investigates how different transformer-based pre-trained models can be used for data augmentation in NLP, demonstrating that class label prepending and Seq2Seq models improve low-resource classification performance.
Contribution
It introduces a simple method of conditioning pre-trained models with class labels for data augmentation and compares various transformer architectures across multiple benchmarks.
Findings
On three classification benchmarks, a pre-trained Seq2Seq model (BART) outperforms the other data augmentation methods in a low-resource setting.
Prepending class labels effectively conditions models for augmentation.
The pre-trained-model-based augmentation methods differ in how much data diversity they add and in how well they preserve class-label information.
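The diversity comparison in the findings above can be made concrete with a small sketch. The summary does not say which diversity metric is used, so the type-token ratio below is an assumed, commonly used proxy for lexical diversity, and `type_token_ratio` is a hypothetical helper name. Whitespace tokenization is also a simplification.

```python
from typing import Iterable


def type_token_ratio(texts: Iterable[str]) -> float:
    """Unique tokens divided by total tokens over a set of texts.

    Assumed proxy for lexical diversity; tokens are lowercased and
    split on whitespace, which is a simplification.
    """
    tokens = [tok for text in texts for tok in text.lower().split()]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Illustrative, made-up sentences (not data from the paper).
original = ["the movie was great", "the movie was bad"]
augmented = original + ["a truly wonderful film", "an awful waste of time"]

ttr_original = type_token_ratio(original)
ttr_augmented = type_token_ratio(augmented)
```

A higher ratio on the augmented set suggests the generated text introduces new vocabulary rather than repeating the original examples. It says nothing about whether the generated text still matches its label, which needs a separate check such as a classifier.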
Abstract
Language-model-based pre-trained models such as BERT have provided significant gains across different NLP tasks. In this paper, we study different types of transformer-based pre-trained models such as auto-regressive models (GPT-2), auto-encoder models (BERT), and seq2seq models (BART) for conditional data augmentation. We show that prepending the class labels to text sequences provides a simple yet effective way to condition the pre-trained models for data augmentation. Additionally, on three classification benchmarks, a pre-trained Seq2Seq model outperforms other data augmentation methods in a low-resource setting. Further, we explore how different pre-trained-model-based data augmentation methods differ in terms of data diversity, and how well such methods preserve the class-label information.
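The label-prepending idea described in the abstract can be sketched as plain string formatting. This is a minimal illustration, not the authors' implementation: the `[SEP]` and `[EOS]` marker strings, the function names, and the example sentences are all assumptions, and the step of fine-tuning a pre-trained model on the formatted strings is omitted.

```python
from typing import List, Tuple

SEP = "[SEP]"  # assumed separator between label and text
EOS = "[EOS]"  # assumed end-of-example marker


def prepend_label(label: str, text: str) -> str:
    """Format one training example as '<label> [SEP] <text> [EOS]'."""
    return f"{label} {SEP} {text} {EOS}"


def make_prompt(label: str) -> str:
    """Prefix a fine-tuned generator would continue to produce text for a class."""
    return f"{label} {SEP}"


def split_generated(sequence: str) -> Tuple[str, str]:
    """Recover (label, text) from a string in the format used above."""
    label, _, rest = sequence.partition(f" {SEP} ")
    text = rest.split(f" {EOS}", 1)[0].strip()
    return label.strip(), text


# Illustrative, made-up examples (not data from the paper).
examples: List[Tuple[str, str]] = [
    ("positive", "a warm and funny film"),
    ("negative", "dull and overlong"),
]
training_lines = [prepend_label(label, text) for label, text in examples]
```

Because the label is the first thing the model sees in every example, a model fine-tuned on `training_lines` can be steered toward a class at generation time by starting from `make_prompt("positive")`. `split_generated` then turns each generated string back into a labeled example for the augmented training set.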
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Sigmoid Activation · Tanh Activation · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Byte Pair Encoding
