A Survey on Data Augmentation for Text Classification

Markus Bayer; Marc-André Kaufhold; Christian Reuter

arXiv:2107.03158·cs.CL·September 9, 2022

TL;DR

This survey comprehensively reviews over 100 data augmentation methods for text classification, categorizing them into 12 groups, and discusses their applications, effectiveness, and future research directions.

Contribution

It provides a detailed taxonomy of existing data augmentation techniques for textual classification and highlights promising methods with state-of-the-art references.

Findings

01

Over 100 methods categorized into 12 groups

02

Identification of highly promising augmentation techniques

03

Discussion of future research directions

Abstract

Data augmentation, the artificial creation of training data for machine learning by transformations, is a widely studied research field across machine learning disciplines. While it is useful for increasing a model's generalization capabilities, it can also address many other challenges and problems, from overcoming a limited amount of training data, to regularizing the objective, to limiting the amount of data used to protect privacy. Based on a precise description of the goals and applications of data augmentation and a taxonomy for existing works, this survey is concerned with data augmentation methods for text classification and aims to provide a concise and comprehensive overview for researchers and practitioners. Derived from the taxonomy, we divide more than 100 methods into 12 different groupings and give state-of-the-art references expounding which methods are highly promising…
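To make "creation of training data by transformations" concrete, the following is a minimal sketch of two simple token-level transformations (random swap and random deletion) applied to a labeled sentence. It is an illustration only and is not taken from the survey: the function names, the whitespace tokenization, the deletion probability, and the assumption that these transformations preserve the class label are all choices made for this example.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps random pairs of positions exchanged."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p, keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(text, label, n_copies=2, seed=0):
    """Create 2 * n_copies transformed examples that reuse the original label.

    Assumption: the transformations are label-preserving, which does not
    hold for every task or every sentence.
    """
    rng = random.Random(seed)
    tokens = text.split()  # naive whitespace tokenization (assumption)
    examples = []
    for _ in range(n_copies):
        examples.append((" ".join(random_swap(tokens, 1, rng)), label))
        examples.append((" ".join(random_deletion(tokens, 0.1, rng)), label))
    return examples


if __name__ == "__main__":
    for new_text, new_label in augment("the movie was surprisingly good", "positive"):
        print(new_label, "|", new_text)
```

Each call turns one labeled example into several, which is the sense in which augmentation can help when training data is limited. Whether such simple transformations actually improve a given classifier is an empirical question.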
