DAGAM: Data Augmentation with Generation And Modification

Byeong-Cheol Jo; Tak-Sung Heo; Yeongjoon Park; Yongmin Yoo; Won Ik Cho; Kyungsun Kim

arXiv:2204.02633·cs.CL·April 7, 2022

TL;DR

This paper introduces DAGAM, a data augmentation method that combines generation and modification techniques to improve text classification with large pre-trained language models.

Contribution

The paper proposes an augmentation approach that combines generation-based augmentation (DAG) with text modification techniques such as corruption and word order change, improving model performance on benchmark datasets.

Findings

01. Models trained on DAGAM-augmented data achieve higher classification accuracy than models trained on the original datasets.

02. Combining generation and modification yields better results than using either alone.

03. The approach helps reduce underfitting in large-scale language models.

Abstract

Text classification is a representative downstream task of natural language processing, and has exhibited excellent performance since the advent of pre-trained language models based on the Transformer architecture. However, in pre-trained language models, underfitting often occurs because the model is very large compared to the amount of available training data. Given the importance of data collection in the modern machine learning paradigm, natural language data augmentation has been actively studied. In light of this, we introduce three data augmentation schemes that help reduce underfitting problems of large-scale language models. First, we use a generation model for data augmentation, which is defined as Data Augmentation with Generation (DAG). Next, we augment data using text modification techniques such as corruption and word order change…
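
The modification step mentioned in the abstract can be illustrated with a minimal sketch. The abstract names corruption and word order change but is cut off before giving details, so the specifics below are assumptions rather than the authors' procedure: corruption is taken to mean dropping each word with a fixed probability, and word order change is taken to mean a random permutation of the words. The function names `corrupt_words`, `shuffle_word_order`, and `augment`, and the `drop_prob` parameter, are hypothetical.

```python
import random


def corrupt_words(sentence: str, drop_prob: float, rng: random.Random) -> str:
    """Assumed form of corruption: drop each word with probability drop_prob."""
    words = sentence.split()
    kept = [w for w in words if rng.random() >= drop_prob]
    # Never return an empty sentence; fall back to the original words.
    return " ".join(kept if kept else words)


def shuffle_word_order(sentence: str, rng: random.Random) -> str:
    """Assumed form of word order change: a random permutation of the words."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)


def augment(examples, drop_prob=0.1, seed=0):
    """Return the original (text, label) pairs plus two modified copies of each.

    Labels are copied unchanged, on the assumption that these light edits
    do not alter the class of the text.
    """
    rng = random.Random(seed)
    augmented = list(examples)
    for text, label in examples:
        augmented.append((corrupt_words(text, drop_prob, rng), label))
        augmented.append((shuffle_word_order(text, rng), label))
    return augmented


if __name__ == "__main__":
    data = [
        ("the movie was surprisingly good", 1),
        ("a dull and lifeless plot", 0),
    ]
    for text, label in augment(data, drop_prob=0.2, seed=0):
        print(label, text)
```

Under these assumptions each input example yields two extra examples with the same label, so a dataset of N examples grows to 3N. The official repository listed under Code & Models is the place to check which operations the paper actually uses.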

Code & Models

Repositories

HeoTaksung/DAGAM--Data-Augmentation-with-Generation-And-Modification (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · Dropout · Absolute Position Encodings · Layer Normalization · Label Smoothing · Softmax · Adam · Residual Connection · Byte Pair Encoding