DAGAM: Data Augmentation with Generation And Modification
Byeong-Cheol Jo, Tak-Sung Heo, Yeongjoon Park, Yongmin Yoo, Won Ik Cho, Kyungsun Kim

TL;DR
This paper introduces DAGAM, a data augmentation method that combines text generation with text modification to improve text classification with large pre-trained language models.
Contribution
It proposes a novel augmentation approach that integrates generation and modification methods, enhancing model performance on benchmark datasets.
Findings
Models trained on DAGAM-augmented data achieve higher classification accuracy than models trained on the original datasets alone.
Combining generation and modification yields better results than using either alone.
The approach reduces underfitting in large-scale language models.
Abstract
Text classification is a representative downstream task of natural language processing and has exhibited excellent performance since the advent of pre-trained language models based on the Transformer architecture. However, in pre-trained language models, underfitting often occurs because the model is very large compared to the amount of available training data. Given the importance of data collection in the modern machine learning paradigm, natural language data augmentation has been actively studied. In light of this, we introduce three data augmentation schemes that help reduce the underfitting problems of large-scale language models. First, we use a generation model for data augmentation, which we define as Data Augmentation with Generation (DAG). Next, we augment data using text modification techniques such as corruption and word order change…
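The modification step named in the abstract (corruption and word order change) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of random character deletion as the corruption operation, the `rate` parameter, and the choice to produce one shuffled and one corrupted copy per example are all assumptions made for illustration.

```python
import random


def shuffle_words(sentence, rng):
    """Return the sentence with its whitespace-separated words in random order."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)


def corrupt_characters(sentence, rng, rate=0.1):
    """Delete each non-space character with probability `rate`.

    Character deletion is an assumed corruption scheme, chosen for illustration.
    """
    kept = [ch for ch in sentence if ch.isspace() or rng.random() >= rate]
    return "".join(kept)


def augment(dataset, seed=0, rate=0.1):
    """Return each (text, label) pair followed by a shuffled and a corrupted copy.

    Labels are carried over unchanged, so the result is three times the input size.
    """
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    augmented = []
    for text, label in dataset:
        augmented.append((text, label))
        augmented.append((shuffle_words(text, rng), label))
        augmented.append((corrupt_characters(text, rng, rate), label))
    return augmented
```

Calling `augment([("the movie was surprisingly good", "pos")])` returns the original pair plus two modified copies with the same label. The generation-based scheme (DAG) would need a trained generation model and is not sketched here.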
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Linear Layer · Dropout · Absolute Position Encodings · Layer Normalization · Label Smoothing · Softmax · Adam · Residual Connection · Byte Pair Encoding
