Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities
Yaping Chai, Haoran Xie, Joe S. Qin

TL;DR
This survey reviews various data augmentation techniques for large language models, highlighting methods, challenges, and future directions to enhance model training with limited data.
Contribution
It provides a comprehensive classification and analysis of data augmentation methods for LLMs, including recent retrieval-based and hybrid approaches.
Findings
Retrieval-based techniques improve data grounding.
Post-processing refines augmented data.
Hybrid methods combine multiple augmentation strategies.
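As a rough illustration of how the three findings above might fit together, the following minimal sketch chains a toy retriever, a prompt template, and a post-processing filter. Everything here is an assumption made for illustration and is not taken from the survey: the function names (retrieve, build_prompt, postprocess, augment), the word-overlap ranking, the prompt wording, and the fake_generate stub that stands in for a real LLM call.

```python
from typing import Callable, List


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever (assumption): rank passages by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: len(q & set(p.lower().split())), reverse=True)
    return ranked[:k]


def build_prompt(seed: str, passages: List[str]) -> str:
    """Hypothetical prompt template that grounds the request in retrieved text."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nParaphrase the sentence using the context:\n{seed}"


def postprocess(candidates: List[str], seed: str) -> List[str]:
    """Drop empty outputs, copies of the seed, and case-insensitive duplicates."""
    seen = {seed.strip().lower()}
    kept = []
    for c in candidates:
        key = c.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(c.strip())
    return kept


def augment(seed: str, corpus: List[str],
            generate: Callable[[str], List[str]], k: int = 2) -> List[str]:
    """Hybrid pipeline: retrieval, then prompted generation, then filtering."""
    prompt = build_prompt(seed, retrieve(seed, corpus, k))
    return postprocess(generate(prompt), seed)


def fake_generate(prompt: str) -> List[str]:
    """Stand-in for an LLM call (assumption); returns fixed candidates."""
    return [
        "The cat rested on the mat.",
        "the cat rested on the mat.",  # duplicate differing only in case
        "",                            # empty output
        "The cat sat on the mat.",     # verbatim copy of the seed
    ]


corpus = ["Cats like warm mats.", "Stock prices fell today.", "A cat sat near the door."]
seed = "The cat sat on the mat."
augmented = augment(seed, corpus, fake_generate)
```

Passing the generator in as a callable keeps the sketch runnable without any model; in practice fake_generate would be replaced by whatever LLM client is in use, and the filter would likely be extended with task-specific quality checks.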
Abstract
Increasingly large and complex pre-trained language models have demonstrated superior performance in many applications, but they usually require large training datasets to be adequately trained. With insufficient training data, a model can overfit and fail to cope with complex tasks. Large language models (LLMs) trained on extensive corpora have strong text generation capabilities, which can improve both the quality and quantity of data and play a crucial role in data augmentation. Specifically, distinctive prompt templates are given in personalised tasks to guide LLMs in generating the required content. Recent retrieval-based techniques further improve the performance of LLMs in data augmentation by introducing external knowledge, enabling them to produce better-grounded data. This survey provides an in-depth analysis of data augmentation in…
Taxonomy
Topics: Topic Modeling · Recommender Systems and Techniques · Natural Language Processing Techniques
