Transplant Then Regenerate: A New Paradigm for Text Data Augmentation
Guangzhan Wang, Hongyu Zhang, Beijun Shen, Xiaodong Gu

TL;DR
This paper introduces LMTransplant, a novel text data augmentation method using large language models that enhances diversity and content creativity while maintaining original attributes, outperforming existing methods.
Contribution
The paper proposes LMTransplant, a new paradigm that leverages LLMs for more diverse and attribute-preserving text augmentation through a transplant-then-regenerate approach.
Findings
LMTransplant outperforms existing augmentation methods.
It scales effectively with larger augmented datasets.
It demonstrates superior performance across various text tasks.
Abstract
Data augmentation is a critical technique in deep learning. Traditional methods like Back-translation typically focus on lexical-level rephrasing, which primarily produces variations with the same semantics. While large language models (LLMs) have enhanced text augmentation through their "knowledge emergence" capability, controlling the style and structure of these outputs remains challenging and requires meticulous prompt engineering. In this paper, we propose LMTransplant, a novel text augmentation paradigm leveraging LLMs. The core idea of LMTransplant is transplant-then-regenerate: incorporating seed text into a context expanded by the LLM, and asking the LLM to regenerate a variant based on the expanded context. This strategy allows the model to create more diverse and creative content-level variants by fully leveraging the knowledge embedded in LLMs, while preserving the core attributes of…
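The transplant-then-regenerate idea described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `llm` callable (prompt in, text out), the function names, the prompt wording, and the choice to expand the context as one passage before and one passage after the seed are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function that maps a prompt string to generated text.
LLM = Callable[[str], str]


def transplant(seed: str, llm: LLM) -> Tuple[str, str]:
    """Expand a context around the seed text (assumed here: one passage before, one after)."""
    before = llm(f"Write a passage that could plausibly come right before this text:\n{seed}")
    after = llm(f"Write a passage that could plausibly come right after this text:\n{seed}")
    return before, after


def regenerate(before: str, after: str, seed: str, llm: LLM) -> str:
    """Ask the LLM for a new text that fits the expanded context in place of the seed."""
    prompt = (
        "Here is a context with a gap in the middle.\n"
        f"Before: {before}\n"
        f"After: {after}\n"
        "Write a new text that fits the gap and keeps the same style and "
        f"attributes as this original text:\n{seed}"
    )
    return llm(prompt)


def augment(seed: str, llm: LLM, n: int = 1) -> List[str]:
    """Produce n variants of the seed, expanding a fresh context for each one."""
    variants = []
    for _ in range(n):
        before, after = transplant(seed, llm)
        variants.append(regenerate(before, after, seed, llm))
    return variants
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client; in practice `llm` would wrap whichever API is in use.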
Taxonomy
Topics: Natural Language Processing Techniques · Service-Oriented Architecture and Web Services
