Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations
Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, Ming Yin

TL;DR
This paper investigates the potential and limitations of using large language models to generate synthetic data for text classification, focusing on how subjectivity affects model performance.
Contribution
It provides an analysis of how subjectivity influences the effectiveness of LLM-generated synthetic data in text classification tasks.
Findings
Subjectivity negatively impacts model performance with synthetic data.
Model performance varies across different classification tasks.
Synthetic data can be beneficial but has limitations depending on task subjectivity.
Abstract
The collection and curation of high-quality training data is crucial for developing text classification models with superior performance, but it is often associated with significant costs and time investment. Researchers have recently explored using large language models (LLMs) to generate synthetic datasets as an alternative approach. However, the effectiveness of the LLM-generated synthetic data in supporting model training is inconsistent across different classification tasks. To better understand factors that moderate the effectiveness of the LLM-generated synthetic data, in this study, we look into how the performance of models trained on these synthetic data may vary with the subjectivity of classification. Our results indicate that subjectivity, at both the task level and instance level, is negatively associated with the performance of the model trained on synthetic data. We…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
