Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations

Zhuoyan Li; Hangxiao Zhu; Zhuoran Lu; Ming Yin

arXiv:2310.07849·cs.CL·October 16, 2023·6 cites

Open Access

TL;DR

This paper investigates the potential and limitations of using large language models to generate synthetic training data for text classification, focusing on how the subjectivity of the classification task affects the performance of models trained on that data.

Contribution

The paper analyzes how subjectivity influences the effectiveness of LLM-generated synthetic data for training text classification models.

Findings

01

Subjectivity, at both the task level and the instance level, is negatively associated with the performance of models trained on synthetic data.

02

The effectiveness of LLM-generated synthetic data for model training is inconsistent across classification tasks.

03

Synthetic data can be beneficial but has limitations depending on task subjectivity.

Abstract

The collection and curation of high-quality training data is crucial for developing text classification models with superior performance, but it is often associated with significant costs and time investment. Researchers have recently explored using large language models (LLMs) to generate synthetic datasets as an alternative approach. However, the effectiveness of the LLM-generated synthetic data in supporting model training is inconsistent across different classification tasks. To better understand factors that moderate the effectiveness of the LLM-generated synthetic data, in this study, we look into how the performance of models trained on these synthetic data may vary with the subjectivity of classification. Our results indicate that subjectivity, at both the task level and instance level, is negatively associated with the performance of the model trained on synthetic data. We…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods