Data Generation Using Large Language Models for Text Classification: An Empirical Case Study
Yinheng Li, Rogerio Bonatti, Sara Abdali, Justin Wagle, Kazuhito Koishida

TL;DR
This paper empirically evaluates how different factors affect the quality of synthetic data generated by Large Language Models for text classification, providing insights and recommendations for effective data generation practices.
Contribution
It offers a systematic empirical analysis of synthetic data generation for text classification using LLMs, highlighting key factors influencing data quality and suggesting best practices.
Findings
Prompt choice significantly impacts data quality
Task complexity affects synthetic data usefulness
Diverse and high-quality data improve classifier performance
Abstract
Using Large Language Models (LLMs) to generate synthetic data for model training has become increasingly popular in recent years. While LLMs are capable of producing realistic training data, the effectiveness of data generation is influenced by various factors, including the choice of prompt, task complexity, and the quality, quantity, and diversity of the generated data. In this work, we focus exclusively on using synthetic data for text classification tasks. Specifically, we use natural language understanding (NLU) models trained on synthetic data to assess the quality of synthetic data from different generation approaches. This work provides an empirical analysis of the impact of these factors and offers recommendations for better data generation practices.
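The evaluation idea in the abstract, judging synthetic data by the classifier it produces, can be sketched as follows. This is a minimal illustration and not the authors' setup: the small Naive Bayes classifier stands in for the NLU models, the toy sentences and labels are hypothetical, and scoring on a held-out set of real examples is an assumption about how the comparison would be run.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> token frequencies
    label_counts = Counter()            # label -> number of examples
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior for `text`."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # tokens never seen in training are skipped
                # Laplace (add-one) smoothing
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, examples):
    """Fraction of (text, label) pairs the model classifies correctly."""
    correct = sum(predict(model, text) == label for text, label in examples)
    return correct / len(examples)


# Hypothetical toy data: in practice the training set would come from an LLM
# and the test set from a real, human-labeled benchmark.
synthetic_train = [
    ("the movie was wonderful and fun", "positive"),
    ("a great film with a lovely story", "positive"),
    ("the movie was boring and dull", "negative"),
    ("a terrible film with a weak story", "negative"),
]
real_test = [
    ("what a wonderful story", "positive"),
    ("boring and terrible", "negative"),
]

model = train_naive_bayes(synthetic_train)
score = accuracy(model, real_test)
print(f"accuracy on real test set: {score:.2f}")
```

Repeating this with training sets produced by different prompts or generation strategies, while keeping the real test set fixed, gives one comparable score per generation approach.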
Taxonomy
Topics: Text and Document Classification Technologies
Methods: Focus
