Data Generation Using Large Language Models for Text Classification: An Empirical Case Study

Yinheng Li; Rogerio Bonatti; Sara Abdali; Justin Wagle; Kazuhito Koishida

arXiv:2407.12813·cs.CL·July 23, 2024·2 cites

TL;DR

This paper empirically evaluates how different factors affect the quality of synthetic data generated by Large Language Models for text classification, providing insights and recommendations for effective data generation practices.

Contribution

It offers a systematic empirical analysis of synthetic data generation for text classification using LLMs, highlighting key factors influencing data quality and suggesting best practices.

Findings

01. Prompt choice significantly impacts data quality

02. Task complexity affects synthetic data usefulness

03. Diverse and high-quality data improve classifier performance

Abstract

Using Large Language Models (LLMs) to generate synthetic data for model training has become increasingly popular in recent years. While LLMs are capable of producing realistic training data, the effectiveness of data generation is influenced by various factors, including the choice of prompt, task complexity, and the quality, quantity, and diversity of the generated data. In this work, we focus exclusively on using synthetic data for text classification tasks. Specifically, we use natural language understanding (NLU) models trained on synthetic data to assess the quality of synthetic data from different generation approaches. This work provides an empirical analysis of the impact of these factors and offers recommendations for better data generation practices.
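The evaluation idea in the abstract (train a classifier on synthetic data, then score it on real data as a proxy for the quality of the synthetic set) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the tiny multinomial Naive Bayes classifier stands in for the paper's NLU models, and the toy datasets, function names, and labels are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in text.lower().split():
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:  # tokens unseen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, examples):
    """Fraction of (text, label) pairs the model classifies correctly."""
    correct = sum(predict(model, text) == label for text, label in examples)
    return correct / len(examples)

# Hypothetical synthetic sets from two generation approaches (toy data).
synthetic_a = [  # more varied wording
    ("the movie was great and fun", "pos"),
    ("i loved this wonderful film", "pos"),
    ("terrible plot and awful acting", "neg"),
    ("i hated this boring film", "neg"),
]
synthetic_b = [  # repetitive, low-diversity wording
    ("great great great", "pos"),
    ("bad bad bad", "neg"),
]

# Hypothetical held-out real examples used only for evaluation.
real_test = [
    ("a wonderful and fun film", "pos"),
    ("boring and awful movie", "neg"),
    ("i loved the great movie", "pos"),
    ("i hated the terrible plot", "neg"),
]

# Score each synthetic set by the real-data accuracy of the model it trains.
acc_a = accuracy(train_naive_bayes(synthetic_a), real_test)
acc_b = accuracy(train_naive_bayes(synthetic_b), real_test)
print(f"approach A: {acc_a:.2f}, approach B: {acc_b:.2f}")
```

In this sketch the real test set is never used for training, so a gap between the two scores reflects only the difference between the synthetic training sets. On data this small the numbers are illustrative and say nothing about the paper's actual results.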

Taxonomy

Topics: Text and Document Classification Technologies
