Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias
Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander Ratner, Ranjay, Krishna, Jiaming Shen, Chao Zhang

TL;DR
This paper explores using diversely attributed prompts with large language models to generate more diverse and less biased training data for NLP, improving model performance efficiently compared to simple prompts.
Contribution
It introduces attributed prompts for data generation, demonstrating their superiority over simple prompts in diversity, bias reduction, and cost efficiency in NLP training.
Findings
Attributed prompts reduce regional bias in generated data.
Diverse attributes improve model performance.
Attributed prompts require only 5% of the querying cost of simple prompts.
Abstract
Large language models (LLMs) have been recently leveraged as training data generators for various natural language processing (NLP) tasks. While previous research has explored different approaches to training models using generated data, they generally rely on simple class-conditional prompts, which may limit the diversity of the generated data and inherit systematic biases of LLM. Thus, we investigate training data generation with diversely attributed prompts (e.g., specifying attributes like length and style), which have the potential to yield diverse and attributed generated data. Our investigation focuses on datasets with high cardinality and diverse domains, wherein we demonstrate that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance. Additionally, we present a comprehensive empirical study on data generation encompassing…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification
