Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias

Yue Yu; Yuchen Zhuang; Jieyu Zhang; Yu Meng; Alexander Ratner; Ranjay Krishna; Jiaming Shen; Chao Zhang

arXiv:2306.15895 · cs.CL · October 19, 2023 · 72 cites

TL;DR

This paper explores prompting large language models with diverse attributes to generate more varied and less biased NLP training data. Models trained on this data outperform those trained on data from simple class-conditional prompts, at a lower querying cost.

Contribution

The paper introduces attributed prompts for training data generation and shows that they outperform simple class-conditional prompts in diversity, bias reduction, and cost efficiency.

Findings

01 Attributed prompts reduce regional bias in generated data.

02 Diverse attributes improve model performance.

03 Attributed prompts match the performance of simple prompts at only 5% of the querying cost.

Abstract

Large language models (LLMs) have been recently leveraged as training data generators for various natural language processing (NLP) tasks. While previous research has explored different approaches to training models using generated data, they generally rely on simple class-conditional prompts, which may limit the diversity of the generated data and inherit systematic biases of LLM. Thus, we investigate training data generation with diversely attributed prompts (e.g., specifying attributes like length and style), which have the potential to yield diverse and attributed generated data. Our investigation focuses on datasets with high cardinality and diverse domains, wherein we demonstrate that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance. Additionally, we present a comprehensive empirical study on data generation encompassing…
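
The contrast the abstract draws between simple class-conditional prompts and diversely attributed prompts can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the attribute dimensions and values, the prompt wording, and the helper names `simple_prompt`, `attributed_prompt`, `all_configs`, and `sample_configs` are hypothetical and are not taken from the paper or its repository. The sketch only builds prompt strings and does not call any LLM.

```python
import itertools
import random

# Hypothetical attribute dimensions and values, chosen only for illustration.
ATTRIBUTES = {
    "length": ["under 50 words", "100 to 150 words"],
    "style": ["news report", "opinion piece", "blog post"],
}


def simple_prompt(class_name):
    """Class-conditional prompt: only the class label varies."""
    return f"Write a news article about {class_name}."


def attributed_prompt(class_name, config):
    """Prompt that also fixes one value for each attribute dimension."""
    constraints = "; ".join(f"{name}: {value}" for name, value in config.items())
    return f"{simple_prompt(class_name)} Follow these attributes: {constraints}."


def all_configs(attributes):
    """Enumerate every combination of attribute values."""
    names = list(attributes)
    value_lists = [attributes[name] for name in names]
    return [dict(zip(names, values)) for values in itertools.product(*value_lists)]


def sample_configs(attributes, n, seed=0):
    """Draw n attribute configurations uniformly at random, with replacement."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in attributes.items()}
        for _ in range(n)
    ]


if __name__ == "__main__":
    print(simple_prompt("sports"))
    for config in sample_configs(ATTRIBUTES, n=3):
        print(attributed_prompt("sports", config))
```

With a simple prompt, every query for a class is identical, so any variety has to come from the model's sampling alone. With attributed prompts, each query carries a different attribute configuration, which is the source of diversity the abstract refers to.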

Code & Models

Repositories

yueyu1030/attrprompt
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification