Making Large Language Models Better Data Creators
Dong-Ho Lee, Jay Pujara, Mohit Sewak, Ryen W. White, Sujay Kumar Jauhar

TL;DR
This paper introduces a unified, minimal-example data creation pipeline using instruction-following LLMs, significantly reducing labeling effort and improving out-of-distribution performance for NLP models.
Contribution
It proposes a versatile data generation method requiring only one example, enhancing data efficiency and robustness across diverse NLP tasks.
Findings
Models trained on generated data outperform models trained on human-labeled data by up to 17.5% on out-of-distribution evaluation.
The method reduces data labeling costs and simplifies task adaptation.
Generated data maintains comparable in-distribution performance.
Abstract
Although large language models (LLMs) have advanced the state-of-the-art in NLP significantly, deploying them for downstream applications is still challenging due to cost, responsiveness, control, or concerns around privacy and security. As such, trainable models are still the preferred option in some cases. However, these models still require human-labeled data for optimal performance, which is expensive and time-consuming to obtain. In order to address this issue, several techniques to reduce human effort involve labeling or generating data using LLMs. Although these methods are effective for certain applications, in practice they encounter difficulties in real-world scenarios. Labeling data requires careful data selection, while generating data necessitates task-specific prompt engineering. In this paper, we propose a unified data creation pipeline that requires only a single…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
