Making Large Language Models Better Data Creators

Dong-Ho Lee; Jay Pujara; Mohit Sewak; Ryen W. White; Sujay Kumar Jauhar

arXiv:2310.20111 · cs.CL · November 1, 2023 · 1 citation

TL;DR

This paper introduces a unified, minimal-example data creation pipeline using instruction-following LLMs, significantly reducing labeling effort and improving out-of-distribution performance for NLP models.

Contribution

It proposes a versatile data generation method requiring only one example, enhancing data efficiency and robustness across diverse NLP tasks.

Findings

01 Models trained on the generated data outperform models trained on human-labeled data by up to 17.5% on out-of-distribution tasks.

02 The method reduces data labeling costs and simplifies task adaptation.

03 Models trained on the generated data maintain comparable in-distribution performance.

Abstract

Although large language models (LLMs) have advanced the state-of-the-art in NLP significantly, deploying them for downstream applications is still challenging due to cost, responsiveness, control, or concerns around privacy and security. As such, trainable models are still the preferred option in some cases. However, these models still require human-labeled data for optimal performance, which is expensive and time-consuming to obtain. In order to address this issue, several techniques to reduce human effort involve labeling or generating data using LLMs. Although these methods are effective for certain applications, in practice they encounter difficulties in real-world scenarios. Labeling data requires careful data selection, while generating data necessitates task-specific prompt engineering. In this paper, we propose a unified data creation pipeline that requires only a single…

Code & Models

Repositories

microsoft/llm-data-creation (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems