Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

Yung-Chieh Chan; George Pu; Apaar Shanker; Parth Suresh; Penn Jenks; John Heyer; Sam Denton

arXiv:2409.19759·cs.CL·October 31, 2024

Open Access

TL;DR

This paper evaluates various synthetic data generation strategies for fine-tuning large language models, highlighting how their effectiveness varies with resource constraints and task specifics, and providing a practical framework for strategy selection.

Contribution

It introduces a systematic comparison of synthetic data methods and offers a practical framework for choosing optimal strategies based on resource and task considerations.

Findings

01

Answer augmentation is most effective when the query budget is low relative to the seed instruction set size.

02

Generating new questions becomes optimal as the query budget increases.

03

Strategy choice impacts model performance more in low to mid data regimes.

Abstract

As large language models (LLMs) are applied to more use cases, creating high-quality, task-specific datasets for fine-tuning becomes a bottleneck for model improvement. Using high-quality human data has been the most common approach to unlock model performance, but is prohibitively expensive in many scenarios. Several alternative methods have also emerged, such as generating synthetic or hybrid data, but the effectiveness of these approaches remains unclear, especially in resource-constrained scenarios and tasks that are not easily verified. To investigate this, we group various synthetic data generation strategies into three representative categories -- Answer Augmentation, Question Rephrase and New Question -- and study the performance of student LLMs trained under various constraints, namely seed instruction set size and query budget. We demonstrate that these strategies are not…
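The three categories named in the abstract can be sketched as data-generation loops that draw on a shared teacher query budget. This is a minimal illustration, not the authors' implementation: the `BudgetedTeacher` wrapper, the prompt templates, and the per-sample query costs (one teacher call for Answer Augmentation, two for Question Rephrase and New Question) are all assumptions made here for illustration, and the teacher is any hypothetical callable that maps a prompt string to a completion string.

```python
from itertools import cycle
from typing import Callable, List, Tuple

Teacher = Callable[[str], str]   # hypothetical teacher: prompt in, completion out
Pair = Tuple[str, str]           # (question, answer) training sample


class BudgetedTeacher:
    """Wraps a teacher callable and counts queries against a fixed budget."""

    def __init__(self, teacher: Teacher, budget: int) -> None:
        self.teacher = teacher
        self.budget = budget
        self.used = 0

    def remaining(self) -> int:
        return self.budget - self.used

    def __call__(self, prompt: str) -> str:
        if self.used >= self.budget:
            raise RuntimeError("query budget exhausted")
        self.used += 1
        return self.teacher(prompt)


def answer_augmentation(seeds: List[str], teacher: BudgetedTeacher) -> List[Pair]:
    """New answers to existing seed questions (assumed cost: 1 query per sample)."""
    out: List[Pair] = []
    for q in cycle(seeds):
        if teacher.remaining() < 1:
            break
        out.append((q, teacher(f"Answer step by step: {q}")))
    return out


def _question_then_answer(seeds: List[str], teacher: BudgetedTeacher,
                          template: str) -> List[Pair]:
    """Derive a question from a seed, then answer it (assumed cost: 2 queries)."""
    out: List[Pair] = []
    for q in cycle(seeds):
        if teacher.remaining() < 2:
            break
        derived = teacher(template.format(q=q))
        out.append((derived, teacher(f"Answer step by step: {derived}")))
    return out


def question_rephrase(seeds: List[str], teacher: BudgetedTeacher) -> List[Pair]:
    # Illustrative prompt wording; not taken from the paper.
    return _question_then_answer(
        seeds, teacher, "Rephrase this question, keeping its meaning: {q}")


def new_question(seeds: List[str], teacher: BudgetedTeacher) -> List[Pair]:
    # Illustrative prompt wording; not taken from the paper.
    return _question_then_answer(
        seeds, teacher, "Write a new question in the style of: {q}")
```

Under these assumed costs, a fixed budget yields twice as many samples from Answer Augmentation as from the two question-generating strategies, while only the latter add questions beyond the seed set. That is one way to picture the trade-off between seed instruction set size and query budget that the paper studies.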

Taxonomy

Topics: Distributed and Parallel Computing Systems · Scientific Computing and Data Management · Simulation Techniques and Applications

Methods: Sparse Evolutionary Training