AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions
Ishika Agarwal, Sofia Stoica, Emre Can Acikgoz, Pradeep Natarajan, Mahdi Namazifar, Jiaqi Ma, Dilek Hakkani-Tür

TL;DR
AcquisitionSynthesis introduces a method where acquisition functions guide language models to generate high-quality synthetic data, improving downstream task performance and robustness.
Contribution
The paper presents a novel approach using acquisition functions as reward signals to train language models for targeted data generation.
Findings
Models trained with AcquisitionSynthesis data achieve 2-7% performance gains.
AcquisitionSynthesis enhances robustness to catastrophic forgetting.
Generated data is effective across different models and resource settings.
Abstract
Data quality remains a critical bottleneck in developing capable, competitive models. Researchers have explored many ways to generate high-quality samples. Some works rely on rejection sampling: generating many synthetic samples and filtering out the low-quality ones. Other works rely on larger or closed-source models to extract model weaknesses, necessary skills, or a curriculum on which to base data generation. These works share one limitation: none offers a quantitative way to measure the impact of the generated samples on the downstream learner. The active learning literature provides exactly this, in the form of acquisition functions. Acquisition functions measure the informativeness and/or influence of data, providing interpretable, model-centric signals. Inspired by this, we propose AcquisitionSynthesis: using acquisition functions as reward models to train language…
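To make the idea of an acquisition function acting as a reward concrete, here is a minimal sketch. It assumes predictive entropy, a standard uncertainty-based acquisition function from active learning, as the scoring rule; the excerpt above does not say which acquisition functions the paper uses. The function names and the example probabilities are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a learner's predicted class distribution.

    Higher entropy means the learner is less certain about the sample,
    which uncertainty sampling treats as more informative.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def acquisition_rewards(learner_predictions: Sequence[Sequence[float]]) -> List[float]:
    """Score each generated sample by the learner's uncertainty on it.

    Assumption: each inner sequence is the downstream learner's class
    distribution for one synthetic sample. In a reward-model setup, these
    scores would be fed back to the generator as rewards.
    """
    return [predictive_entropy(p) for p in learner_predictions]


# Hypothetical learner outputs for three generated samples.
candidate_predictions = [
    [0.98, 0.01, 0.01],  # learner is already confident
    [0.50, 0.30, 0.20],  # learner is somewhat unsure
    [1 / 3, 1 / 3, 1 / 3],  # learner is maximally unsure
]

rewards = acquisition_rewards(candidate_predictions)
best_index = max(range(len(rewards)), key=rewards.__getitem__)
print(rewards, best_index)
```

Under this assumed scoring rule, the sample on which the learner is least certain receives the largest reward. An influence-based acquisition function would slot into the same place by replacing `predictive_entropy` with a different scorer.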
