AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions

Ishika Agarwal; Sofia Stoica; Emre Can Acikgoz; Pradeep Natarajan; Mahdi Namazifar; Jiaqi Ma; Dilek Hakkani-Tür

arXiv:2605.13149 · cs.CL · May 14, 2026

TL;DR

AcquisitionSynthesis introduces a method where acquisition functions guide language models to generate high-quality synthetic data, improving downstream task performance and robustness.

Contribution

The paper presents a novel approach using acquisition functions as reward signals to train language models for targeted data generation.

Findings

01

Models trained with AcquisitionSynthesis data achieve 2-7% performance gains.

02

AcquisitionSynthesis enhances robustness to catastrophic forgetting.

03

Generated data is effective across different models and resource settings.

Abstract

Data quality remains a critical bottleneck in developing capable, competitive models. Researchers have explored many ways to generate high-quality samples. Some works rely on rejection sampling: generating many synthetic samples and filtering out the low-quality ones. Other works rely on larger or closed-source models to extract model weaknesses, necessary skills, or a curriculum on which to base data generation. These works share one limitation: they offer no quantitative way to measure the impact of the generated samples on the downstream learner. Active learning literature provides exactly this, in the form of acquisition functions. Acquisition functions measure the informativeness and/or influence of data, providing interpretable, model-centric signals. Inspired by this, we propose AcquisitionSynthesis: using acquisition functions as reward models to train language…
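To make the idea of an acquisition function as a reward signal concrete, here is a minimal sketch using predictive entropy, a standard uncertainty-based acquisition function from active learning. It is an illustration only: the excerpt above does not say which acquisition functions the paper uses, and the names `predictive_entropy` and `rank_by_reward`, along with the toy probability vectors, are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a learner's predicted class distribution.

    Higher entropy means the learner is less certain about the sample,
    which an uncertainty-based acquisition function treats as more informative.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    entropy = 0.0
    for p in probs:
        q = p / total  # normalize in case the inputs do not sum to 1
        if q > 0:
            entropy -= q * math.log(q)
    return entropy


def rank_by_reward(
    candidates: Sequence[Tuple[str, Sequence[float]]]
) -> List[Tuple[float, str]]:
    """Score each (sample, learner_probs) pair and sort by reward, highest first."""
    scored = [(predictive_entropy(probs), text) for text, probs in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Assumption: toy learner predictions for three generated samples.
candidates = [
    ("sample A", [0.98, 0.01, 0.01]),  # learner is confident
    ("sample B", [0.34, 0.33, 0.33]),  # learner is nearly uniform
    ("sample C", [0.70, 0.20, 0.10]),
]
ranked = rank_by_reward(candidates)
for reward, text in ranked:
    print(f"{text}: reward={reward:.3f}")
```

In a setup like the one the abstract describes, a score of this kind would be computed against the downstream learner and fed back to the generator as a reward. The training loop itself is outside the scope of this sketch.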
