Towards Active Synthetic Data Generation for Finetuning Language Models
Samuel Kessler, Menglin Xia, Daniel Madrigal Diaz, Dongge Han, Helia Heshemi, Saravan Rajmohan, Victor Ruehle, Jordan T. Ash

TL;DR
This paper demonstrates that iterative, active data generation guided by the current student model improves finetuning outcomes for language models, outperforming static synthetic data approaches.
Contribution
It introduces an active, iterative synthetic data generation method for finetuning language models, showing its effectiveness over static data generation and highlighting simple active learning criteria.
Findings
Active, iterative data generation improves model performance.
Simple selection criteria from active learning are most effective.
Results are validated across four datasets and four small language models.
Abstract
A common and effective means for improving language model capabilities involves finetuning a "student" language model's parameters on generations from a more proficient "teacher" model. Termed "synthetic data", these generations are often produced before any student finetuning, but some work has considered generating new synthetic samples as training progresses. This paper studies and advocates for the latter case, where data are generated in an iterative, closed-loop fashion that is guided by the current state of the student model. For a fixed budget of generated samples, or a budget in terms of compute spent querying a teacher, we show that this curation of finetuning data affords improved student performance over static generation. Further, while there have been several LLM-specific methods proposed that operate in this regime, we find that simple, inexpensive selection…
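The closed loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the callables `student_loss`, `teacher_generate`, and `finetune` are hypothetical placeholders, and two details are assumptions made here, namely that the top-scoring prompts are picked greedily and that the student is finetuned on all pairs collected so far. The teacher budget is `rounds * per_round` queries.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (prompt, teacher response)


def active_generation_loop(
    pool: List[str],
    student_loss: Callable[[str], float],      # hypothetical: current student's loss on a prompt
    teacher_generate: Callable[[str], str],    # hypothetical: one teacher query
    finetune: Callable[[List[Pair]], None],    # hypothetical: updates the student in place
    rounds: int,
    per_round: int,
) -> List[Pair]:
    """Each round: score prompts with the current student, query the teacher
    on the highest-loss ones, then finetune before scoring again."""
    remaining = list(pool)
    dataset: List[Pair] = []
    for _ in range(rounds):
        if not remaining:
            break
        # Rank by the *current* student's loss, hardest first.
        ranked = sorted(remaining, key=student_loss, reverse=True)
        chosen, remaining = ranked[:per_round], ranked[per_round:]
        # Teacher budget is spent only on the selected prompts.
        dataset.extend((p, teacher_generate(p)) for p in chosen)
        # Assumption: retrain on everything collected so far.
        finetune(dataset)
    return dataset


# Toy demo with made-up losses; finetuning "solves" the trained prompts.
toy_loss = {"q1": 0.9, "q2": 0.2, "q3": 0.7, "q4": 0.4}

def demo_finetune(pairs: List[Pair]) -> None:
    for prompt, _ in pairs:
        toy_loss[prompt] = 0.0

selected = active_generation_loop(
    pool=list(toy_loss),
    student_loss=lambda p: toy_loss[p],
    teacher_generate=lambda p: f"answer to {p}",
    finetune=demo_finetune,
    rounds=2,
    per_round=1,
)
print([p for p, _ in selected])
```

A static baseline under the same budget would query the teacher on `rounds * per_round` prompts chosen up front, without rescoring between rounds.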
Peer Reviews
Decision·Submitted to ICLR 2026
The conclusion that simple data selection methods, such as prioritizing hard samples with high loss, often outperform complicated and expensive LLM-as-a-judge based methods is a useful result for practitioners, suggesting that resource-intensive scoring is not always necessary. The use of learning curves and the pairwise win-rate matrix (Figure 4) provides a structured comparative analysis focused on the core concept of data efficiency. Analysis is thorough, covering four distinct reasoning da
The paper claims to provide a "benchmark study for iterative synthetic data generation" but fails to run a head-to-head comparison against the actual selection methods proposed by the most relevant prior works, specifically LLM2LLM (Lee et al., 2024) and the full LION (Jiang et al., 2023c) strategy. The critical "incorrect student answers" criterion from LLM2LLM is relegated to the appendix (C.1) despite being a highly competitive baseline in a truly active synthetic data setting. Limitations i
- The paper provides a clear and unified experimental framework for iterative synthetic data generation, grounding the idea in active learning principles.
- Empirical results are extensive: four datasets, four SLMs, multiple scoring algorithms, and comparisons to static generation.
- The conclusion that simple, low-cost criteria (e.g., high student loss) outperform expensive LLM-as-a-judge scoring is practical and well-supported.
- The ablation on selection design choices (argmax vs. sampling, u
- Conceptually, the idea of iterative, student-guided data generation is not entirely new. Prior works such as [1,2,3] (especially 1) have explored similar active distillation loops where the student model guides data selection or teacher queries. However, the present paper does not cite or discuss these connections, nor does it clarify what is fundamentally new beyond applying classic active learning heuristics in this context.
- The method, while empirically solid, lacks deeper theoretical or
• The authors provide a solid benchmark showing that simple, inexpensive heuristics (e.g., high-loss selection) can outperform more complex LLM-as-a-judge strategies.
• The work provides practical guidance for synthetic data generation under constrained compute budgets, which can be valuable for practitioners training SLMs.
• The paper is clearly written and easy to reproduce.
• It contributes to the empirical understanding of how different data-selection heuristics impact fine-tuning performanc
• Limited novelty: The core idea (iterative, student-aware synthetic data generation) has been explored in multiple prior works. This paper mainly repackages it under the active learning perspective.
• Lack of theoretical or conceptual insight: The paper does not explain why the compared heuristics differ or what properties (difficulty, diversity, informativeness) they capture.
• Marginal performance gains: Improvements are small or inconsistent. For GSM8K and ProntoQA, performance remains below o
Taxonomy
Topics: Topic Modeling · Machine Learning and Algorithms · Text Readability and Simplification
