Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Bootstrapping

Pu Yang; Yunzhen Feng; Ziyuan Chen; Yuhang Wu; Zhuoyuan Li

arXiv:2501.18962 · cs.LG · October 17, 2025


TL;DR

This paper develops a theoretical framework for allocating a fixed generation-and-training budget across the iterations of synthetic data bootstrapping. It shows that increasing policies, exponential growth in particular, outperform constant policies in final model performance.

Contribution

The paper introduces a theoretical analysis of budget allocation policies in iterative bootstrapping and shows where increasing policies have an advantage over constant ones.

Findings

01

Exponential growth policies outperform constant policies in synthetic data bootstrapping.

02

Increasing policies lead to more stable and higher model performance.

03

Theoretical analysis confirms the benefits of exponential and polynomial growth strategies.

Abstract

Modern foundation models often undergo iterative "bootstrapping" in their post-training phase: a model generates synthetic data, an external verifier filters out low-quality samples, and the high-quality subset is used for further fine-tuning. Over multiple iterations, the model performance improves, raising a crucial question: How should the total budget for generation and training be allocated across iterations to maximize final performance? In this work, we develop a theoretical framework for analyzing budget allocation strategies. Specifically, we show that constant policies fail to converge with high probability, while increasing policies, particularly exponential growth policies, exhibit significant theoretical advantages. Experiments on image denoising with diffusion probabilistic models and math reasoning with large language models show that both exponential and polynomial…
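To make the three policy families named in the abstract concrete, the following minimal sketch splits a fixed total budget across iterations under a constant, polynomial, or exponential schedule. The function name `allocate_budget`, the `rate` parameter, and the normalization (weights scaled to sum to the total, with the rounding remainder given to the last iteration) are illustrative assumptions; the paper's exact parameterization is not given in this excerpt.

```python
from typing import List


def allocate_budget(total: int, iterations: int,
                    policy: str = "exponential", rate: float = 2.0) -> List[int]:
    """Split `total` budget units across `iterations` rounds.

    Hypothetical helper for illustration only; not the authors' code.
    - "constant":    every iteration gets the same share
    - "polynomial":  share of iteration t grows like (t + 1) ** rate
    - "exponential": share of iteration t grows like rate ** t
    """
    if iterations < 1 or total < 0:
        raise ValueError("need iterations >= 1 and total >= 0")

    if policy == "constant":
        weights = [1.0] * iterations
    elif policy == "polynomial":
        weights = [float(t + 1) ** rate for t in range(iterations)]
    elif policy == "exponential":
        weights = [rate ** t for t in range(iterations)]
    else:
        raise ValueError(f"unknown policy: {policy!r}")

    # Scale the weights so the shares sum to (at most) the total, rounding down.
    scale = total / sum(weights)
    budgets = [int(w * scale) for w in weights]

    # Assumption: any units lost to rounding go to the final iteration.
    budgets[-1] += total - sum(budgets)
    return budgets


if __name__ == "__main__":
    for name in ("constant", "polynomial", "exponential"):
        print(name, allocate_budget(1500, 4, name, rate=2.0))
```

With a total of 1500 units over 4 iterations and `rate=2.0`, the constant policy yields 375 per round, while the exponential policy yields 100, 200, 400, 800, so most of the budget is spent in later rounds, when the bootstrapped model is presumably stronger.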

Code & Models

Repositories

zylipku/IterativeImaging (official)

Videos

Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Bootstrapping · SlidesLive

Taxonomy

Topics: Machine Learning and Data Classification · Machine Learning in Healthcare · Neural Networks and Applications

Methods: Diffusion