An Information-Theoretic Criterion for Efficient Data Synthesis
Hanyu Li, Zhengqi Sun, Xiaotie Deng

TL;DR
This paper introduces an information-theoretic framework explaining when synthetic data enhances model training, emphasizing the importance of external signals and supervision granularity for effective and generalizable learning.
Contribution
It provides a theoretical account of when synthetic data helps: the generation-training loop must be information-open, and the granularity of the supervision signal governs both efficiency and generalization.
Findings
Synthetic data improves models only when the generation-training loop is information-open.
Coarser supervision signals lead to better generalization across tasks.
Learning converges to the most information-efficient signal component, which can cause reward hacking.
Abstract
Synthetic data has become crucial for large language model training, but its effectiveness is highly inconsistent. We provide an information-theoretic account of this inconsistency: synthetic data improves a model only when the generation-training loop is information-open, i.e., shaped by external signals (verifiers, environments, or rubrics) that inject task-relevant information beyond the model's current distribution. When the loop is information-closed (relying on the model's own outputs without such signals), the data processing inequality ensures that task-relevant information can only decrease, making collapse a predicted outcome. Among information-open pipelines, both efficiency and generalization hinge on the meta-level of supervision: a coarser signal such as binary correctness treats all acceptable outputs as equivalent, so the behavior it teaches is not tied to any particular…
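The data processing inequality invoked for the information-closed case can be illustrated with a toy Markov chain. The sketch below is not the paper's construction: the binary task variable, the 10% flip channel, and the helper names `mutual_information` and `push_through` are all assumptions chosen for illustration. Each round regenerates the data from the previous round's output alone, with no external signal, and the mutual information with the task variable is recorded.

```python
import math


def mutual_information(joint):
    """Mutual information in bits for a joint distribution joint[a][b]."""
    row = [sum(r) for r in joint]
    col = [sum(c) for c in zip(*joint)]
    mi = 0.0
    for a, r in enumerate(joint):
        for b, p in enumerate(r):
            if p > 0:
                mi += p * math.log2(p / (row[a] * col[b]))
    return mi


def push_through(joint, channel):
    """Given joint P(T, X) and a channel P(Y | X), return the joint P(T, Y)."""
    n_y = len(channel[0])
    return [
        [sum(row[x] * channel[x][y] for x in range(len(row))) for y in range(n_y)]
        for row in joint
    ]


# Hypothetical toy numbers: a uniform binary task variable T and a
# binary symmetric channel that flips its input 10% of the time.
prior = [0.5, 0.5]
noisy = [[0.9, 0.1], [0.1, 0.9]]

# Round 0: data X generated from T through the noisy channel.
joint = [[prior[t] * noisy[t][x] for x in range(2)] for t in range(2)]

# Closed loop: each round sees only the previous round's output, never T.
info_per_round = []
for _ in range(5):
    info_per_round.append(mutual_information(joint))
    joint = push_through(joint, noisy)

for i, bits in enumerate(info_per_round):
    print(f"round {i}: I(T; data) = {bits:.4f} bits")
```

In this toy chain the recorded mutual information shrinks every round, which is the behavior the inequality guarantees for any processing that depends only on the previous output. An information-open loop would correspond to a step that also conditions on a signal correlated with T, which this sketch deliberately omits.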
