TL;DR
This paper introduces a novel synthetic data generation framework inspired by deliberate practice, which enhances model scaling efficiency by focusing on informative samples, reducing data and training requirements while improving performance.
Contribution
The paper presents a new framework called Deliberate Practice for Synthetic Data Generation (DP) that improves data efficiency and scaling laws by focusing on challenging, informative synthetic samples.
Findings
DP generates fewer samples and requires fewer iterations than prior methods.
DP achieves superior performance on ImageNet datasets.
The paper shows, both theoretically and empirically, that DP improves scaling laws.
Abstract
Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP…
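The generate-then-prune baseline that the abstract contrasts DP with can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's method: the abstract does not say how informativeness is measured, so the sketch assumes it is the predictive entropy of the current classifier, and the names `predictive_entropy`, `select_informative`, and `pool` are hypothetical.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_informative(candidates, k):
    """Keep the ids of the k candidates with the highest predictive entropy.

    `candidates` is a list of (sample_id, class_probabilities) pairs.
    Entropy as the informativeness score is an assumption of this sketch.
    """
    ranked = sorted(
        candidates,
        key=lambda c: predictive_entropy(c[1]),
        reverse=True,
    )
    return [sample_id for sample_id, _ in ranked[:k]]


# Toy pool of synthetic samples with the classifier's predicted probabilities.
pool = [
    ("a", [0.98, 0.01, 0.01]),  # confident prediction, low entropy
    ("b", [0.40, 0.35, 0.25]),  # uncertain prediction
    ("c", [0.70, 0.20, 0.10]),
    ("d", [0.34, 0.33, 0.33]),  # near-uniform, highest entropy
]

selected = select_informative(pool, k=2)
print(selected)
```

Here the whole pool is generated before any selection happens, which is the cost the abstract says DP avoids by approximating the direct generation of informative samples.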