Improving the Scaling Laws of Synthetic Data with Deliberate Practice

Reyhane Askari-Hemmat; Mohammad Pezeshki; Elvis Dohmatob; Florian Bordes; Pietro Astolfi; Melissa Hall; Jakob Verbeek; Michal Drozdzal; Adriana Romero-Soriano

arXiv:2502.15588·cs.LG·February 24, 2025

TL;DR

This paper introduces a synthetic data generation framework inspired by deliberate practice in human learning. By concentrating generation on informative samples, it improves how performance scales with synthetic data while requiring fewer training samples and iterations.

Contribution

The paper presents a new framework called Deliberate Practice for Synthetic Data Generation (DP) that improves data efficiency and scaling laws by focusing on challenging, informative synthetic samples.

Findings

01  DP generates fewer samples and requires fewer training iterations than prior methods.

02  DP achieves superior performance on ImageNet datasets.

03  The paper shows, both theoretically and empirically, that DP improves scaling laws.

Abstract

Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP…
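The generate-then-prune baseline that the abstract contrasts DP against can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes informativeness is scored by the entropy of the current classifier's predicted class distribution, which the abstract above does not state, and the names `predictive_entropy`, `prune_to_most_informative`, and the toy `pool` are hypothetical.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def prune_to_most_informative(pool_probs, keep):
    """Return the indices of the `keep` samples with the highest entropy.

    pool_probs: list of per-sample class-probability lists, as a classifier
    might produce on a pool of synthetic samples (hypothetical input).
    Entropy as the informativeness score is an assumption of this sketch.
    """
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: predictive_entropy(pool_probs[i]),
        reverse=True,
    )
    # Return the kept indices in their original pool order.
    return sorted(ranked[:keep])


# Toy pool: four synthetic samples scored by a 3-class classifier.
pool = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # nearly uniform prediction, high entropy
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
]
selected = prune_to_most_informative(pool, keep=2)
print(selected)
```

In this baseline every sample in the pool is generated before any is discarded, so the generation cost grows with the pool size rather than with the number of samples kept. According to the abstract, DP instead approximates generating the informative samples directly.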
