Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice

Jiachen T. Wang; Tong Wu; Kaifeng Lyu; James Zou; Dawn Song; Ruoxi Jia; Prateek Mittal

arXiv:2512.24503 · cs.LG · April 14, 2026


TL;DR

This paper examines whether small proxy models reliably guide pretraining data curation decisions. It shows that comparing data recipes under a single fixed hyperparameter configuration can give misleading results, and it proposes a tuning method, based on reduced learning rates for proxy models, that better predicts full-scale results.

Contribution

It identifies a critical flaw in current small-scale evaluation protocols, namely applying one identical training configuration to every data recipe, and introduces a hyperparameter tuning approach that improves correlation with full-scale training outcomes.
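To make the flaw concrete, here is a minimal Python sketch of how a fixed-configuration comparison can pick a different winner depending on which learning rate is fixed. The recipe names, learning rates, and loss values are all hypothetical and are not results from the paper. The per-recipe tuning shown is a simple stand-in for illustration, not the authors' procedure.

```python
# Hypothetical proxy-model validation losses, indexed by recipe and learning rate.
# All numbers are made up for illustration; they are not results from the paper.
PROXY_LOSS = {
    "recipe_A": {1e-3: 3.20, 3e-4: 3.05, 1e-4: 3.12},
    "recipe_B": {1e-3: 3.10, 3e-4: 3.08, 1e-4: 3.15},
}


def best_recipe_fixed(losses, lr):
    """Pick the recipe with the lowest loss when every recipe uses the same lr."""
    return min(losses, key=lambda recipe: losses[recipe][lr])


def best_recipe_tuned(losses):
    """Pick the recipe with the lowest loss after choosing the best lr per recipe."""
    return min(losses, key=lambda recipe: min(losses[recipe].values()))


if __name__ == "__main__":
    for lr in (1e-3, 3e-4, 1e-4):
        print(f"fixed lr={lr:g}: winner = {best_recipe_fixed(PROXY_LOSS, lr)}")
    print(f"per-recipe tuned: winner = {best_recipe_tuned(PROXY_LOSS)}")
```

With these made-up numbers, fixing the learning rate at 1e-3 favors recipe_B, while the other two settings and the per-recipe tuned comparison favor recipe_A. The "fair" fixed-configuration protocol therefore depends on an arbitrary choice of configuration.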

Findings

01. Evaluating all data recipes under one fixed hyperparameter configuration can lead to misleading conclusions about data quality.

02. Using reduced learning rates in proxy models improves correlation with large-scale training results.

03. The proposed method is validated across 23 data recipes and four data curation dimensions.
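One common way to quantify "correlation with large-scale training results" is a rank correlation between proxy-model scores and full-scale scores across recipes. The sketch below computes Spearman's rank correlation in plain Python. The choice of Spearman is an assumption (the summary does not name the metric), and the loss values are hypothetical placeholders, not numbers from the paper.

```python
def average_ranks(values):
    """Rank values ascending from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical losses for five data recipes (illustrative only).
full_scale = [2.50, 2.55, 2.60, 2.65, 2.70]
proxy_standard_lr = [3.10, 3.00, 3.05, 3.02, 3.20]
proxy_reduced_lr = [3.00, 3.04, 3.02, 3.10, 3.12]

if __name__ == "__main__":
    print(f"standard lr: rho = {spearman(proxy_standard_lr, full_scale):.2f}")
    print(f"reduced lr:  rho = {spearman(proxy_reduced_lr, full_scale):.2f}")
```

On this toy data the reduced-learning-rate proxy ranks recipes much closer to the full-scale ordering (rho = 0.90) than the standard one (rho = 0.30). That gap is the kind of improvement the second finding describes, though the real magnitudes must be taken from the paper itself.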

Abstract

Data teams at frontier AI companies routinely train small proxy models to make critical decisions about pretraining data recipes for full-scale training runs. However, the community has a limited understanding of whether and when conclusions drawn from small-scale experiments reliably transfer to full-scale model training. In this work, we uncover a subtle yet critical issue in the standard experimental protocol for data recipe assessment: the use of identical small-scale model training configurations across all data recipes in the name of "fair" comparison. We show that the experiment conclusions about data quality can flip with even minor adjustments to training hyperparameters, as the optimal training configuration is inherently data-dependent. Moreover, this fixed-configuration protocol diverges from full-scale model development pipelines, where hyperparameter optimization is a…

