Using Scaling Laws for Data Source Utility Estimation in Domain-Specific Pre-Training

Oleksiy Ostapenko; Charles Guille-Escuret; Luke Kumar; Max Tian; Denis Kocetkov; Gopeshh Subbaraj; Raymond Li; Joel Lamy-Poirier; Sebastien Paquet; Torsten Scholak

arXiv:2507.22250·cs.LG·July 31, 2025

TL;DR

This paper presents a framework that uses scaling laws derived from multiple training runs to estimate data source utility, enabling cost-effective domain-specific model pre-training decisions.

Contribution

The paper extends point-estimate methods (micro-annealing) to scaling-law estimation. This addresses the lack of rank invariance across compute scales and makes data source evaluation more reliable for resource allocation.

Findings

1. Scaling laws can predict data source performance across compute levels.

2. Multiple annealing runs provide more reliable data source utility estimates.

3. The approach improves domain-specific model pre-training efficiency.

Abstract

We introduce a framework for optimizing domain-specific dataset construction in foundation model training. Specifically, we seek a cost-efficient way to estimate the quality of data sources (e.g. synthetically generated or filtered web data, etc.) in order to make optimal decisions about resource allocation for data sourcing from these sources for the stage two pre-training phase, aka annealing, with the goal of specializing a generalist pre-trained model to specific domains. Our approach extends the usual point estimate approaches, aka micro-annealing, to estimating scaling laws by performing multiple annealing runs of varying compute spent on data curation and training. This addresses a key limitation in prior work, where reliance on point estimates for data scaling decisions can be misleading due to the lack of rank invariance across compute scales -- a phenomenon we confirm in our…
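
To make the rank-invariance point concrete, here is a minimal sketch of how a scaling-law fit can reverse the ranking suggested by a single small-scale point estimate. It is an illustration only, not the authors' procedure. It assumes a pure power law, loss ≈ a · compute^(−b), which the excerpt above does not specify. The two sources "A" and "B", their parameters, and the compute budgets are hypothetical, noise-free values invented for the example, and `fit_power_law`, `predict`, and `best_source` are illustrative helper names.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Fit loss ~ a * compute**(-b) by least squares in log-log space.

    Assumption: a pure power law with no irreducible-loss term.
    """
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)


def predict(params, compute):
    """Evaluate the fitted power law at a given compute budget."""
    a, b = params
    return a * compute ** (-b)


# Hypothetical annealing runs at five small compute budgets (arbitrary units).
budgets = np.array([1.0, 2.0, 4.0, 8.0, 16.0])

# Hypothetical, noise-free losses: A starts lower, B improves faster.
runs = {
    "A": 2.0 * budgets ** -0.10,
    "B": 3.0 * budgets ** -0.20,
}

fits = {name: fit_power_law(budgets, loss) for name, loss in runs.items()}


def best_source(compute):
    """Return the source with the lowest predicted loss at this budget."""
    return min(fits, key=lambda name: predict(fits[name], compute))


# Budget at which the two fitted curves intersect:
# a_A * c**(-b_A) == a_B * c**(-b_B)  =>  c = (a_B / a_A) ** (1 / (b_B - b_A))
(a_A, b_A), (a_B, b_B) = fits["A"], fits["B"]
crossover = (a_B / a_A) ** (1.0 / (b_B - b_A))

print("best at compute=1:   ", best_source(1.0))
print("best at compute=1000:", best_source(1000.0))
print("crossover budget:    ", round(crossover, 2))
```

In this toy setup a point estimate at the smallest budget favors source A, while the fitted curves favor source B beyond the crossover budget. With real, noisy runs the fit would carry uncertainty, and a different functional form may be more appropriate.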
