Scaling Laws for Mixture Pretraining Under Data Constraints
Anastasiia Sedova, Skyler Seto, Natalie Schluter, Pierre Ablin

TL;DR
This paper investigates how to optimally mix scarce target data with abundant generic data for language model pretraining, revealing the importance of repetition and introducing a scaling law for better mixture configuration.
Contribution
It introduces a repetition-aware scaling law that guides optimal data mixture strategies, improving pretraining efficiency under data constraints.
Findings
Repetition significantly boosts target-domain performance.
Mixture training tolerates reusing the target data 15-20 times.
The scaling law enables principled mixture configuration.
Abstract
As language models scale, the amount of data they require grows -- yet many target data sources, such as low-resource languages or specialized domains, are inherently limited in size. A common strategy is to mix this scarce but valuable target data with abundant generic data, which presents a fundamental trade-off: too little target data in the mixture underexposes the model to the target domain, while too much target data repeats the same examples excessively, yielding diminishing returns and eventual overfitting. We study this trade-off across more than 2,000 language-model training runs spanning multiple model and target dataset sizes, as well as several data types, including multilingual, domain-specific, and quality-filtered mixtures. Across all settings, we find that repetition is a central driver of target-domain performance, and that mixture training tolerates much higher…
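The trade-off described in the abstract comes down to simple bookkeeping: for a fixed training budget, the share of the mixture given to the target data determines how many times that data is repeated. The sketch below is only an illustration of that arithmetic, not the paper's scaling law. The function names, the token counts, and the idea of using a repetition count as a hard cap are all assumptions made for this example.

```python
def target_repetitions(mixture_weight, total_tokens, target_tokens):
    """Number of passes over the target dataset implied by a mixture.

    mixture_weight: fraction of training tokens drawn from the target data
    total_tokens:   total training budget, in tokens
    target_tokens:  number of unique tokens in the target dataset
    (All three names are hypothetical, chosen for this illustration.)
    """
    if not 0.0 <= mixture_weight <= 1.0:
        raise ValueError("mixture_weight must lie in [0, 1]")
    if target_tokens <= 0 or total_tokens <= 0:
        raise ValueError("token counts must be positive")
    return mixture_weight * total_tokens / target_tokens


def max_weight_for_repetitions(max_repetitions, total_tokens, target_tokens):
    """Largest target fraction that keeps repetitions at or below a cap.

    Treating a repetition count as a hard cap is an assumption of this
    sketch; the fraction is clipped to 1.0 because a mixture weight
    cannot exceed the whole mixture.
    """
    if target_tokens <= 0 or total_tokens <= 0:
        raise ValueError("token counts must be positive")
    return min(1.0, max_repetitions * target_tokens / total_tokens)


if __name__ == "__main__":
    # Hypothetical budget: 100B training tokens, 1B unique target tokens.
    total, target = 100e9, 1e9
    for weight in (0.01, 0.05, 0.20):
        reps = target_repetitions(weight, total, target)
        print(f"target weight {weight:.2f} -> about {reps:.1f} passes over target data")
    cap = max_weight_for_repetitions(20, total, target)
    print(f"weight that reaches 20 passes: {cap:.2f}")
```

A small target weight leaves the target data seen only about once, while a large one repeats it many times; the findings above suggest the useful range extends further into repetition than one might expect.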
