Scaling Laws for Mixture Pretraining Under Data Constraints

Anastasiia Sedova; Skyler Seto; Natalie Schluter; Pierre Ablin

arXiv:2605.12715·cs.LG·May 18, 2026


TL;DR

This paper investigates how to optimally mix scarce target data with abundant generic data for language model pretraining, revealing the importance of repetition and introducing a scaling law for better mixture configuration.

Contribution

The paper introduces a repetition-aware scaling law that guides the choice of data mixture, improving pretraining efficiency under data constraints.

Findings

01 Repetition significantly boosts target-domain performance.

02 Mixture training tolerates reusing the target data 15 to 20 times.

03 The scaling law enables principled mixture configuration.

Abstract

As language models scale, the amount of data they require grows -- yet many target data sources, such as low-resource languages or specialized domains, are inherently limited in size. A common strategy is to mix this scarce but valuable target data with abundant generic data, which presents a fundamental trade-off: too little target data in the mixture underexposes the model to the target domain, while too much target data repeats the same examples excessively, yielding diminishing returns and eventual overfitting. We study this trade-off across more than 2,000 language-model training runs spanning multiple model and target dataset sizes, as well as several data types, including multilingual, domain-specific, and quality-filtered mixtures. Across all settings, we find that repetition is a central driver of target-domain performance, and that mixture training tolerates much higher…
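The trade-off described in the abstract can be made concrete with simple bookkeeping. The sketch below is not the paper's scaling law. It only computes how many times a fixed-size target set is repeated for a given mixture fraction and token budget, and the largest fraction that stays under a chosen repetition cap. The function names, the token counts, and the cap of 20 (taken loosely from the 15-20 figure in the findings) are illustrative assumptions.

```python
def target_repetitions(total_tokens: float, target_fraction: float,
                       target_size: float) -> float:
    """Average number of passes over the target set during training.

    total_tokens    -- total training budget in tokens (assumed known)
    target_fraction -- share of training tokens drawn from the target data
    target_size     -- number of unique target tokens available
    """
    if not 0.0 <= target_fraction <= 1.0:
        raise ValueError("target_fraction must lie in [0, 1]")
    if total_tokens <= 0 or target_size <= 0:
        raise ValueError("token counts must be positive")
    # Tokens sampled from the target set, divided by its unique size.
    return target_fraction * total_tokens / target_size


def max_target_fraction(total_tokens: float, target_size: float,
                        max_repetitions: float) -> float:
    """Largest mixture fraction that keeps repetitions under the given cap."""
    if total_tokens <= 0 or target_size <= 0 or max_repetitions < 0:
        raise ValueError("inputs must be positive (cap may be zero)")
    # Invert the repetition formula and clip to a valid fraction.
    return min(1.0, max_repetitions * target_size / total_tokens)


if __name__ == "__main__":
    # Hypothetical setting: 100B training tokens, 1B unique target tokens.
    budget, target = 100e9, 1e9
    print("repetitions at 10% target:", target_repetitions(budget, 0.10, target))
    print("max fraction for a cap of 20:", max_target_fraction(budget, target, 20))
```

A small fraction leaves the target set seen only a few times (underexposure), while a large one pushes the repetition count past what training tolerates. Choosing where to sit between those extremes is the question the paper's scaling law is meant to answer.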
