Prescriptive Scaling Laws for Data Constrained Training
Justin Lovelace, Christian Belardi, Srivatsa Kundurthy, Shriya Sudhakar, Kilian Q. Weinberger

TL;DR
This paper introduces a new scaling law for data-constrained training that accounts for overfitting due to repetition, providing better guidance for resource allocation and model configuration.
Contribution
The paper models the excess loss caused by data repetition with an additive overfitting penalty, which yields improved compute-allocation advice and allows comparison across training setups.
Findings
Repeating training data causes overfitting, and the resulting excess loss can be modeled with an additive penalty.
Following the new law's guidance improves performance in data-limited regimes.
Strong weight decay significantly reduces overfitting, aligning with the new scaling law.
Abstract
Training compute is increasingly outpacing the availability of high-quality data. This shifts the central challenge from optimal compute allocation to extracting maximum value from limited data. The widely adopted Chinchilla scaling law assumes every training token is unique. This limits its ability to guide pretraining decisions in data-constrained regimes. We model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior. Our scaling law yields qualitatively new compute-optimal allocation advice. Beyond a point, further repetition is counterproductive and compute is better spent on model capacity. We show that following our law's recommended configuration improves performance in data-constrained regimes. Finally, because our one-parameter form isolates overfitting in a single coefficient, it enables direct…
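As a reading aid, the additive structure described in the abstract can be sketched in symbols. The first line is the standard Chinchilla parametric form for N parameters and D unique training tokens. The second line is only a schematic: the symbols U (unique tokens available) and \Omega (the overfitting penalty) are notation assumed here for illustration, and the paper's actual functional form for the penalty is not given in this summary.

```latex
% Chinchilla parametric loss, which assumes all D training tokens are unique
L_{\text{unique}}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

% Schematic of an additive overfitting penalty (assumed notation, not the paper's exact form):
% D = total tokens processed, U = unique tokens available, \Omega \ge 0
L_{\text{repeat}}(N, D, U) = L_{\text{unique}}(N, D) + \Omega(N, D, U),
\qquad \Omega(N, D, U) = 0 \ \text{when } D \le U
```

Under this reading, the "excess loss" is the gap between the loss observed with repeated data and the loss the Chinchilla form would predict for the same number of unique tokens, and the abstract's "single coefficient" would be the one fitted parameter inside \Omega.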
