Test-Time Scaling Makes Overtraining Compute-Optimal
Nicholas Roberts, Sungjun Cho, Zhiqi Gao, Tzu-Heng Huang, Albert Wu, Gabriel Orlanski, Avi Trost, Kelly Buchanan, Aws Albarghouthi, Frederic Sala

TL;DR
This paper introduces $T^2$ scaling laws that optimize model size, training tokens, and inference samples together, revealing that overtraining can be optimal when considering inference costs in large language models.
Contribution
The paper develops $T^2$ scaling laws that jointly optimize pretraining and test-time inference, extending pretraining scaling laws to modern test-time sampling methods.
Findings
Optimal pretraining decisions shift into the overtraining regime once inference cost is accounted for.
Overtrained pretrained models outperform models pretrained under standard (Chinchilla-style) scaling on downstream tasks.
$T^2$ scaling laws remain effective during post-training of large language models.
Abstract
Modern LLMs scale at test time, e.g. via repeated sampling, where inference cost grows with model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test ($T^2$) scaling laws that jointly optimize model size, training tokens, and number of inference samples under fixed end-to-end budgets. $T^2$ modernizes pretraining scaling laws with the pass@$k$ modeling used for test-time scaling, then jointly optimizes pretraining and test-time decisions. Forecasts from $T^2$ are robust over distinct modeling approaches: measuring the joint scaling effect on task loss and modeling the impact on task accuracy. Across eight downstream tasks, we find that when accounting for inference cost, optimal pretraining decisions shift radically into the overtraining regime, well outside the range of standard pretraining…
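To make the trade-off in the abstract concrete, here is a minimal sketch of the two quantities involved: a pass@$k$ estimate under repeated sampling and an end-to-end compute count. It is an illustration under stated assumptions, not the paper's formulation. The function names, the `6 * N * D` training-FLOPs approximation, and the `2 * N` FLOPs-per-generated-token approximation are common rules of thumb adopted here for illustration; the abstract does not state the cost model the authors use.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Commonly used unbiased pass@k estimate.

    n: samples drawn per problem, c: how many of them were correct,
    k: sample budget being evaluated (k <= n).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)


def end_to_end_flops(
    n_params: float,
    train_tokens: float,
    k_samples: int,
    tokens_per_sample: float,
    n_queries: float,
) -> float:
    """Hypothetical end-to-end budget: pretraining plus repeated-sampling inference.

    ASSUMPTION: training costs ~6*N*D FLOPs and generation costs ~2*N FLOPs
    per token. These are rules of thumb, not values taken from the paper.
    """
    train = 6.0 * n_params * train_tokens
    inference = 2.0 * n_params * tokens_per_sample * k_samples * n_queries
    return train + inference


if __name__ == "__main__":
    # Made-up numbers: 3 of 10 samples correct, evaluated at k = 1 and k = 5.
    print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
    # Inference cost scales with both model size and k, so a smaller model
    # can afford more samples per query at the same inference spend.
    print(end_to_end_flops(1e9, 2e10, 4, 100, 1000))
```

Because the inference term is linear in both model size and the number of samples, halving the parameter count frees budget for roughly twice as many samples per query under these assumptions, which is the tension the abstract describes.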
