Tokens-per-Parameter Coverage Is Critical for Robust LLM Scaling Law Extrapolation
Joshua Shay Kricheli, Alexander Lawrence Reid, Soumajyoti Sarkar, Venkata Gandikota, Paulo Shakarian

TL;DR
Fitting neural scaling laws only from runs at a single fixed tokens-per-parameter (TPP) ratio makes the estimation problem ill-conditioned and the resulting extrapolations unreliable. The paper proposes varying TPP across training runs as the remedy.
Contribution
The paper identifies the ill-conditioning that a collinear (fixed-TPP) design induces in scaling-law estimation and derives a TPP-diversity threshold above which the fit is better conditioned and more reliable.
Findings
Non-collinear designs outperform collinear ones with a 97.3% win rate.
Ill-conditioning causes confidence intervals to inflate and model extrapolations to degrade.
The degeneracy is rooted in Jacobian geometry and affects any smooth estimation objective.
Abstract
Neural scaling laws approximate a language model's loss as a power-law function of parameter count N and token count D. Following Chinchilla-style compute-optimal training, many studies fit scaling laws from runs performed under a fixed tokens-per-parameter (TPP) ratio, so that D is a fixed multiple of N. We show that this collinear design, combined with the empirically common near-equality of the exponents governing N and D, induces an inherent ill-conditioning in the Gauss-Newton least-squares problem: the condition number of the design grows as the inverse square of the gap between the N- and D-exponents. The scale coefficients become practically unidentifiable, with confidence intervals inflating by an order of magnitude or more, yielding a "sloppy" model whose extrapolations degrade sharply off the training ray. We prove this for four scaling-law formalisms and derive a closed-form…
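The ill-conditioning described in the abstract can be illustrated numerically. The sketch below is not the paper's code. It assumes a Chinchilla-style form L(N, D) = E + A * N^(-alpha) + B * D^(-beta), holds the exponents fixed, and looks only at the Jacobian with respect to (E, A, B). The exponent values, the range of N, the TPP value of 20, and the diverse TPP schedule are all illustrative assumptions. Under a fixed ratio the A and B columns are nearly proportional when alpha is close to beta, so the condition number is large; varying TPP across runs breaks that proportionality.

```python
import numpy as np

def jacobian(N, D, alpha, beta):
    """Jacobian of the assumed form L = E + A*N**-alpha + B*D**-beta
    with respect to (E, A, B), with the exponents held fixed."""
    return np.column_stack([np.ones_like(N), N ** -alpha, D ** -beta])

def scaled_cond(J):
    """2-norm condition number after scaling each column to unit length,
    so that differences in column magnitude do not dominate the result."""
    return np.linalg.cond(J / np.linalg.norm(J, axis=0))

# Illustrative (assumed) model sizes: 8 runs from 1e7 to 1e9 parameters.
N = np.logspace(7, 9, 8)

def collinear_cond(gap, alpha=0.34, tpp=20.0):
    """Condition number under a fixed-TPP design, D = tpp * N,
    with beta = alpha - gap."""
    return scaled_cond(jacobian(N, tpp * N, alpha, alpha - gap))

def diverse_cond(gap, alpha=0.34):
    """Condition number when TPP varies from run to run (assumed schedule)."""
    tpp = np.array([5, 80, 10, 160, 20, 320, 40, 640], dtype=float)
    return scaled_cond(jacobian(N, tpp * N, alpha, alpha - gap))

if __name__ == "__main__":
    for gap in (0.06, 0.02, 0.006):
        print(f"gap={gap:<6} collinear={collinear_cond(gap):.3e} "
              f"diverse={diverse_cond(gap):.3e}")
```

Shrinking `gap` should make the collinear condition number grow while the diverse design stays comparatively well conditioned. The sketch only shows the direction of the effect and does not attempt to reproduce the paper's closed-form rate or threshold.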
