Tokens-per-Parameter Coverage Is Critical for Robust LLM Scaling Law Extrapolation

Joshua Shay Kricheli; Alexander Lawrence Reid; Soumajyoti Sarkar; Venkata Gandikota; Paulo Shakarian

arXiv:2605.08541 · cs.LG · May 14, 2026

TL;DR

This paper shows that fitting neural scaling laws from runs at a fixed tokens-per-parameter (TPP) ratio makes the estimation problem ill-conditioned, which leads to unreliable extrapolations, and it proposes a remedy based on TPP diversity.

Contribution

The paper identifies the ill-conditioning caused by a collinear design ($D = kN$ for a fixed TPP ratio $k$) in scaling-law estimation and derives a TPP-diversity threshold for better-conditioned, more reliable models.
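
As an illustration of why a fixed ratio is problematic, consider the Chinchilla-style parametric form. This form and the symbols $E$, $A$, $B$, $\alpha$, $\beta$ are assumptions made for this sketch; the listing itself only says that four formalisms are covered. Substituting $D = kN$ gives:

$$
L(N, D) = E + A N^{-\alpha} + B D^{-\beta}
\quad \xrightarrow{\; D = kN \;} \quad
L(N) = E + A N^{-\alpha} + B k^{-\beta} N^{-\beta}.
$$

When $\alpha = \beta$, this collapses to $L(N) = E + \left(A + B k^{-\alpha}\right) N^{-\alpha}$, so the data constrain only the combination $A + B k^{-\alpha}$ and not $A$ and $B$ separately. When the exponents are merely close, the two terms remain nearly indistinguishable along the ray $D = kN$, which is consistent with the ill-conditioning described here.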

Findings

1. Non-collinear designs outperform collinear ones with a 97.3% win rate.

2. Ill-conditioning causes confidence intervals to inflate and model extrapolations to degrade.

3. The degeneracy is rooted in Jacobian geometry and affects any smooth estimation objective.

Abstract

Neural scaling laws approximate a language model's loss as a power-law function of parameter count $N$ and token count $D$. Following Chinchilla-style compute-optimal training, many studies fit scaling laws from runs performed under a fixed tokens-per-parameter (TPP) ratio $k$ and set $D = kN$. We show that this collinear design, combined with the empirically common near-equality of the exponents governing $N$ and $D$, induces an inherent ill-conditioning in the Gauss-Newton least-squares problem: the condition number of the design grows as the inverse square of the gap between the $N$- and $D$-exponents. The scale coefficients become practically unidentifiable, with confidence intervals inflating by an order of magnitude or more, yielding a "sloppy" model whose extrapolations degrade sharply off the training ray. We prove this for four scaling-law formalisms and derive a closed-form…
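
The conditioning effect described in the abstract can be checked numerically with a minimal sketch. It assumes the Chinchilla-style form $L = E + A N^{-\alpha} + B D^{-\beta}$ with the exponents held fixed, so that the Jacobian with respect to the scale coefficients $(A, B)$ has columns $N^{-\alpha}$ and $D^{-\beta}$. The function names, model sizes, exponent values, and TPP grid below are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np


def scale_jacobian(N, D, alpha, beta):
    """Jacobian of L = E + A*N**-alpha + B*D**-beta with respect to (A, B).

    Columns are normalized to unit length so the condition number reflects
    the angle between the columns rather than their raw magnitudes.
    """
    J = np.column_stack([N ** -alpha, D ** -beta])
    return J / np.linalg.norm(J, axis=0)


def gram_condition(N, D, alpha, beta):
    """Condition number of the Gauss-Newton matrix J^T J for (A, B)."""
    J = scale_jacobian(N, D, alpha, beta)
    return np.linalg.cond(J.T @ J)


# Hypothetical model sizes: 8 runs from 1e7 to 1e9 parameters.
N = np.logspace(7, 9, 8)

# Collinear design: every run uses the same TPP ratio k, so D = k * N.
k = 20.0
D_collinear = k * N

# TPP-diverse design: the ratio varies across runs (illustrative values).
tpp = np.tile([5.0, 20.0, 80.0, 320.0], 2)
D_diverse = tpp * N

# Illustrative, nearly equal exponents.
alpha, beta = 0.34, 0.28
cond_collinear = gram_condition(N, D_collinear, alpha, beta)
cond_diverse = gram_condition(N, D_diverse, alpha, beta)
print(f"collinear design:   cond(J^T J) = {cond_collinear:.3g}")
print(f"TPP-diverse design: cond(J^T J) = {cond_diverse:.3g}")

# Shrinking the exponent gap on the collinear ray worsens the conditioning.
gaps = [0.08, 0.04, 0.02]
conds_by_gap = [
    gram_condition(N, D_collinear, 0.30 + g / 2, 0.30 - g / 2) for g in gaps
]
for g, c in zip(gaps, conds_by_gap):
    print(f"gap = {g:.2f}: cond(J^T J) = {c:.3g}")
```

Normalizing the columns is a design choice of this sketch: it separates the geometric near-collinearity from plain differences in scale between the two columns. The full problem in the paper also involves the exponents and the irreducible-loss term, which this two-column illustration leaves out.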
