Data Scaling as Progressive Coverage of a Predictive Contribution Spectrum
Zihui Song, Shihao Ji, Hongxi Li, Shuaizhi Cheng, Chunlin Huang

TL;DR
This paper proposes that data scaling laws in language models are governed by progressive coverage of a latent predictive contribution spectrum, and supports the claim with empirical analysis across 12 real corpora.
Contribution
It introduces a spectrum-based framework for understanding data scaling laws, linking the tail behavior of the spectrum to how excess loss falls as training size grows.
Findings
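The tail slope of the spectrum is strongly correlated with the empirical data-scaling exponent of a fixed small GPT learner.
The log of the effective truncation rank K(N) is close to linear in log N, where N is the training size.
The residual tail mass of the spectrum tracks the remaining excess loss.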
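One way to check a near-linear relationship between log K and log N, as stated in the findings above, is an ordinary least-squares fit in log-log space. The sketch below is illustrative only: the function name `loglog_fit` and the synthetic (N, K) pairs are hypothetical, and it fits a single series rather than the pooled fit across corpora that the paper reports.

```python
import math

def loglog_fit(ns, ks):
    """Least-squares fit of log K on log N; returns (slope, intercept, r2)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(k) for k in ks]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Synthetic, hypothetical data (not from the paper): K = 2 * N**0.5 exactly.
ns = [1e3, 1e4, 1e5, 1e6]
ks = [2.0 * n ** 0.5 for n in ns]
slope, intercept, r2 = loglog_fit(ns, ks)
```

On real measurements the slope estimates the exponent relating K to N, and R^2 indicates how well a single power law describes the series.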
Abstract
We investigate the hypothesis that real-data scaling laws are governed by progressive coverage of a latent predictive contribution spectrum rather than by token-frequency tails alone. We work with a suffix-automaton representation of text corpora and define a data-intrinsic global-KL predictive contribution spectrum, in which each state contributes according to its empirical mass times its KL deviation from a global next-token baseline. Across 12 real corpora, the tail slope of this spectrum is already strongly correlated with the empirical data-scaling exponent of a fixed small GPT learner. We then go beyond slope correlation and define, for each training size N, an effective truncation rank K(N) by matching the observed excess loss to the residual tail mass of the prepared 1000k global-KL spectrum. Empirically, log K is close to linear in log N, with pooled R^2 about 0.96 for the raw…
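The two constructions named in the abstract can be sketched as follows, under stated assumptions. The suffix automaton is replaced here by a plain dictionary of states with next-token counts, which is a stand-in and not the authors' representation. The names `contribution_spectrum` and `truncation_rank`, the rule "smallest K whose residual tail mass does not exceed the excess loss", and the toy counts are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

def contribution_spectrum(state_counts):
    """Per-state contributions mass * KL(state || global), sorted descending.

    state_counts maps a state id to a Counter of next-token counts
    (an assumed stand-in for suffix-automaton states).
    """
    global_counts = Counter()
    for counts in state_counts.values():
        global_counts.update(counts)
    total = sum(global_counts.values())
    global_p = {tok: c / total for tok, c in global_counts.items()}

    spectrum = []
    for counts in state_counts.values():
        n = sum(counts.values())
        mass = n / total  # empirical mass of the state
        # KL divergence of the state's next-token distribution from the global baseline
        kl = sum((c / n) * math.log((c / n) / global_p[tok])
                 for tok, c in counts.items())
        spectrum.append(mass * kl)
    return sorted(spectrum, reverse=True)

def truncation_rank(spectrum, excess_loss):
    """Smallest K such that sum(spectrum[K:]) <= excess_loss (assumed matching rule)."""
    tail = sum(spectrum)
    for k, contribution in enumerate(spectrum):
        if tail <= excess_loss:
            return k
        tail -= contribution
    return len(spectrum)

# Toy, hypothetical counts: two informative states and one that equals the baseline.
toy = {
    "a": Counter(x=3, y=1),
    "b": Counter(x=1, y=3),
    "c": Counter(x=2, y=2),
}
spec = contribution_spectrum(toy)
```

In this sketch a large excess loss maps to a small K (few states covered) and a small excess loss maps to a large K, which is the direction of the relationship the abstract describes as training size N grows.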
