Deriving Neural Scaling Laws from the statistics of natural language
Francesco Cagnetta, Allan Raventós, Surya Ganguli, Matthieu Wyart

TL;DR
This paper develops a theoretical framework that predicts neural scaling laws for large language models based on natural language statistics, matching empirical results without free parameters.
Contribution
It introduces the first theory to quantitatively predict data-limited neural scaling exponents from language statistics, without relying on synthetic data models.
Findings
The theory accurately predicts scaling exponents for GPT-2 and LLaMA models.
Language statistics like token correlation decay and conditional entropy are key predictors.
Predictions match experimental results on TinyStories and WikiText datasets.
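To make the "token correlation decay" finding concrete, here is a minimal sketch of how such a statistic could be estimated from a tokenized corpus. The summary does not give the paper's exact definition, so the measure used here (the Frobenius norm of the empirical joint distribution minus the product of its marginals, at a fixed separation) and the name `token_correlation` are assumptions chosen for illustration.

```python
import numpy as np


def token_correlation(tokens, vocab_size, lag):
    """Strength of dependence between tokens that are `lag` positions apart.

    Illustrative assumption, not necessarily the paper's estimator:
    returns || P(x_i, x_{i+lag}) - P(x_i) P(x_{i+lag}) ||_F,
    which is zero when the two positions are empirically independent.
    """
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    tokens = np.asarray(tokens)
    first, second = tokens[:-lag], tokens[lag:]

    # Empirical joint distribution over (token at i, token at i + lag).
    joint = np.zeros((vocab_size, vocab_size))
    np.add.at(joint, (first, second), 1.0)
    joint /= joint.sum()

    # Marginals of the joint, and the independent (product) baseline.
    p_first = joint.sum(axis=1)
    p_second = joint.sum(axis=0)
    return float(np.linalg.norm(joint - np.outer(p_first, p_second)))
```

Evaluating this function over a range of `lag` values on a large token stream and plotting the result on log-log axes would give one view of how quickly correlations decay with separation. For large vocabularies the dense `vocab_size × vocab_size` matrix becomes expensive, so a sparse count structure may be preferable in practice.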
Abstract
Despite the fact that experimental neural scaling laws have substantially guided empirical progress in large-scale machine learning, no existing theory can quantitatively predict the exponents of these important laws for any modern LLM trained on any natural language dataset. We provide the first such theory in the case of data-limited scaling laws. We isolate two key statistical properties of language that alone can predict neural scaling exponents: (i) the decay of pairwise token correlations with time separation between token pairs, and (ii) the decay of the next-token conditional entropy with the length of the conditioning context. We further derive a simple formula in terms of these statistics that predicts data-limited neural scaling exponents from first principles without any free parameters or synthetic data models. Our theory exhibits a remarkable match with experimentally…
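The second statistic in the abstract, the next-token conditional entropy as a function of context length, can be illustrated with a simple plug-in estimate from n-gram counts. This is only a sketch under stated assumptions: the function name `conditional_entropy` is hypothetical, and the abstract does not say how the authors estimate this quantity.

```python
import math
from collections import Counter


def conditional_entropy(tokens, context_len):
    """Plug-in estimate of H(next token | previous `context_len` tokens), in nats.

    Illustrative assumption: a plain count-based estimator. It is biased
    downward when contexts are long relative to the amount of data.
    """
    context_counts = Counter()
    ngram_counts = Counter()
    for i in range(context_len, len(tokens)):
        context = tuple(tokens[i - context_len:i])
        context_counts[context] += 1
        ngram_counts[context + (tokens[i],)] += 1

    total = sum(ngram_counts.values())
    entropy = 0.0
    for ngram, count in ngram_counts.items():
        # P(context, next) * log P(next | context)
        entropy -= (count / total) * math.log(count / context_counts[ngram[:-1]])
    return entropy
```

Calling the function for increasing `context_len` on the same token sequence traces out an empirical entropy-versus-context-length curve. Because the number of distinct contexts grows quickly with `context_len`, the estimate is only trustworthy while most contexts are observed many times.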
Taxonomy
Topics: Topic Modeling · Language and cultural evolution · Explainable Artificial Intelligence (XAI)
