An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models
Anuj K. Nayak, Lav R. Varshney

TL;DR
This paper introduces a unified information-theoretic framework that explains compute-optimal size scaling, emergence, and plateaus in language models, drawing on LDPC coding theory and random network theory.
Contribution
It presents a single mathematical model, built on skill-text bipartite graphs, that accounts for compute-optimal size scaling (the Chinchilla rule), emergent capabilities, and performance plateaus.
Findings
Derives the compute-optimal size scaling (Chinchilla rule) from finite-size scaling characterizations of iterative LDPC decoding.
Provides a simple explanation, based on random network theory, for the emergence of complex skills in language models.
Explains why performance can plateau more than once as model size increases.
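For readers unfamiliar with the Chinchilla rule named above, it is usually summarized as model parameters and training tokens growing in equal proportion with compute. The following is a minimal sketch of that arithmetic only, not of the paper's derivation. It assumes the common approximation C ≈ 6·N·D for training FLOPs and a rule-of-thumb ratio of about 20 tokens per parameter; both constants and the function name `chinchilla_allocation` are illustrative assumptions that do not come from this page.

```python
import math

def chinchilla_allocation(compute_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a compute budget between parameters N and tokens D.

    Assumptions (not from the paper summarized here):
      - training cost C is approximated as flops_per_param_token * N * D
      - the compute-optimal ratio D / N is a constant (tokens_per_param)
    Under these assumptions both N and D scale as sqrt(C).
    """
    # Solve C = k * N * (r * N) for N, where k = FLOPs per parameter-token and r = D / N.
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    for budget in (1e21, 4e21, 1.6e22):
        n, d = chinchilla_allocation(budget)
        print(f"C={budget:.1e}  N={n:.3e}  D={d:.3e}")
```

Under these assumptions, quadrupling the compute budget doubles both the parameter count and the token count.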
Abstract
Recent empirical studies show three phenomena with increasing size of language models: compute-optimal size scaling, emergent capabilities, and performance plateauing. We present a simple unified mathematical framework to explain all of these language model scaling phenomena, building on recent skill-text bipartite graph frameworks for semantic learning. Modeling the learning of concepts from texts as an iterative process yields an analogy to iterative decoding of low-density parity check (LDPC) codes in information theory. Thence, drawing on finite-size scaling characterizations of LDPC decoding, we derive the compute-optimal size scaling (Chinchilla rule) for language models. Further, using tools from random network theory, we provide a simple explanation for both the emergence of complex skills and the plateauing of performance as the size of language models scales. We see multiple plateaus.
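The iterative LDPC decoding that the abstract uses as an analogy can be illustrated with the standard peeling decoder for the erasure channel: any check node with exactly one unknown neighbor resolves that neighbor, and the process repeats until no such check remains. The sketch below is a generic illustration of that decoder. Reading checks as texts and variables as concepts is a loose gloss on the abstract, not the authors' construction, and the names `peel` and `random_checks` are hypothetical.

```python
import random

def peel(checks, erased):
    """Run peeling decoding on a bipartite graph.

    checks: list of sets, each holding the variable indices joined to one check node
            (loosely: the concepts that appear in one text).
    erased: iterable of variable indices that are initially unknown.
    Returns the set of variables still unknown when peeling stalls.
    """
    unknown = set(erased)
    progress = True
    while progress:
        progress = False
        for check in checks:
            pending = check & unknown
            # A check with exactly one unknown neighbor determines that neighbor.
            if len(pending) == 1:
                unknown -= pending
                progress = True
    return unknown

def random_checks(n_vars, n_checks, degree, rng):
    """Build a random graph in which every check touches `degree` distinct variables."""
    return [set(rng.sample(range(n_vars), degree)) for _ in range(n_checks)]

if __name__ == "__main__":
    rng = random.Random(0)
    n_vars = 200
    erased = set(range(n_vars))  # start with every variable unknown except a few seeds
    erased -= set(rng.sample(range(n_vars), 40))
    for n_checks in (50, 100, 200, 400):
        left = peel(random_checks(n_vars, n_checks, 3, rng), erased)
        print(n_checks, "checks ->", len(left), "variables still unknown")
```

Running the demo with more checks per variable generally leaves fewer variables unresolved, which is the kind of size dependence that finite-size scaling analyses of LDPC decoding quantify.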
Taxonomy
Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Language and cultural evolution
