Slaves to the Law of Large Numbers: An Asymptotic Equipartition Property for Perplexity in Generative Language Models
Tyler Bell, Avinash Mudireddy, Ivan Johnson-Eversoll, Soura Dasgupta, Raghu Mudumbai

TL;DR
This paper establishes that the log-perplexity of long texts generated by a language model converges to the average entropy of the model's token distributions, which defines a typical set of outputs and reveals strong constraints on model behavior.
Contribution
It proves a new asymptotic equipartition property for the perplexity of language-model text and refines the typical-set concept to include only grammatically correct texts, with practical implications.
Findings
Log-perplexity of long generated texts converges to the average entropy of the token distributions.
The refined typical set includes only grammatically correct texts and is a vanishingly small subset of them.
Language models are highly constrained in their possible outputs.
Abstract
We prove a new asymptotic equipartition property for the perplexity of long texts generated by a language model and present supporting experimental evidence from open-source models. Specifically, we show that the logarithmic perplexity of any large text generated by a language model must asymptotically converge to the average entropy of its token distributions. This defines a "typical set" that all long synthetic texts generated by a language model must belong to. We refine the concept of "typical set" to include only grammatically correct texts. We then show that this refined typical set is a vanishingly small subset of all possible grammatically correct texts for a very general definition of grammar. This means that language models are strongly constrained in the range of their possible behaviors and outputs. We make no simplifying assumptions (such as stationarity) about the…
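The convergence claim in the abstract can be illustrated numerically. The sketch below is not the authors' experiment: it replaces a real language model with a hypothetical toy sampler (`next_token_dist`, with a made-up vocabulary size `VOCAB`) whose next-token distribution depends on both the previous token and the position, so the process is not stationary. It assumes the standard definition of log-perplexity as the average negative log-probability of the sampled tokens, and compares that with the average entropy of the distributions the tokens were drawn from.

```python
import math
import random

VOCAB = 6  # hypothetical toy vocabulary size (assumption, not from the paper)


def next_token_dist(prev, t):
    """Toy next-token distribution (an illustrative stand-in for a language model).

    It depends on the previous token and on the position t, so the
    generated process is neither i.i.d. nor stationary.
    """
    weights = [1.0 + ((prev + 1) * (k + 2) + t) % 7 for k in range(VOCAB)]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(p):
    """Shannon entropy of a distribution, in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)


def log_perplexity_vs_entropy(n_tokens, seed=0):
    """Sample n_tokens tokens and return (log-perplexity, average entropy)."""
    rng = random.Random(seed)
    prev = 0
    neg_log_lik = 0.0  # running sum of -log p_t(x_t)
    ent_sum = 0.0      # running sum of H(p_t)
    for t in range(n_tokens):
        p = next_token_dist(prev, t)
        tok = rng.choices(range(VOCAB), weights=p)[0]
        neg_log_lik += -math.log(p[tok])
        ent_sum += entropy(p)
        prev = tok
    return neg_log_lik / n_tokens, ent_sum / n_tokens


if __name__ == "__main__":
    # The gap between the two averages should tend to shrink as the text grows.
    for n in (100, 1000, 10000, 100000):
        lp, h = log_perplexity_vs_entropy(n)
        print(n, round(lp, 4), round(h, 4), round(abs(lp - h), 4))
```

Running the script prints, for each text length, the log-perplexity, the average entropy, and their absolute gap. In this toy setting the gap tends to shrink as the length grows, which is the behavior the abstract describes for real models.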
Taxonomy
Topics: Natural Language Processing Techniques · DNA and Biological Computing · Topic Modeling
Methods: Sparse Evolutionary Training
