Large language models and the entropy of English
Colin Scheibner, Lindsay M. Smith, William Bialek

TL;DR
This paper uses large language models to reveal long-range dependencies in English text: the conditional entropy keeps decreasing as the context grows, which implies direct interactions between distant characters, and the models appear to learn this long-range structure only gradually during training.
Contribution
It uncovers long-range dependencies in English texts using LLMs, showing that these models capture interactions over thousands of characters and that long-range structure is learned gradually.
Findings
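In many cases, conditional entropy continues to decrease with context length out to at least about 10,000 characters.
Small but significant correlations exist between characters at these separations, shown directly from the data without relying on models.
Training dynamics differ at long and short context lengths, suggesting that long-range structure is learned only gradually.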
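To make the second finding concrete, the sketch below estimates how much one character tells you about another a fixed distance away, using only counts from the text. It is an illustration rather than the authors' procedure: the function name `mutual_information_bits` and the choice of mutual information as the dependence measure are assumptions made here, and on real corpora such plug-in estimates need a bias correction before small values can be trusted.

```python
import math
from collections import Counter

def mutual_information_bits(text: str, d: int) -> float:
    """Plug-in mutual information (bits) between characters d positions apart."""
    if d < 1 or len(text) <= d:
        raise ValueError("need 1 <= d < len(text)")
    # All (character, character-d-later) pairs in the text.
    pairs = [(text[i], text[i + d]) for i in range(len(text) - d)]
    total = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / total
        # p_ab / (p_a * p_b), written with raw counts.
        ratio = c * total / (first[a] * second[b])
        mi += p_ab * math.log2(ratio)
    return mi
```

A strictly alternating string such as "abababab" gives close to one bit at d = 1, while a constant string gives zero.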
Abstract
We use large language models (LLMs) to uncover long-ranged structure in English texts from a variety of sources. The conditional entropy or code length in many cases continues to decrease with context length at least to N ∼ 10^4 characters, implying that there are direct dependencies or interactions across these distances. A corollary is that there are small but significant correlations between characters at these separations, as we show from the data independent of models. The distribution of code lengths reveals an emergent certainty about an increasing fraction of characters at large N. Over the course of model training, we observe different dynamics at long and short context lengths, suggesting that long-ranged structure is learned only gradually. Our results constrain efforts to build statistical physics models of LLMs or language itself.
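As a reminder of what the conditional entropy measures, here is a minimal sketch that estimates it from raw n-gram counts for a context of n preceding characters. This is not the paper's method: the function name `conditional_entropy_bits` is hypothetical, and a counting estimator like this one is only usable for very short contexts, because the number of possible contexts grows exponentially with n and the counts become badly undersampled. That limitation is the reason a model-based code length, the negative log probability the model assigns to each character, is needed at long range.

```python
import math
from collections import Counter

def conditional_entropy_bits(text: str, n: int) -> float:
    """Plug-in estimate of H(next char | previous n chars), in bits per character."""
    if n < 0 or len(text) <= n:
        raise ValueError("need 0 <= n < len(text)")
    total = len(text) - n
    # Counts of each context together with the character that follows it.
    joint = Counter(text[i:i + n + 1] for i in range(total))
    # Counts of each context, over the same positions.
    context = Counter(text[i:i + n] for i in range(total))
    h = 0.0
    for gram, c in joint.items():
        p_joint = c / total
        p_cond = c / context[gram[:-1]]
        h -= p_joint * math.log2(p_cond)
    return h
```

On the toy string "abababab", the estimate is one bit per character with no context (n = 0) and drops to zero once a single preceding character is known (n = 1), a small-scale version of entropy decreasing with context length.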
Taxonomy
Topics: Authorship Attribution and Profiling · Text Readability and Simplification · Language and cultural evolution
