Semantic Chunking and the Entropy of Natural Language
Weishun Zhong, Doron Sivan, Tankut Can, Mikhail Katkov, Misha Tsodyks

TL;DR
This paper presents a hierarchical statistical model of the multi-scale semantic structure of natural language. The model explains the entropy rate and redundancy of language, and its predictions align with empirical estimates from large language models.
Contribution
The authors introduce a novel multi-scale semantic segmentation model that analytically accounts for the entropy and redundancy in natural language, linking semantic complexity to entropy rate.
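To make the idea of multi-scale segmentation concrete, here is a toy sketch of recursively splitting a text into nested chunks until every chunk is a single word. It is not the authors' procedure: the cut points are chosen at random as a placeholder for semantic boundaries, and the names `segment`, `leaves`, `depth`, and the parameter `max_children` are hypothetical.

```python
import random


def segment(tokens, max_children=3, rng=None):
    """Recursively split tokens into contiguous chunks down to single words.

    A leaf is a single token (str); an internal node is a list of subtrees.
    Cut points are random here, standing in for semantic boundaries.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    if max_children < 2:
        raise ValueError("max_children must be at least 2")
    if rng is None:
        rng = random.Random(0)  # fixed seed so the toy example is repeatable
    if len(tokens) == 1:
        return tokens[0]
    # Number of chunks at this level: between 2 and max_children,
    # and never more than the number of tokens available.
    k = rng.randint(2, min(max_children, len(tokens)))
    # Choose k - 1 distinct cut positions between adjacent tokens.
    cuts = sorted(rng.sample(range(1, len(tokens)), k - 1))
    bounds = [0] + cuts + [len(tokens)]
    return [segment(tokens[a:b], max_children, rng)
            for a, b in zip(bounds, bounds[1:])]


def leaves(tree):
    """Read the words back off the tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree for word in leaves(child)]


def depth(tree):
    """Number of segmentation levels above the deepest word."""
    if isinstance(tree, str):
        return 0
    return 1 + max(depth(child) for child in tree)


tokens = "the quick brown fox jumps over the lazy dog".split()
tree = segment(tokens, max_children=3, rng=random.Random(1))
print(tree)
print("levels:", depth(tree))
```

Reading the leaves left to right recovers the original word sequence, so the tree is a hierarchical decomposition of the text rather than a reordering of it.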
Findings
The model accurately predicts the entropy rate of English.
The entropy rate increases with the semantic complexity of the corpus.
The model's predictions agree quantitatively with estimates from modern language models.
Abstract
The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80 percent redundancy relative to the five bits per character expected for random text. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language, providing a first-principles account of this redundancy level. Our model describes a procedure of self-similarly segmenting text into semantically coherent chunks down to the single-word level. The semantic structure of the text can then be hierarchically decomposed, allowing for analytical treatment. Numerical experiments with modern LLMs and open datasets suggest that our model quantitatively captures the structure of real texts at different levels of the…
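The redundancy figure quoted in the abstract follows from simple arithmetic: redundancy is one minus the ratio of the entropy rate to the entropy of random text. The sketch below reproduces it. The function name `redundancy` is hypothetical, and the 27-symbol alphabet (26 letters plus space) is an assumption used only to show where a figure near five bits per character could come from.

```python
import math


def redundancy(entropy_rate_bits, max_bits):
    """Fraction of the maximum per-character information that is redundant."""
    return 1.0 - entropy_rate_bits / max_bits


# Figures quoted in the abstract: about 1 bit/char for English,
# against about 5 bits/char for random text.
r_quoted = redundancy(1.0, 5.0)

# Assumption: uniformly random text over 27 symbols (26 letters plus space).
uniform_bits = math.log2(27)
r_uniform = redundancy(1.0, uniform_bits)

print(f"redundancy with 5 bits/char baseline: {r_quoted:.2f}")
print(f"bits/char for a uniform 27-symbol alphabet: {uniform_bits:.2f}")
print(f"redundancy with that baseline: {r_uniform:.2f}")
```

With the rounded five-bit baseline the redundancy is 80 percent; with the slightly smaller 27-symbol baseline it comes out just under that, consistent with the abstract's "nearly 80 percent".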
Taxonomy
Topics: Authorship Attribution and Profiling · Language and cultural evolution · Natural Language Processing Techniques
