Semantic Chunking and the Entropy of Natural Language

Weishun Zhong; Doron Sivan; Tankut Can; Mikhail Katkov; Misha Tsodyks

arXiv:2602.13194 · cs.CL · February 19, 2026

Open Access

TL;DR

This paper presents a hierarchical statistical model that captures the multi-scale semantic structure of natural language, explaining its entropy rate and redundancy, and aligning with empirical estimates from large language models.

Contribution

The authors introduce a novel multi-scale semantic segmentation model that analytically accounts for the entropy and redundancy in natural language, linking semantic complexity to entropy rate.

Findings

01 The model accurately predicts the entropy rate of English.

02 The entropy rate increases with the semantic complexity of the corpus.

03 The model's predictions agree quantitatively with estimates from modern language models.

Abstract

The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80 percent redundancy relative to the five bits per character expected for random text. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language, providing a first-principles account of this redundancy level. Our model describes a procedure of self-similarly segmenting text into semantically coherent chunks down to the single-word level. The semantic structure of the text can then be hierarchically decomposed, allowing for analytical treatment. Numerical experiments with modern LLMs and open datasets suggest that our model quantitatively captures the structure of real texts at different levels of the…
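The redundancy figure quoted in the abstract follows from a one-line calculation: redundancy is one minus the ratio of the actual entropy rate to the maximum entropy per character. The short sketch below reproduces that arithmetic. The 27-symbol alphabet (26 letters plus a space) is an assumption made here for illustration, as are all names in the code; the abstract itself only states the rounded figure of five bits per character.

```python
import math


def redundancy(entropy_rate_bits: float, max_bits: float) -> float:
    """Return 1 - H / H_max, the fraction of per-character capacity left unused."""
    return 1.0 - entropy_rate_bits / max_bits


# Entropy rate of printed English quoted in the abstract (bits per character).
h_english = 1.0

# Rounded maximum quoted in the abstract for random text.
h_max_rounded = 5.0

# Assumption (not stated in the abstract): 27 equiprobable symbols,
# i.e. 26 letters plus a space, giving log2(27) bits per character.
h_max_27 = math.log2(27)

r_rounded = redundancy(h_english, h_max_rounded)
r_27 = redundancy(h_english, h_max_27)

print(f"H_max (27 symbols): {h_max_27:.2f} bits per character")
print(f"Redundancy with H_max = 5 bits: {r_rounded:.1%}")
print(f"Redundancy with H_max = log2(27) bits: {r_27:.1%}")
```

With the rounded five-bit maximum the redundancy is exactly 80 percent. Under the 27-symbol assumption it comes out slightly lower, which is consistent with the abstract's wording of "nearly 80 percent".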

Taxonomy

Topics: Authorship Attribution and Profiling · Language and cultural evolution · Natural Language Processing Techniques