Excess entropy in natural language: present state and perspectives
Łukasz Dębowski

TL;DR
This paper reviews recent advances in understanding mutual information and excess entropy in natural language, highlighting power-law behaviors and their implications for language structure and evolution.
Contribution
It synthesizes mathematical results on mutual information growth in texts and discusses their relevance to human communication and language complexity.
Findings
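A power law for words defined as strings that occur sufficiently often (Herdan's law) holds if mutual information between adjacent portions of text grows as a power law of their length.
This power-law growth of mutual information holds if texts describe a complicated, infinite, algorithmically random object in a highly repetitive way.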
Implications for understanding language complexity and evolution.
Abstract
We review recent progress in understanding the meaning of mutual information in natural language. Let us define words in a text as strings that occur sufficiently often. In a few previous papers, we have shown that a power-law distribution for so defined words (a.k.a. Herdan's law) is obeyed if there is a similar power-law growth of (algorithmic) mutual information between adjacent portions of texts of increasing length. Moreover, the power-law growth of information holds if texts describe a complicated infinite (algorithmically) random object in a highly repetitive way, according to an analogous power-law distribution. The described object may be immutable (like a mathematical or physical constant) or may evolve slowly in time (like cultural heritage). Here we reflect on the respective mathematical results in a less technical way. We also discuss feasibility of deciding to what extent…
