From Boltzmann to Zipf through Shannon and Jaynes
Alvaro Corral, Montserrat Garcia del Muro

TL;DR
This paper models word-frequency distributions using a statistical physics approach, deriving Zipf's law from maximum entropy principles and pairwise letter interactions, revealing underlying statistical laws in language data.
Contribution
It introduces a maximum-entropy framework based on letter interactions to explain word frequencies and Zipf's law in language, connecting linguistic patterns with statistical physics.
Findings
The model reproduces Zipf's law with some limitations.
Empirical two-letter marginals follow statistical laws.
Interaction potentials exhibit well-defined statistical distributions.
Abstract
The word-frequency distribution provides the fundamental building blocks that generate discourse in language. It is well known, from empirical evidence, that the word-frequency distribution of almost any text is described by Zipf's law, at least approximately. Following Stephens and Bialek [Phys. Rev. E 81, 066119, 2010], we interpret the frequency of any word as arising from the interaction potential between its constituent letters. Indeed, Jaynes' maximum-entropy principle, with the constraints given by every empirical two-letter marginal distribution, leads to a Boltzmann distribution for word probabilities, with an energy-like function given by the sum of all pairwise (two-letter) potentials. The improved iterative-scaling algorithm allows us to find the potentials from the empirical two-letter marginals. Applying this formalism to words with up to six letters from the English subset…
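The construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes fixed-length words over a toy two-letter alphabet, uses made-up counts, and swaps in plain iterative proportional fitting for the improved iterative-scaling algorithm named in the abstract. It also assumes every empirical two-letter marginal is strictly positive. All function and variable names are hypothetical.

```python
import itertools
import math
from collections import defaultdict


def pair_marginals(probs, length):
    """Two-letter marginals of a word distribution, one per position pair (i, j)."""
    marg = {}
    for i, j in itertools.combinations(range(length), 2):
        m = defaultdict(float)
        for word, p in probs.items():
            m[(word[i], word[j])] += p
        marg[(i, j)] = dict(m)
    return marg


def boltzmann(potentials, words):
    """P(w) = exp(-E(w)) / Z, with E(w) the sum of all pairwise potentials."""
    weights = {
        w: math.exp(-sum(v[(w[i], w[j])] for (i, j), v in potentials.items()))
        for w in words
    }
    z = sum(weights.values())
    return {w: x / z for w, x in weights.items()}


def fit_potentials(empirical, alphabet, length, sweeps=200):
    """Iterative proportional fitting (an assumption of this sketch, not the
    improved iterative-scaling algorithm): adjust each pairwise potential
    until the model marginal matches the empirical one."""
    words = ["".join(t) for t in itertools.product(alphabet, repeat=length)]
    potentials = {
        pair: {ab: 0.0 for ab in itertools.product(alphabet, repeat=2)}
        for pair in empirical
    }
    for _ in range(sweeps):
        for pair, target in empirical.items():
            model = pair_marginals(boltzmann(potentials, words), length)[pair]
            for ab, q in target.items():
                # Lowering the potential raises the probability, hence the minus sign.
                potentials[pair][ab] -= math.log(q / model[ab])
    return potentials, boltzmann(potentials, words)


# Toy corpus: invented counts for three-letter "words" over the alphabet {a, b}.
counts = {"aaa": 5, "aab": 2, "aba": 3, "abb": 1,
          "baa": 4, "bab": 1, "bba": 2, "bbb": 2}
total = sum(counts.values())
empirical = pair_marginals({w: c / total for w, c in counts.items()}, 3)

potentials, model = fit_potentials(empirical, "ab", 3)
fitted = pair_marginals(model, 3)
```

After fitting, `fitted` should agree with `empirical` for every position pair, while `model` is the maximum-entropy word distribution consistent with those marginals. In general `model` differs from the raw word frequencies, because only two-letter statistics are constrained.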
