Understanding Zipf's law of word frequencies through sample-space collapse in sentence formation
Stefan Thurner, Rudolf Hanel, Bo Liu, Bernat Corominas-Murtra

TL;DR
This paper presents a simple model that links sample-space reduction during sentence formation to Zipf's law in word frequencies. Empirical analysis of English texts supports the model and shows how nestedness influences the power-law exponent.
Contribution
It introduces a novel sample-space reducing model that explains Zipf's law through nestedness in word transition structures, without relying on multiplicative or preferential mechanisms.
Findings
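The model explains the approximate Zipf law in word frequencies as a direct consequence of sample-space reduction.
Empirical analysis of ten English books shows that the nestedness of word-transition tables is tightly related to the power-law exponent of the word-frequency distribution.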
Deviations from Zipf's law are linked to variations in nestedness across texts.
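The first finding can be illustrated with a minimal sketch of a sample-space reducing process. This is a toy stand-in, not the authors' text-generation model: it assumes `n_states` abstract states, and each run starts from the full sample space and jumps uniformly at random to any strictly lower state until it reaches state 1. The function name `ssr_visit_counts` and its parameters are hypothetical. Under these assumptions, state `i` is visited in a fraction of roughly `1/i` of the runs, which is a Zipf-like rank-frequency relation.

```python
import random
from collections import Counter


def ssr_visit_counts(n_states, n_runs, seed=0):
    """Count state visits in a toy sample-space reducing process.

    Assumption (illustrative only): each run begins with all states
    1..n_states - 1 reachable, and every jump lands uniformly on a
    strictly lower state, so the sample space shrinks at each step.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        state = n_states  # starting point of the run; not counted as a visit
        while state > 1:
            # Only states below the current one remain available.
            state = rng.randint(1, state - 1)
            counts[state] += 1
    return counts


if __name__ == "__main__":
    n_runs = 20000
    counts = ssr_visit_counts(n_states=50, n_runs=n_runs)
    for i in (1, 2, 5, 10):
        # Compare the empirical visit fraction with the 1/i expectation.
        print(i, counts[i] / n_runs, 1 / i)
```

No multiplicative or preferential mechanism appears in the loop: the only ingredient is that the set of reachable states shrinks as the run proceeds.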
Abstract
The formation of sentences is a highly structured and history-dependent process. The probability of using a specific word in a sentence strongly depends on the 'history' of word-usage earlier in that sentence. We study a simple history-dependent model of text generation assuming that the sample-space of word usage reduces along sentence formation, on average. We first show that the model explains the approximate Zipf law found in word frequencies as a direct consequence of sample-space reduction. We then empirically quantify the amount of sample-space reduction in the sentences of ten famous English books, by analysis of corresponding word-transition tables that capture which words can follow any given word in a text. We find a highly nested structure in these transition tables and show that this 'nestedness' is tightly related to the power law exponents of the observed word frequency…
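The word-transition tables mentioned in the abstract can be sketched as a mapping from each word to the set of words that follow it somewhere in the text. The snippet below is a rough illustration under simplifying assumptions: whitespace tokenization, lowercasing, and no punctuation handling. The function `mean_pairwise_overlap` is a hypothetical stand-in for a nestedness score, not the paper's definition, which this summary does not give. It equals 1 when every smaller follower set is contained in every larger one.

```python
from collections import defaultdict
from itertools import combinations


def transition_table(sentences):
    """Map each word to the set of words observed directly after it."""
    followers = defaultdict(set)
    for sentence in sentences:
        words = sentence.lower().split()  # assumption: naive tokenization
        for current, nxt in zip(words, words[1:]):
            followers[current].add(nxt)
    return dict(followers)


def mean_pairwise_overlap(table):
    """Illustrative nestedness proxy (an assumption, not the paper's measure).

    For each pair of follower sets, take the shared fraction of the
    smaller set, then average over all pairs.
    """
    sets = list(table.values())
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    total = sum(len(a & b) / min(len(a), len(b)) for a, b in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    toy_text = ["the cat sat", "the dog sat", "a cat ran"]
    table = transition_table(toy_text)
    print(table)
    print(mean_pairwise_overlap(table))
```

Words that never precede another word, such as the last word of every sentence in which they appear, get no row in this table.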
Taxonomy
Topics: Opinion Dynamics and Social Influence · Language and cultural evolution · Complex Network Analysis Techniques
