Large-scale analysis of Zipf's law in English texts
Isabel Moreno-Sánchez, Francesc Font-Clos, Álvaro Corral

TL;DR
This study tests three versions of Zipf's law on more than 30,000 English texts with state-of-the-art fitting and goodness-of-fit methods, and finds that one version fits more than 40% of the texts at a high significance level.
Contribution
It provides a large-scale, statistically rigorous evaluation of Zipf's law in English texts, clarifying its applicability and form.
Findings
One version of Zipf's law fits over 40% of texts.
The version that fits is a pure power law, with a single parameter, in the complementary cumulative distribution function of word frequencies.
The study employs state-of-the-art goodness-of-fit tests.
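To make the "pure power law in the complementary cumulative distribution function" concrete, here is a minimal Python sketch. It is not the authors' code. It assumes the form S(n) = (a / n) ** gamma for integer word frequencies n >= a, with a lower cut-off a that this summary does not specify (taken as 1 by default), and it estimates the single exponent gamma by maximum likelihood. The tokenizer, the function names, and the search interval for gamma are all illustrative assumptions.

```python
import math
import random
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each distinct word occurs (crude tokenizer, an assumption)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def empirical_ccdf(freqs):
    """Return {n: fraction of word types whose frequency is >= n}."""
    counts = Counter(freqs)
    total = remaining = sum(counts.values())
    ccdf = {}
    for n in sorted(counts):
        ccdf[n] = remaining / total
        remaining -= counts[n]
    return ccdf


def log_likelihood(gamma, counts, a=1):
    """Log-likelihood of S(n) = (a / n) ** gamma, with f(n) = S(n) - S(n + 1)."""
    ll = 0.0
    for n, c in counts.items():
        # log f(n) = gamma * log(a / n) + log(1 - (n / (n + 1)) ** gamma),
        # written with expm1/log1p to avoid cancellation at large n.
        tail = math.log(-math.expm1(-gamma * math.log1p(1 / n)))
        ll += c * (gamma * math.log(a / n) + tail)
    return ll


def fit_exponent(freqs, a=1, lo=0.1, hi=10.0, tol=1e-6):
    """Maximize the (concave) log-likelihood over gamma by ternary search."""
    counts = Counter(n for n in freqs if n >= a)
    while hi - lo > tol:
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if log_likelihood(m1, counts, a) < log_likelihood(m2, counts, a):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


def sample_frequencies(gamma, size, seed=0):
    """Synthetic frequencies with P(N >= n) = n ** -gamma (for checking the fit)."""
    rng = random.Random(seed)
    return [int((1.0 - rng.random()) ** (-1.0 / gamma)) for _ in range(size)]
```

For a real text, `list(word_frequencies(text).values())` gives the frequencies to pass to `fit_exponent`. Calling `fit_exponent(sample_frequencies(1.0, 20000))` on synthetic data should return a value close to 1, which is a quick sanity check on the estimator.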
Abstract
Despite being a paradigm of quantitative linguistics, Zipf's law for words suffers from three main problems: its formulation is ambiguous, its validity has not been tested rigorously from a statistical point of view, and it has not been confronted with a representatively large number of texts. So, we can summarize the current support of Zipf's law in texts as anecdotal. We try to solve these issues by studying three different versions of Zipf's law and fitting them to all available English texts in the Project Gutenberg database (consisting of more than 30000 texts). To do so we use state-of-the-art tools in fitting and goodness-of-fit tests, carefully tailored to the peculiarities of text statistics. Remarkably, one of the three versions of Zipf's law, consisting of a pure power-law form in the complementary cumulative distribution function of word frequencies, is able to fit more than…
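The abstract as shown here does not say which goodness-of-fit test is used. As an illustration only, the sketch below uses one common choice for power-law fits: the Kolmogorov-Smirnov distance between the empirical and fitted complementary cumulative distributions, with a p-value obtained by simulating synthetic samples from the fitted law and refitting each one. The model S(n) = n ** -gamma with lower cut-off 1, the function names, and the number of simulations are assumptions, not details taken from the paper.

```python
import math
import random
from collections import Counter


def fit_gamma(counts, lo=0.1, hi=10.0, tol=1e-5):
    """Maximum-likelihood exponent for S(n) = n ** -gamma, integers n >= 1."""
    def ll(g):
        # log f(n) with f(n) = S(n) - S(n + 1), written to avoid cancellation
        return sum(c * (-g * math.log(n)
                        + math.log(-math.expm1(-g * math.log1p(1 / n))))
                   for n, c in counts.items())
    while hi - lo > tol:  # ternary search; the log-likelihood is concave
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if ll(m1) < ll(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


def ks_distance(counts, gamma):
    """Largest gap between the empirical CCDF and the fitted n ** -gamma."""
    total = remaining = sum(counts.values())
    dist, prev = 0.0, 0
    for n in sorted(counts):
        emp = remaining / total  # empirical CCDF is flat on prev + 1 .. n
        for m in (prev + 1, n):
            dist = max(dist, abs(emp - m ** -gamma))
        remaining -= counts[n]
        prev = n
    return max(dist, (prev + 1) ** -gamma)  # empirical CCDF is 0 above the max


def goodness_of_fit(freqs, n_sims=100, seed=0):
    """Return (gamma, KS distance, simulated p-value) for the fitted power law."""
    rng = random.Random(seed)
    counts = Counter(freqs)
    gamma = fit_gamma(counts)
    d_obs = ks_distance(counts, gamma)
    worse = 0
    for _ in range(n_sims):
        # Draw a synthetic sample from the fitted law, then refit it so the
        # simulated distances are affected by fitting in the same way.
        sim = Counter(int((1.0 - rng.random()) ** (-1.0 / gamma))
                      for _ in range(len(freqs)))
        if ks_distance(sim, fit_gamma(sim)) >= d_obs:
            worse += 1
    return gamma, d_obs, worse / n_sims
```

A small p-value means that few simulated power-law samples deviate from their own fit as much as the data do, so the power law is rejected. A text would count as "fitted" when its p-value exceeds whatever significance threshold is chosen.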
