Zipf's law holds for phrases, not words
Jake Ryland Williams, Paul R. Lessard, Suma Desu, Eric Clark, James P. Bagrow, Christopher M. Danforth, and Peter Sheridan Dodds

TL;DR
This paper demonstrates that Zipf's law applies to multi-word phrases over a much broader range than to individual words, using a new scalable method for phrase extraction.
Contribution
It introduces a statistical mechanical method of random text partitioning that splits text into phrases of mixed length, and uses the resulting rank ordering to extend the range over which Zipf's law applies.
Findings
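Zipf's law holds for phrases over as many as nine orders of magnitude in rank
A principled, scalable method for phrase extraction, based on random text partitioning, was developed
For individual words, Zipf's law holds over no more than three to four orders of magnitude before scaling breaks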
Abstract
With Zipf's law being originally and most famously observed for word frequency, it is surprisingly limited in its applicability to human language, holding over no more than three to four orders of magnitude before hitting a clear break in scaling. Here, building on the simple observation that phrases of one or more words comprise the most coherent units of meaning in language, we show empirically that Zipf's law for phrases extends over as many as nine orders of rank magnitude. In doing so, we develop a principled and scalable statistical mechanical method of random text partitioning, which opens up a rich frontier of rigorous text analysis via a rank ordering of mixed length phrases.
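The abstract names random text partitioning but does not spell out the procedure, so the following is only a minimal sketch under stated assumptions: text is tokenized on whitespace, each gap between adjacent words is cut independently with a probability `q`, and phrases are then ranked by raw count. The function names `random_partition` and `rank_frequency`, the parameter `q`, and the sample sentence are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def random_partition(words, q, rng):
    """Split a word list into phrases, cutting each gap with probability q.

    Assumption: cuts are independent across gaps. q=1 yields single words,
    q=0 yields the whole text as one phrase.
    """
    if not words:
        return []
    phrases = []
    current = [words[0]]
    for word in words[1:]:
        if rng.random() < q:
            # Cut here: close the current phrase and start a new one.
            phrases.append(" ".join(current))
            current = [word]
        else:
            # No cut: extend the current phrase.
            current.append(word)
    phrases.append(" ".join(current))
    return phrases


def rank_frequency(phrases):
    """Return (rank, phrase, count) tuples, most frequent first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    counts = Counter(phrases)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, phrase, count)
            for rank, (phrase, count) in enumerate(ordered, start=1)]


# Hypothetical toy input; a real analysis would need a large corpus.
text = "the cat sat on the mat and the cat sat on the hat"
words = text.split()
rng = random.Random(0)  # fixed seed so the run is reproducible

phrases = random_partition(words, q=0.5, rng=rng)
for rank, phrase, count in rank_frequency(phrases)[:5]:
    print(rank, repr(phrase), count)
```

On a large corpus, plotting count against rank on log-log axes would show whether the ranked phrases follow a power law. A toy sentence like this one is far too small to show any scaling, and a single random partition is only one sample of the possible partitions.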
Taxonomy
Topics: Authorship Attribution and Profiling · Advanced Text Analysis Techniques · Natural Language Processing Techniques
