A scaling law beyond Zipf's law and its relation to Heaps' law
Francesc Font-Clos, Gemma Boleda, Álvaro Corral

TL;DR
This paper introduces a universal scaling law for word frequency distributions that remains consistent across text lengths, linking Zipf's and Heaps' laws and providing insights into vocabulary growth in large texts.
Contribution
It proposes a simple scaling form for word frequency distributions that is robust across different text lengths and connects Zipf's law with an alternative to Heaps' law.
Findings
Word frequency distribution shape remains constant with text growth.
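For lemmatized texts, the distribution fits a double power law: Zipf's exponent (~2) holds at large frequencies, with a smaller exponent in the low-frequency regime.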
An alternative to Heaps' law is derived from the scaling behavior.
Abstract
The dependence on text length of the statistical properties of word occurrences has long been considered a severe limitation in quantitative linguistics. We propose a simple scaling form for the distribution of absolute word frequencies which uncovers the robustness of this distribution as the text grows. In this way, the shape of the distribution is always the same, and only a scale parameter increases linearly with text length. By analyzing very long novels we show that this behavior holds both for raw, unlemmatized texts and for lemmatized texts. For the latter case, the word-frequency distribution is well fit by a double power law, maintaining Zipf's exponent value for large frequencies but yielding a smaller exponent in the low-frequency regime. The growth of the distribution with text length allows us to estimate the size of the vocabulary at each step…
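One way to check the scaling claim on a text of your own is sketched below. It counts how many word types occur exactly n times in a prefix of L tokens, then rescales both axes. The rescaling used here, n / L on the horizontal axis and L * V_L * D_L(n) on the vertical axis (with V_L the vocabulary size and D_L(n) the fraction of types with frequency n), is an assumption read off from the abstract's statement that only a scale parameter grows linearly with L; it is not taken from the truncated text above. The function names are hypothetical.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Map each absolute frequency n to the number of word types occurring n times."""
    word_counts = Counter(tokens)
    return Counter(word_counts.values())


def rescaled_spectrum(tokens):
    """Return sorted (n / L, L * V_L * D_L(n)) pairs for a token list of length L.

    D_L(n) is the fraction of types with frequency n, so L * V_L * D_L(n)
    simplifies to L * (number of types with frequency n).
    Assumption: this normalization is one plausible reading of the abstract.
    """
    L = len(tokens)
    spectrum = frequency_spectrum(tokens)
    return [(n / L, L * n_types) for n, n_types in sorted(spectrum.items())]


def prefix_curves(tokens, lengths):
    """Compute the rescaled spectrum for each prefix length in `lengths`."""
    return {L: rescaled_spectrum(tokens[:L]) for L in lengths}
```

If the scaling holds, the curves returned by `prefix_curves` for different prefix lengths of one long text should fall roughly on top of each other when plotted on log-log axes. Tokenization and lemmatization are left to the caller.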
