The brevity law as a scaling law, and a possible origin of Zipf's law for word frequencies
Alvaro Corral, Isabel Serra

TL;DR
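This paper proposes a unified statistical framework linking linguistic laws: it treats word length and word frequency as a joint distribution over word types, shows that the type-length distribution is well fitted by a gamma distribution and that frequency at fixed length decays as a power law, and suggests a possible model-free origin for Zipf's law based on these findings.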
Contribution
It characterizes each word type by its length (in characters) and absolute frequency and studies their joint probability distribution, revealing scaling laws and suggesting a potential origin for Zipf's law in language.
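As a rough illustration of the bivariate description, the sketch below counts how many word types share each (length, frequency) pair in a toy token list and then sums those counts into the two marginals. The function names `joint_length_frequency` and `marginals` and the example sentence are hypothetical, introduced only for this sketch; this is not the authors' code, and a real analysis would use a large corpus and normalized probabilities.

```python
from collections import Counter

def joint_length_frequency(tokens):
    """Return {(length, frequency): number of word types} for a token list."""
    type_freq = Counter(tokens)  # absolute frequency of each word type
    # Each word type contributes one count to its (length, frequency) cell.
    return Counter((len(word), n) for word, n in type_freq.items())

def marginals(joint):
    """Sum the joint counts into type-length and type-frequency marginals."""
    by_length, by_freq = Counter(), Counter()
    for (length, freq), n_types in joint.items():
        by_length[length] += n_types
        by_freq[freq] += n_types
    return by_length, by_freq

# Toy corpus (an assumption for illustration only).
tokens = "the cat saw the dog and the dog saw a bird".split()
joint = joint_length_frequency(tokens)
by_length, by_freq = marginals(joint)
```

Fixing a length and reading off the frequencies in `joint` gives the conditional distribution of frequency at that length, which is the quantity the summary ties to the brevity-frequency phenomenon.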
Findings
The type-length distribution is fitted better by a gamma distribution than by a lognormal.
Conditional frequency distributions at fixed length decay as a power law with exponent ~1.4.
A scaling law relates crossover frequency to word length, similar to critical phenomena.
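To make the power-law finding concrete, here is a minimal sketch of how an exponent of about 1.4 could be recovered from data. It assumes a pure continuous power-law density above a lower cutoff `x_min`, ignoring both the discreteness of real word frequencies and any crossover at high frequency, so it is a simplification rather than the paper's fitting procedure. The names `sample_power_law` and `estimate_exponent` are hypothetical, and the data are synthetic.

```python
import math
import random

def sample_power_law(alpha, x_min, size, rng):
    """Draw samples with density proportional to x**(-alpha) for x >= x_min."""
    # Inverse-transform sampling; requires alpha > 1 so the density is normalizable.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(size)]

def estimate_exponent(samples, x_min):
    """Maximum-likelihood exponent of a continuous power law above x_min."""
    tail = [x for x in samples if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum

rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = sample_power_law(alpha=1.4, x_min=1.0, size=20000, rng=rng)
alpha_hat = estimate_exponent(samples, x_min=1.0)
```

With this many synthetic samples the estimate should land close to 1.4. Applying the same estimator to real frequencies at fixed length would require choosing the fitting range carefully, since the findings above indicate the behavior changes at a length-dependent crossover frequency.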
Abstract
An important body of quantitative linguistics is constituted by a series of statistical laws about language usage. Despite the importance of these linguistic laws, some of them are poorly formulated, and, more importantly, there is no unified framework that encompasses all of them. This paper presents a new perspective to establish a connection between different statistical linguistic laws. Characterizing each word type by two random variables, length (in number of characters) and absolute frequency, we show that the corresponding bivariate joint probability distribution shows a rich and precise phenomenology, with the type-length and the type-frequency distributions as its two marginals, and the conditional distribution of frequency at fixed length providing a clear formulation for the brevity-frequency phenomenon. The type-length distribution turns out to be well fitted by a gamma…
