Corrections of Zipf's and Heaps' Laws Derived from Hapax Rate Models
Łukasz Dębowski

TL;DR
This paper proposes corrections to Zipf's and Heaps' laws by modeling hapax rate functions, with the logistic model providing the best fit for text length dependence.
Contribution
It introduces a systematic approach to correcting linguistic laws using hapax rate models, especially highlighting the effectiveness of the logistic model.
Findings
The logistic hapax rate model fits empirical data best.
Corrections improve the accuracy of linguistic law predictions.
Different hapax rate functions influence the form of Zipf's and Heaps' laws.
Abstract
The article introduces corrections to Zipf's and Heaps' laws based on systematic models of the proportion of hapaxes, i.e., words that occur once. The derivation rests on two assumptions. The first is the standard urn model, which predicts that marginal frequency distributions for shorter texts look as if word tokens were sampled blindly from a given longer text. The second posits that the hapax rate is a simple function of the text length. Four such functions are discussed: the constant model, the Davis model, the linear model, and the logistic model. It is shown that the logistic model yields the best fit.
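To make the urn model assumption concrete, here is a minimal Python sketch that draws shorter texts blindly (without replacement) from a longer token list and measures the hapax rate at each length. The function names `hapax_rate` and `urn_subsample` are hypothetical, the toy text is made up for illustration, and the hapax rate is taken here as hapaxes divided by distinct word types, which is an assumption: the abstract only speaks of "the proportion of hapaxes". This is not the paper's code.

```python
import random
from collections import Counter


def hapax_rate(tokens):
    """Return (number of types, number of hapaxes, hapaxes / types).

    Assumption: the rate is normalized by the number of distinct types.
    """
    counts = Counter(tokens)
    n_types = len(counts)
    n_hapax = sum(1 for c in counts.values() if c == 1)
    rate = n_hapax / n_types if n_types else 0.0
    return n_types, n_hapax, rate


def urn_subsample(tokens, n, rng):
    """Draw n tokens blindly, without replacement, from a longer text."""
    return rng.sample(tokens, n)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    # Toy "longer text"; any tokenized corpus could be used instead.
    text = ("the cat sat on the mat and the dog sat on the log "
            "while a bird sang over the quiet garden").split()
    for n in (5, 10, 15, len(text)):
        sample = urn_subsample(text, n, rng)
        n_types, n_hapax, rate = hapax_rate(sample)
        print(n, n_types, n_hapax, round(rate, 3))
```

Repeating the subsampling many times per length and averaging would give an empirical hapax rate curve against which a candidate function of the text length could be compared.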
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Topic Modeling
