Corrections of Zipf's and Heaps' Laws Derived from Hapax Rate Models

Łukasz Dębowski

arXiv:2307.12896 · cs.CL · May 27, 2025


TL;DR

This paper proposes corrections to Zipf's and Heaps' laws by modeling hapax rate functions, with the logistic model providing the best fit for text length dependence.

Contribution

It introduces a systematic approach to correcting linguistic laws using hapax rate models, especially highlighting the effectiveness of the logistic model.

Findings

01. The logistic hapax rate model fits empirical data best.

02. Corrections improve the accuracy of linguistic law predictions.

03. Different hapax rate functions influence the form of Zipf's and Heaps' laws.

Abstract

The article introduces corrections to Zipf's and Heaps' laws based on systematic models of the proportion of hapaxes, i.e., words that occur once. The derivation rests on two assumptions. The first is the standard urn model, which predicts that marginal frequency distributions for shorter texts look as if word tokens were sampled blindly from a given longer text. The second posits that the hapax rate is a simple function of the text length. Four such functions are discussed: the constant model, the Davis model, the linear model, and the logistic model. It is shown that the logistic model yields the best fit.
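The urn model assumption can be illustrated with a short Python sketch. It is not taken from the paper: the function names `urn_expectations` and `hapax_rate` are hypothetical, "sampled blindly" is read here as drawing tokens without replacement, and the hapax rate is assumed to mean the expected number of hapaxes divided by the expected number of word types. The parametric forms of the four hapax rate models are not given in the abstract, so they are not reproduced.

```python
from collections import Counter
from math import comb


def urn_expectations(tokens, n):
    """Expected numbers of types and hapaxes in a sample of n tokens.

    Assumption: the sample is drawn without replacement from `tokens`,
    so each word's count in the sample is hypergeometric.
    """
    N = len(tokens)
    if not 0 < n <= N:
        raise ValueError("n must satisfy 0 < n <= len(tokens)")
    freqs = Counter(tokens).values()
    total = comb(N, n)  # number of possible samples of size n
    # A word with frequency f is absent with probability C(N-f, n) / C(N, n).
    types = sum(1 - comb(N - f, n) / total for f in freqs)
    # It occurs exactly once with probability f * C(N-f, n-1) / C(N, n).
    hapaxes = sum(f * comb(N - f, n - 1) / total for f in freqs)
    return types, hapaxes


def hapax_rate(tokens, n):
    """Assumed definition: expected hapaxes divided by expected types."""
    types, hapaxes = urn_expectations(tokens, n)
    return hapaxes / types


if __name__ == "__main__":
    text = "the cat sat on the mat and the dog sat too".split()
    for n in (2, 5, len(text)):
        print(n, round(hapax_rate(text, n), 3))
```

Evaluating `hapax_rate` over a range of sample sizes `n` gives an urn-model curve of hapax rate against text length, which is the kind of dependence the four models are meant to describe.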


Taxonomy

Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Topic Modeling