Types, Tokens, and Hapaxes: A New Heap's Law

Victor Davis

arXiv:1901.00521·cs.CL·January 4, 2019

TL;DR

This paper introduces a novel, more accurate mathematical expression for Heap's Law, derived from first principles, which improves estimates of vocabulary growth and hapax legomena in large text corpora.

Contribution

The paper presents a new, first-principles derivation of Heap's Law that outperforms existing models in accuracy and extends to hapaxes and higher n-legomena.

Findings

01. New expression for the type-token curve, derived from first principles

02. Superior accuracy demonstrated on real text data

03. Extension to hapaxes and higher n-legomena

Abstract

Heap's Law states that in a large enough text corpus, the number of types as a function of tokens grows as $N = K M^{\beta}$ for some free parameters $K, \beta$. Much has been written about how this result and various generalizations can be derived from Zipf's Law. Here we derive from first principles a completely novel expression of the type-token curve and prove its superior accuracy on real text. This expression naturally generalizes to equally accurate estimates for counting hapaxes and higher $n$-legomena.
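To make the quantities in the abstract concrete, the following is a minimal Python sketch of the classical power-law form $N = K M^{\beta}$ only. It is not the paper's new expression, which this page does not reproduce. The function names (`type_token_curve`, `fit_heaps`, `count_n_legomena`) are hypothetical. Fitting by ordinary least squares in log-log space is an assumption made for illustration, as is splitting on whitespace as a stand-in for real tokenization.

```python
import math
from collections import Counter


def type_token_curve(tokens):
    """Return [(M, N)]: N distinct types observed among the first M tokens."""
    seen = set()
    curve = []
    for m, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((m, len(seen)))
    return curve


def fit_heaps(curve):
    """Fit N = K * M**beta by least squares on log N = log K + beta * log M.

    Illustrative choice of estimator; needs at least two distinct M values.
    """
    xs = [math.log(m) for m, _ in curve]
    ys = [math.log(n) for _, n in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta


def count_n_legomena(tokens, n=1):
    """Count types occurring exactly n times (n=1 gives the hapaxes)."""
    freq = Counter(tokens)
    return sum(1 for count in freq.values() if count == n)


if __name__ == "__main__":
    # Toy input; whitespace splitting stands in for a real tokenizer.
    tokens = "the cat sat on the mat and the dog sat too".split()
    K, beta = fit_heaps(type_token_curve(tokens))
    print(f"K = {K:.3f}, beta = {beta:.3f}")
    print("hapaxes:", count_n_legomena(tokens, n=1))
```

On real corpora the fitted $K$ and $\beta$ depend on the sample size and the tokenization, so a fit like this serves only as a baseline for comparing alternative type-token expressions.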
