# Types, Tokens, and Hapaxes: A New Heap's Law

**Authors:** Victor Davis

arXiv: 1901.00521 · 2019-01-04

## TL;DR

This paper derives from first principles a new expression for the type-token curve, which the author reports fits real text more accurately than the Heap's Law power law $N=KM^\beta$ and which extends to counts of hapaxes and higher $n$-legomena.

## Contribution

The paper presents a first-principles expression for the type-token curve as an alternative to the Heap's Law power law. The expression is reported to be more accurate on real text and to generalize to hapaxes and higher $n$-legomena.

## Key findings

- New expression for type-token curve derived from first principles
- Superior accuracy demonstrated on real text data
- Extension to hapaxes and higher n-legomena
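
For readers unfamiliar with the terminology in the list above: a hapax (hapax legomenon) is a type that occurs exactly once in a corpus, and an $n$-legomenon is a type that occurs exactly $n$ times. The following is a minimal sketch of how these quantities can be counted empirically. The function name `legomena_counts`, the `max_n` parameter, and the toy sentence are illustrative assumptions. It does not implement the paper's estimator.

```python
from collections import Counter

def legomena_counts(tokens, max_n=3):
    """Return {n: number of types occurring exactly n times} for n = 1..max_n.

    Hypothetical helper for illustration; not taken from the paper.
    """
    freq = Counter(tokens)             # type -> number of occurrences
    spectrum = Counter(freq.values())  # occurrence count -> number of types
    return {n: spectrum.get(n, 0) for n in range(1, max_n + 1)}

# Toy corpus (an assumption, chosen only to keep the example small).
tokens = "the cat sat on the mat and the dog sat too".split()
counts = legomena_counts(tokens)

print("tokens:", len(tokens))      # corpus size M
print("types:", len(set(tokens)))  # vocabulary size N
print("n-legomena:", counts)       # counts[1] is the number of hapaxes
```

Here `counts[1]` is the hapax count, `counts[2]` the number of types seen exactly twice, and so on.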

## Abstract

Heap's Law states that in a large enough text corpus, the number of types as a function of tokens grows as $N=KM^\beta$ for some free parameters $K,\beta$. Much has been written about how this result and various generalizations can be derived from Zipf's Law. Here we derive from first principles a completely novel expression of the type-token curve and prove its superior accuracy on real text. This expression naturally generalizes to equally accurate estimates for counting hapaxes and higher $n$-legomena.
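
To make the baseline concrete, the sketch below builds an empirical type-token curve and fits the classical power law $N=KM^\beta$ by ordinary least squares on $\log N = \log K + \beta \log M$. This is the standard form stated in the abstract, not the paper's new expression, which this summary does not reproduce. The helper names `type_token_curve` and `fit_heaps`, and the choice of log-log least squares as the fitting method, are assumptions made for illustration.

```python
import math

def type_token_curve(tokens):
    """Return a list whose (M-1)-th entry is the number of distinct types N
    among the first M tokens."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve

def fit_heaps(curve):
    """Fit N = K * M**beta by least squares in log-log space.

    Hypothetical helper; assumes the curve has at least two points.
    """
    xs = [math.log(m) for m in range(1, len(curve) + 1)]  # log M
    ys = [math.log(n) for n in curve]                     # log N
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                       # slope of the log-log line
    K = math.exp(mean_y - beta * mean_x)   # intercept, mapped back from log space
    return K, beta

# Sanity check on a degenerate corpus in which every token is a new type,
# so N = M exactly and the fit should recover K = 1, beta = 1.
all_new = [str(i) for i in range(100)]
K, beta = fit_heaps(type_token_curve(all_new))
print(K, beta)
```

On real text, the same two calls can be applied to any tokenized corpus. A log-log fit weights small $M$ heavily, so other fitting procedures may give different $K$ and $\beta$.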

---
Source: https://tomesphere.com/paper/1901.00521