
TL;DR
This paper introduces a novel, more accurate mathematical expression for Heap's Law, derived from first principles, which improves estimates of vocabulary growth and hapax legomena in large text corpora.
Contribution
The paper presents a new, first-principles derivation of Heap's Law that outperforms existing models in accuracy and extends to hapaxes and higher n-legomena.
Findings
New expression for type-token curve derived from first principles
Superior accuracy demonstrated on real text data
Extension to hapaxes and higher n-legomena
Abstract
Heap's Law states that in a large enough text corpus, the number of types $N$ as a function of the number of tokens $M$ grows as $N = K M^{\beta}$ for some free parameters $K, \beta$. Much has been written about how this result and various generalizations can be derived from Zipf's Law. Here we derive from first principles a completely novel expression of the type-token curve and prove its superior accuracy on real text. This expression naturally generalizes to equally accurate estimates for counting hapaxes and higher n-legomena.
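For readers who want to see the quantities involved, the following is a minimal sketch of the classic power-law baseline only, not the paper's new expression (which this summary does not state). It assumes the standard form in which the type count N grows as K times M to the power beta, where M is the token count, and it fits K and beta by ordinary least squares in log-log space. The function names `type_token_curve` and `fit_heaps` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def type_token_curve(tokens):
    """Return (M, N, H) after each token: tokens seen, distinct types, hapaxes."""
    counts = Counter()
    hapaxes = 0
    curve = []
    for m, tok in enumerate(tokens, start=1):
        counts[tok] += 1
        if counts[tok] == 1:
            hapaxes += 1      # a new type starts out as a hapax
        elif counts[tok] == 2:
            hapaxes -= 1      # a second occurrence means it is no longer a hapax
        curve.append((m, len(counts), hapaxes))
    return curve


def fit_heaps(curve):
    """Fit log N = log K + beta * log M by least squares (assumed baseline form)."""
    xs = [math.log(m) for m, n, _ in curve]
    ys = [math.log(n) for m, n, _ in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta


if __name__ == "__main__":
    text = "the cat sat on the mat and the dog sat"
    curve = type_token_curve(text.split())
    K, beta = fit_heaps(curve)
    print(curve[-1], K, beta)
```

A ten-word sentence is far too short for a meaningful fit; on a real corpus one would pass the full token stream and compare the fitted curve against the observed type and hapax counts.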
