Verifying Heaps' law using Google Books Ngram data
Vladimir V. Bochkarev, Eduard Yu. Lerner, Anna V. Shevlyakova

TL;DR
This paper tests Heaps' law in European languages using Google Books Ngram data and finds that the Heaps exponent varies significantly within characteristic intervals of 60-100 years.
Contribution
It provides an empirical verification of Heaps' law on large-scale Ngram data and uses a simple probability model of text generation to relate the word frequency distribution to the expected number of distinct words as a function of text size.
Findings
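The Heaps exponent varies significantly within characteristic time intervals of 60-100 years.
The analysis covers European languages in the Google Books Ngram corpus.
A simple probability model of text generation links the word frequency distribution to the expected growth of the number of distinct words with text size.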
Abstract
This article is devoted to the verification of the empirical Heaps' law in European languages using Google Books Ngram corpus data. The connection between the word frequency distribution and the expected dependence of the number of distinct words on text size is analysed in terms of a simple probability model of text generation. It is shown that the Heaps exponent varies significantly within characteristic time intervals of 60-100 years.
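The link between a word frequency distribution and vocabulary growth can be illustrated with a minimal sketch. It is not the authors' model or code: it assumes words are drawn independently from a fixed distribution, in which case the expected number of distinct words after n draws is the sum over words of 1 - (1 - p_i)^n, and it assumes a Zipf-like distribution with a hypothetical exponent of 1.2 over a hypothetical vocabulary of 50,000 words. The Heaps exponent is then estimated as the least-squares slope of log V against log n. All function names are illustrative.

```python
import math

def zipf_probs(vocab_size, alpha):
    """Zipf-like probabilities p_r proportional to r**(-alpha) (assumed distribution)."""
    weights = [r ** -alpha for r in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_vocabulary(probs, n):
    """Expected number of distinct words after n independent draws.

    Each word with probability p is seen at least once with
    probability 1 - (1 - p)**n; expm1/log1p keep this stable for tiny p.
    """
    return sum(-math.expm1(n * math.log1p(-p)) for p in probs)

def fit_heaps_exponent(sizes, vocab):
    """Least-squares slope of log V versus log n, i.e. beta in V ~ K * n**beta."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in vocab]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Hypothetical parameters chosen only for illustration.
probs = zipf_probs(50_000, 1.2)
sizes = [10 ** k for k in range(2, 6)]  # text sizes 100 .. 100000 words
vocab = [expected_vocabulary(probs, n) for n in sizes]
beta = fit_heaps_exponent(sizes, vocab)
print(f"estimated Heaps exponent: {beta:.3f}")
```

Applying the same fit to yearly corpus totals (text size and number of distinct words per year) would give a time-resolved exponent of the kind the abstract describes, though the paper's actual estimation procedure may differ.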
Taxonomy
Topics: Language and cultural evolution · Authorship Attribution and Profiling · Opinion Dynamics and Social Influence
