Verifying Heaps' law using Google Books Ngram data

Vladimir V. Bochkarev; Eduard Yu. Lerner; Anna V. Shevlyakova

arXiv:1612.09213 · cs.CL · March 30, 2020

Open Access

TL;DR

This paper investigates the validity of Heaps' law across European languages by analyzing Google Books Ngram data, revealing significant variations in the Heaps exponent over 60-100 year periods.

Contribution

It provides an empirical verification of Heaps' law using large-scale Ngram data and uses a simple probability model of text generation to relate the word frequency distribution to the expected growth of the number of distinct words with text size.

Findings

01

Heaps' exponent varies significantly over time.

02

The analysis covers multiple European languages.

03

The variation occurs within characteristic time intervals of 60-100 years.

Abstract

This article is devoted to verifying the empirical Heaps' law in European languages using Google Books Ngram corpus data. The connection between the word frequency distribution and the expected dependence of the number of distinct words on text size is analysed in terms of a simple probability model of text generation. It is shown that the Heaps exponent varies significantly within characteristic time intervals of 60-100 years.
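
For orientation, Heaps' law is usually written as V(N) ≈ K·N^β, where V is the number of distinct words in a text of N tokens and β is the Heaps exponent. The sketch below illustrates the kind of relationship the abstract describes, under assumptions that are not taken from the paper: words are drawn independently from a fixed Zipf-like distribution (the exponent `alpha` and the vocabulary size are arbitrary illustrative values), and β is estimated by a straight-line fit in log-log coordinates. The function names are hypothetical, and this is not the authors' procedure.

```python
import numpy as np


def zipf_probs(vocab_size, alpha):
    """Word probabilities p_r proportional to r^(-alpha) (assumed for illustration)."""
    ranks = np.arange(1, vocab_size + 1, dtype=float)
    weights = ranks ** (-alpha)
    return weights / weights.sum()


def expected_vocabulary(probs, n_tokens):
    """Expected number of distinct words after N independent draws.

    Word i appears at least once with probability 1 - (1 - p_i)^N,
    so E[V(N)] is the sum of these probabilities over all words.
    """
    log_miss = np.log1p(-np.asarray(probs, dtype=float))  # log(1 - p_i)
    # -expm1(x) = 1 - exp(x), computed stably for small |x|
    return np.array([np.sum(-np.expm1(n * log_miss)) for n in n_tokens])


def heaps_exponent(n_tokens, vocab):
    """Slope of log V against log N, i.e. the fitted Heaps exponent."""
    slope, _intercept = np.polyfit(np.log(n_tokens), np.log(vocab), 1)
    return slope


# Illustrative parameters only; none of these values come from the paper.
probs = zipf_probs(vocab_size=200_000, alpha=1.2)
sizes = np.logspace(3, 5, 10)  # text sizes from 10^3 to 10^5 tokens
vocab = expected_vocabulary(probs, sizes)
beta = heaps_exponent(sizes, vocab)
print(f"fitted Heaps exponent: {beta:.2f}")
```

Repeating such a fit on the observed vocabulary counts for successive time windows of a corpus is one way to track how the exponent changes over time; the windowing and fitting choices used in the paper may differ.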

Taxonomy

Topics: Language and cultural evolution · Authorship Attribution and Profiling · Opinion Dynamics and Social Influence