A statistical test for the Zipf's law by deviations from the Heaps' law

Mikhail Chebunin; Artyom Kovalevskii

arXiv:1711.01083·math.ST·May 2, 2019

TL;DR

This paper introduces a statistical test for Zipf's law in texts that is based on deviations from Heaps' law, under a probabilistic model in which words are chosen independently from a power-law distribution.

Contribution

It connects Bahadur's results on the growth of the number of different words with the empirical Zipf's and Heaps' laws, and develops a corresponding statistical test for Zipf's law.

Findings

01

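Under the model, a word probability that is asymptotically a power function of rank implies that the number of different words is asymptotically a power function of text length.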

02

The proposed test detects deviations from Zipf's law through deviations of vocabulary growth from Heaps' law.

03

The analysis links Bahadur's asymptotic results to the empirically observed Zipf's and Heaps' laws.

Abstract

We explore a probabilistic model of an artistic text: words of the text are chosen independently of each other in accordance with a discrete probability distribution on an infinite dictionary. The words are enumerated 1, 2, $\dots$, and the probability that the $i$-th word appears is asymptotically a power function of $i$. Bahadur proved that in this case the number of different words, as a function of the length of the text, is asymptotically a power function, too. On the other hand, in the applied statistics community there are statements supported by empirical observations: Zipf's and Heaps' laws. We highlight the links between Bahadur's results and Zipf's/Heaps' laws, and introduce and analyse a corresponding statistical test.
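
The model described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' test. It only draws words independently from a power-law distribution and checks that the count of different words grows roughly like a power of the text length. The function names, the exponent 2.0, the text length, and the truncation of the infinite dictionary at 100000 words are all assumptions made for illustration. For word probabilities proportional to $i^{-a}$ with $a > 1$, the fitted growth exponent would be expected to lie near $1/a$.

```python
import math
import random
from itertools import accumulate


def simulate_vocabulary_growth(n_words, exponent, vocab_size, seed=0):
    """Return a list whose k-th entry is the number of distinct words
    among the first k + 1 independent draws."""
    rng = random.Random(seed)
    # Assumption: the infinite dictionary is truncated at vocab_size words,
    # with word i drawn with probability proportional to i ** (-exponent).
    cum_weights = list(accumulate(i ** -exponent for i in range(1, vocab_size + 1)))
    draws = rng.choices(range(1, vocab_size + 1), cum_weights=cum_weights, k=n_words)
    seen = set()
    growth = []
    for word in draws:
        seen.add(word)
        growth.append(len(seen))
    return growth


def fit_loglog_slope(growth, start):
    """Least-squares slope of log(distinct words) against log(text length),
    using text lengths from `start` up to len(growth)."""
    lengths = range(start, len(growth) + 1)
    xs = [math.log(k) for k in lengths]
    ys = [math.log(growth[k - 1]) for k in lengths]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Illustrative parameters (assumptions, not values from the paper).
growth = simulate_vocabulary_growth(n_words=20000, exponent=2.0, vocab_size=100000)
slope = fit_loglog_slope(growth, start=100)
print(f"distinct words: {growth[-1]}, fitted growth exponent: {slope:.3f}")
```

The log-log regression here is only a rough descriptive fit of the Heaps-type exponent. The paper's test statistic is built from deviations from Heaps' law and is not reproduced by this sketch.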
