A statistical test for correspondence of texts to the Zipf-Mandelbrot law
Anik Chakrabarty, Mikhail Chebunin, Artyom Kovalevskii, Ilya Pupyshev, Natalia Zakrevskaya, Qianqian Zhou

TL;DR
This paper introduces a statistical test based on a probabilistic model to assess how well texts follow the Zipf-Mandelbrot law, using convergence to a Gaussian process for analysis.
Contribution
It develops a novel method for testing text conformity to the Zipf-Mandelbrot law through empirical process convergence and provides algorithms for practical implementation.
Findings
Texts in multiple languages conform to the model to varying degrees
The empirical process converges to a Gaussian process with continuous paths
The method effectively distinguishes texts in different languages by their statistical properties
Abstract
We analyse the correspondence of a text to a simple probabilistic model. The model assumes that the words are selected independently from an infinite dictionary. The probability distribution corresponds to the Zipf-Mandelbrot law. We sequentially count the number of different words in the text, obtaining the process of the numbers of different words. We then estimate the parameters of the Zipf-Mandelbrot law from the same sequence and construct an estimate of the expected number of different words in the text. Next, we subtract the corresponding values of this estimate from the sequence and normalize along the coordinate axes, obtaining a random process on the segment from 0 to 1. We prove that this process (the empirical text bridge) converges weakly in the uniform metric to a centered Gaussian process with almost surely continuous paths. We develop and implement an algorithm for…
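The counting, centering, and normalizing steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: as a loudly stated assumption, it replaces the paper's Zipf-Mandelbrot parameter estimates with a simpler Heaps-type power law, taking the expected number of different words among the first k words to grow like k^theta, with theta estimated from the counts at n/2 and n. The function names `distinct_word_counts` and `empirical_bridge` are hypothetical.

```python
import math


def distinct_word_counts(words):
    """Return [R_1, ..., R_n], where R_k is the number of different
    words among the first k words of the text."""
    seen = set()
    counts = []
    for word in words:
        seen.add(word)
        counts.append(len(seen))
    return counts


def empirical_bridge(counts):
    """Center and normalize the counting process (simplified sketch).

    ASSUMPTION: E[R_k] is approximated by (k/n)**theta * R_n, with
    theta = log2(R_n / R_{n//2}). This power-law stand-in is NOT the
    Zipf-Mandelbrot estimator developed in the paper.
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two words")
    r_n = counts[-1]
    r_half = counts[n // 2 - 1]  # R at position floor(n/2)
    theta = math.log2(r_n / r_half)
    bridge = [
        (counts[k - 1] - (k / n) ** theta * r_n) / math.sqrt(r_n)
        for k in range(1, n + 1)
    ]
    return theta, bridge


# Toy input; a real text would be tokenized into words first.
words = "a b a c b d".split()
counts = distinct_word_counts(words)
theta, bridge = empirical_bridge(counts)
```

By construction the last value of `bridge` is zero, which is what makes the process a bridge on the segment from 0 to 1. A test statistic would then be computed from the path (for instance its maximum absolute value or an integral of its square) and compared with the distribution implied by the limiting Gaussian process; that step depends on the covariance derived in the paper and is not sketched here.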
Taxonomy
Topics: Data Management and Algorithms · Bayesian Methods and Mixture Models · Fractal and DNA sequence analysis
Methods: Gaussian Process
