Asymptotically normal estimators for Zipf's law
Mikhail Chebunin, Artyom Kovalevskii

Abstract
Zipf's law states that sequential frequencies of words in a text correspond to a power function. Its probabilistic model is an infinite urn scheme with asymptotically power distribution. The exponent of this distribution must be estimated. We use the number of different words in a text and similar statistics to construct asymptotically normal estimators of the exponent.
Mikhail Chebunin: Novosibirsk State University, Novosibirsk, Russia.
Artyom Kovalevskii: Novosibirsk State Technical University, Novosibirsk State University, Novosibirsk State University of Economics and Management, Novosibirsk, Russia.
The research was supported by RFBR grant 17-01-00683.
Keywords: infinite urn scheme, Zipf’s law, asymptotic normality.
1 Introduction
Zipf’s law (Zipf, 1949) states that sequential frequencies of words in a text are equal to , , , . Its modification is Mandelbrot’s law (Mandelbrot, 1965) , .
A probabilistic interpretation of these and similar laws is the infinite urn scheme studied by Bahadur (1960) and Karlin (1967). There are balls that are distributed to urns independently and randomly; there are infinitely many urns. Each ball goes to urn with probability , (frequencies converge a.s. to probabilities).
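The urn scheme can be simulated directly. The following is a minimal sketch, assuming a pure power law for the urn probabilities and a truncation to finitely many urns (the truncation level `n_urns`, the seed, and the function name `sample_urns` are hypothetical choices made only for illustration).

```python
import random
from collections import Counter
from itertools import accumulate


def sample_urns(n, theta, n_urns=10_000, seed=0):
    """Throw n balls into urns with p_i proportional to i**(-1/theta).

    The infinite scheme is truncated to n_urns urns (an assumption made
    only to keep the sketch finite); random.choices normalises the weights.
    """
    rng = random.Random(seed)
    weights = [i ** (-1.0 / theta) for i in range(1, n_urns + 1)]
    cum = list(accumulate(weights))
    urns = rng.choices(range(1, n_urns + 1), cum_weights=cum, k=n)
    return Counter(urns)  # maps urn index -> number of balls in that urn


counts = sample_urns(n=5_000, theta=0.5)
print("balls thrown:", sum(counts.values()))
print("nonempty urns:", len(counts))
print("most loaded urn (index, balls):", counts.most_common(1)[0])
```

In the text interpretation, an urn is a word of the vocabulary and a ball is a word occurrence, so `len(counts)` plays the role of the number of different words.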
We assume that and that one of the following asymptotics holds (the second is more general than the first):
[TABLE]
, (this assumption includes Zipf’s and Mandelbrot’s laws);
[TABLE]
, is a slowly varying function of .
Our aim is to construct asymptotically normal estimators of under (1). We will prove their strong consistency under (2). To this end we use statistics that have been studied by Bahadur (1960), Karlin (1967), Dutko (1989), Key (1992, 1996), Zakrevskaya and Kovalevskii (2001), Gnedin, Hansen and Pitman (2007), Boonta and Neammanee (2007), Hwang and Janson (2008), Bogachev, Gnedin and Yakubovich (2008), Barbour (2009), Barbour and Gnedin (2009), Ohannessian and Dahleh (2012), Chebunin (2014), Chebunin and Kovalevskii (2016), Muratov and Zuyev (2016), Ben-Hamou, Boucheron and Ohannessian (2017).
Let us denote by the number of balls in urn . Then is the number of nonempty urns, and is the number of urns with at least balls
[TABLE]
Note that . The numbers of urns with exactly balls are . The number of urns with an odd number of balls is
[TABLE]
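The statistics just defined are simple functions of the urn occupancy counts. A minimal sketch of how they could be computed from a sequence of observations follows; the function name `occupancy_statistics` and the toy sentence are hypothetical.

```python
from collections import Counter


def occupancy_statistics(balls, k):
    """Return (nonempty, at_least_k, exactly_k, odd) urn counts.

    balls is any sequence of hashable labels (for a text: its words),
    and k is the occupancy threshold.
    """
    counts = Counter(balls)
    nonempty = len(counts)
    at_least_k = sum(1 for c in counts.values() if c >= k)
    exactly_k = sum(1 for c in counts.values() if c == k)
    odd = sum(1 for c in counts.values() if c % 2 == 1)
    return nonempty, at_least_k, exactly_k, odd


words = "the cat and the dog and the bird".split()
print(occupancy_statistics(words, k=2))  # prints (5, 2, 1, 4)
```

Here "the" occurs three times, "and" twice, and the other three words once, so five urns are nonempty, two hold at least two balls, one holds exactly two, and four hold an odd number.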
Karlin (1967) suggested studying a random sample with a random number of experiments . Here is a Poisson process with parameter . The random choice of an urn and the Poisson process are independent. The processes are independent Poisson processes with parameters . Along with the listed papers, poissonization was used by Ben-Hamou, Boucheron and Gassiat (2016) in estimating codes on countable alphabets, by Durieu and Wang (2016) to prove a functional CLT for some randomization of the statistics and , by Grubel and Hitczenko (2009) in studying limit distributions of gaps in discrete random samples, and by Khmaladze (2011) for more general occupancy schemes.
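Poissonization can be illustrated with a short simulation. This is a minimal sketch, assuming a rate-1 Poisson process and a finite list of urn probabilities; the names `poisson_count` and `poissonized_sample` and the three-urn example are hypothetical.

```python
import random
from collections import Counter
from itertools import accumulate


def poisson_count(t, rng):
    """Number of points of a rate-1 Poisson process on [0, t],
    obtained by adding independent Exp(1) interarrival times."""
    count, clock = 0, rng.expovariate(1.0)
    while clock <= t:
        count += 1
        clock += rng.expovariate(1.0)
    return count


def poissonized_sample(t, probs, rng):
    """Throw a Poisson(t) number of balls into urns chosen with
    probabilities probs, independently of the ball count."""
    n_balls = poisson_count(t, rng)
    cum = list(accumulate(probs))
    urns = rng.choices(range(1, len(probs) + 1), cum_weights=cum, k=n_balls)
    return n_balls, Counter(urns)


rng = random.Random(42)
probs = [0.5, 0.3, 0.2]
n_balls, counts = poissonized_sample(2000.0, probs, rng)
print(n_balls, dict(counts))
```

With a random Poisson number of balls, the occupancy counts of different urns become independent Poisson variables, which is what makes the poissonized statistics easier to analyse than their fixed-sample counterparts.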
From definition
[TABLE]
Karlin (1967) introduced the function and proved that (2) implies , where is a slowly varying function as .
Karlin proved SLLNs for all these statistics under (2). He also proved CLTs for , and for the vector for any finite .
Karlin proved that the expectations of all these statistics are asymptotically proportional to with a coefficient depending on only. This law was found empirically for texts (with ) by Herdan (1960) and Heaps (1978, Sect. 3.7). Interestingly, modern large-scale studies of languages demonstrate a deviation from this law (Petersen et al., 2012), which is interpreted as a decreasing need for new words.
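The power-type growth of the number of different words can be checked on simulated data. The sketch below fits a least-squares slope on a log-log scale; it is a naive illustration of the Herdan and Heaps law under an assumed pure power law truncated to 20,000 urns, not one of the estimators studied in this paper. All function names, the seed and the checkpoints are hypothetical.

```python
import math
import random
from itertools import accumulate


def vocabulary_growth(balls, checkpoints):
    """Number of distinct values among the first n balls, for each n."""
    seen, growth, targets = set(), [], set(checkpoints)
    for n, ball in enumerate(balls, start=1):
        seen.add(ball)
        if n in targets:
            growth.append((n, len(seen)))
    return growth


def loglog_slope(points):
    """Least-squares slope of log(y) against log(x)."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(1)
theta = 0.5  # assumed true exponent of the simulated scheme
weights = [i ** (-1.0 / theta) for i in range(1, 20_001)]
cum = list(accumulate(weights))
balls = rng.choices(range(1, 20_001), cum_weights=cum, k=20_000)
growth = vocabulary_growth(balls, [500 * 2 ** j for j in range(6)])
slope = loglog_slope(growth)
print(growth)
print("fitted exponent:", round(slope, 3))
```

The fitted slope should be in the vicinity of the assumed exponent, but this regression comes with no distributional guarantee, which is the gap the asymptotically normal estimators are meant to close.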
The authors do not know of any estimator of with proved asymptotic normality. The estimator of Zakrevskaya and Kovalevskii (2001), found by a substitution method, is (as we will see) asymptotically normal for Zipf’s law, but its authors proved consistency only. The estimator of Chebunin (2014) is strongly consistent but is not asymptotically normal. We will prove asymptotic normality of the estimators of Ohannessian and Dahleh (2012) under (1); their authors proved only strong consistency under (2).
The rest of the paper is organized as follows. In Section 2 we construct asymptotically normal estimators of using only one of the statistics. This is possible only if we know the constant (it can be a differentiable function of ) in (1), and all the estimators in this case are implicit. In Section 3 we prove asymptotic normality of estimators that are based on two statistics. We use multidimensional CLTs for that were proved by Karlin (1967) and for that we prove in the Appendix in a functional generalization.
We use the notation for weak convergence to a normal distribution with zero mean and variance . All convergences are as .
2 Implicit estimators that use only one statistic
We will prove a general theorem for an abstract statistic in the infinite urn scheme with the necessary properties. Then we will prove that these properties hold for all statistics under consideration if one assumes (1).
Let as , where is a slowly varying function. Let us define as a solution of the equation
[TABLE]
As , so
[TABLE]
So is a strongly consistent estimator of . We will study asymptotic normality of . Let
[TABLE]
is a slowly varying function as .
Theorem 1
Let (4) hold, let
[TABLE]
and let be a solution of (3). Then
[TABLE]
Proof. . From (4)
[TABLE]
as . Then
[TABLE]
[TABLE]
[TABLE]
in probability as . The theorem is proved.
If is differentiable on then as . Indeed, , and
[TABLE]
Let , (2) holds and as . Then . For example,
[TABLE]
is an integer, is the Riemann zeta function. In this case . From the SLLN
[TABLE]
If we use the estimator (it is consistent, Chebunin (2014)) then goes to some constant a.s. So we need implicit estimators for asymptotic normality. We will base implicit estimators on , or . Karlin (1967) proved
[TABLE]
[TABLE]
[TABLE]
[TABLE]
Lemma 1
If then
[TABLE]
[TABLE]
Proof. The following convergences hold (see Karlin (1967) and Gnedin, Hansen and Pitman (2007), Lemma 1)
[TABLE]
We use the representation of Karlin (1967), integration by parts and the substitution :
[TABLE]
[TABLE]
The argument for and is analogous. The proof is complete.
Lemma 2
If (1) holds then .
Proof. Let us solve the equation for sufficiently large .
[TABLE]
[TABLE]
Proof is complete.
Corollary 1
If (1) holds, is known, exists, , , are solutions of equations
[TABLE]
respectively, then
[TABLE]
[TABLE]
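An implicit estimator of this kind can be computed numerically by root finding. The following is a minimal sketch, assuming the estimating equation has the first-order form Γ(1 − θ)(cn)^θ = R, where R is the observed number of nonempty urns, n is the sample size and c is a known constant that does not depend on θ. This form combines Karlin's asymptotics for the expected number of nonempty urns with a pure power law; the exact equations of Corollary 1 may differ, and the function name `implicit_theta` and the numbers used are hypothetical.

```python
import math


def implicit_theta(r_n, n, c, tol=1e-10):
    """Solve gamma(1 - theta) * (c * n) ** theta = r_n for theta in (0, 1).

    Assumes c * n > 1 and r_n > 1, so the left-hand side increases from
    1 to infinity on (0, 1) and the root is unique; bisection then works.
    """
    def excess(theta):
        return math.gamma(1.0 - theta) * (c * n) ** theta - r_n

    lo, hi = 1e-9, 1.0 - 1e-9
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            hi = mid  # left-hand side too large: the root is below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


n, c = 10_000, 0.5
r_n = math.gamma(1 - 0.4) * (c * n) ** 0.4  # noiseless value at theta = 0.4
print(round(implicit_theta(r_n, n, c), 6))
```

Bisection is used here because the assumed left-hand side is monotone in θ; any bracketing root finder would serve equally well.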
3 Explicit estimators based on two statistics
Let the parameter (function) be unknown. In this case we need two statistics to estimate . Some of the following estimators were proposed by Ohannessian and Dahleh (2012). We will prove their asymptotic normality. Note that the rates of convergence will be slower in this case.
Theorem 2
If then
[TABLE]
Proof. Using SLLN we have
[TABLE]
[TABLE]
Then we calculate the limiting variance using Corollary 3. The proof is complete.
Note that for .
Theorem 3
If then
[TABLE]
[TABLE]
is the Beta function.
Proof. Using SLLN we have
[TABLE]
[TABLE]
[TABLE]
Then we calculate the limiting variance on the basis of Theorem 5 in Karlin (1967). The proof is complete.
From Lemma 1 and Lemma 2 we obtain the following corollary.
Corollary 2
The assumptions of Theorem 2 and Theorem 3 hold under (1).
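An explicit two-statistic estimator needs no root finding. The sketch below assumes the ratio form k · (number of urns with exactly k balls) / (number of urns with at least k balls); for k = 1 this is the share of nonempty urns that hold a single ball. The ratio form is an assumption of this sketch, motivated by Karlin's asymptotics for the expectations of the two counts, and is not claimed to be the exact statistic of Theorem 2 or Theorem 3. The function name `ratio_estimate` and the toy sentence are hypothetical.

```python
from collections import Counter


def ratio_estimate(balls, k=1):
    """Return k * #(urns with exactly k balls) / #(urns with >= k balls)."""
    counts = Counter(balls).values()
    exactly_k = sum(1 for c in counts if c == k)
    at_least_k = sum(1 for c in counts if c >= k)
    if at_least_k == 0:
        raise ValueError("no urn holds at least k balls")
    return k * exactly_k / at_least_k


words = "the cat and the dog and the bird".split()
print(ratio_estimate(words, k=1))  # 3 singletons among 5 urns: 0.6
print(ratio_estimate(words, k=2))  # 2 * 1 / 2: 1.0
```

On a sample this small the values are only arithmetic checks; the ratio becomes informative about the exponent only for long texts.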
Appendix: Functional Central Limit Theorem
Let for
[TABLE]
Theorem 4
Let us assume that (2) holds and is an integer. Then the random process converges weakly in the uniform metric in to a -dimensional Gaussian process with zero expectation and covariance function ,
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
.
Proof. We rely on Theorem 3 in Chebunin and Kovalevskii (2016) and use the formulas
[TABLE]
[TABLE]
Proof is complete.
The limiting -dimensional Gaussian process is self-similar with Hurst parameter . Its first component coincides in distribution with the first component of the limiting process in Theorem 1 in Durieu and Wang (2016).
We need a specific corollary to calculate the limiting variance in Theorem 2.
Corollary 3
Under the assumptions of Theorem 4, the random vector converges weakly to a normal one with zero mean and covariance matrix
[TABLE]
Acknowledgement Our research was supported by RFBR grant 17-01-00683.
References
Bahadur, R. R., 1960. On the number of distinct values in a large sample from an infinite discrete distribution. Proceedings of the National Institute of Sciences of India, 26A, Supp. II, 67–75.
Barbour, A. D., 2009. Univariate approximations in the infinite occupancy scheme. Alea 6, 415–433.
Barbour, A. D., Gnedin, A. V., 2009. Small counts in the infinite occupancy scheme. Electronic Journal of Probability, Vol. 14, Paper no. 13, 365–384.
Ben-Hamou, A., Boucheron, S., Gassiat, E., 2016. Pattern coding meets censoring: (almost) adaptive coding on countable alphabets. Preprint. arXiv:1608.08367.
Ben-Hamou, A., Boucheron, S., Ohannessian, M. I., 2017. Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications. Bernoulli, V. 23, 249–287.
Bogachev, L. V., Gnedin, A. V., Yakubovich, Y. V., 2008. On the variance of the number of occupied boxes. Adv. Appl. Math., V. 40, 401–432.
Boonta, S., Neammanee, K., 2007. Bounds on random infinite urn model. Bulletin of the Malaysian Mathematical Sciences Society. Second Series, V. 30.2, 121–128.
Chebunin, M. G., 2014. Estimation of parameters of probabilistic models which is based on the number of different elements in a sample. Sib. Zh. Ind. Mat., 17:3, 135–147 (in Russian).
Chebunin, M., Kovalevskii, A., 2016. Functional central limit theorems for certain statistics in an infinite urn scheme. Statistics and Probability Letters, V. 119, 344–348.
Durieu, O., Wang, Y., 2016. From infinite urn schemes to decompositions of self-similar Gaussian processes. Electron. J. Probab., V. 21, paper No. 43.
Dutko, M., 1989. Central limit theorems for infinite urn models, Ann. Probab. 17, 1255–1263.
Gnedin, A., Hansen, B., Pitman, J., 2007. Notes on the occupancy problem with infinitely many boxes: general asymptotics and power laws. Probability Surveys, Vol. 4, 146–171.
Grubel, R., and Hitczenko, P., 2009. Gaps in discrete random samples, J. Appl. Probab., V. 46, 1038–1051.
Heaps, H. S., 1978. Information Retrieval: Computational and Theoretical Aspects, Academic Press.
Herdan, G., 1960. Type-token mathematics, The Hague: Mouton.
Hwang, H.-K., Janson, S., 2008. Local Limit Theorems for Finite and Infinite Urn Models. The Annals of Probability, Vol. 36, No. 3, 992–1022.
Karlin, S., 1967. Central Limit Theorems for Certain Infinite Urn Schemes. Journal of Mathematics and Mechanics, Vol. 17, No. 4, 373–401.
Key, E. S., 1992. Rare Numbers. Journal of Theoretical Probability, Vol. 5, No. 2, 375–389.
Key, E. S., 1996. Divergence rates for the number of rare numbers. Journal of Theoretical Probability, Volume 9, No. 2, 413–428.
Khmaladze, E. V., 2011. Convergence properties in certain occupancy problems including the Karlin-Rouault law, J. Appl. Probab., V. 48, 1095–1113.
Mandelbrot, B., 1965. Information Theory and Psycholinguistics. In B.B. Wolman and E. Nagel. Scientific psychology. Basic Books.
Muratov, A., and Zuyev, S., 2016. Bit flipping and time to recover, J. Appl. Probab., V. 53, 650–666.
Ohannessian, M. I., Dahleh, M. A., 2012. Rare probability estimation under regularly varying heavy tails, Proceedings of the 25th Annual Conference on Learning Theory, PMLR 23:21.1–21.24.
Petersen, A. M., Tenenbaum, J. N., Havlin, S., Stanley, H. E., Perc, M., 2012. Languages cool as they expand: Allometric scaling and the decreasing need for new words. Scientific Reports 2, Article No. 943.
Zakrevskaya, N. S., Kovalevskii, A. P., 2001. One-parameter probabilistic models of text statistics. Sib. Zh. Ind. Mat., 4:2, 142–153 (in Russian).
Zipf, G. K., 1949. Human behavior and the principle of least effort. Cambridge: Univ. Press.
