Uniform versus Zipf distribution in a mixing collection process
Aristides V. Doumas, Vassilis G. Papanicolaou

TL;DR
This paper analyzes a variant of the collector's problem where coupon probabilities are a mix of uniform and Zipf distributions, deriving asymptotics and limiting distribution for the number of trials needed to collect all coupon types.
Contribution
It provides the first asymptotic analysis of the mixing of uniform and Zipf distributions in the collector's problem, including expectation, variance, and distribution results.
Findings
Asymptotic expectation of collection time derived
Variance and second moment asymptotics obtained
Limiting distribution of collection time established
Abstract
We consider the following variant of the classic collector's problem: The family of coupon probabilities is the mixing of two subfamilies, one of which is the uniform family, while the other belongs to the well-known Zipf family. We obtain asymptotics for the expectation, the second rising moment, and the variance of the random variable $T_N$, namely the number of trials needed for all the types of coupons to be collected (at least once, with replacement) as $N \to \infty$. It is interesting that the effect of the uniform subcollection on the asymptotics of the expectation of $T_N$ (at least up to the sixth term) appears only in the leading factor of the expectation of $T_N$. The limiting distribution of $T_N$ is derived as well. These results answer a question posed in a recent work of ours [Electron. J. Probab. 18 (2012) 1–15].
Department of Mathematics, National Technical University of Athens, Zografou Campus, 157 80 Athens, Greece
Keywords. Urn problems; coupon collector’s problem; generalized Zipf law; Gumbel distribution; mixing processes.
2010 AMS Mathematics Classification. 60F05; 60F99.
1 Introduction and motivation
The “coupon collector’s problem” (CCP) pertains to a population whose members are of $N$ different types. For $j = 1, \dots, N$ we denote by $p_j$ the probability that a member of the population is of type $j$, where $p_j > 0$ and $\sum_{j=1}^{N} p_j = 1$. We refer to the $p_j$’s as the coupon probabilities. The members of the population are sampled independently with replacement (alternatively, the population is assumed very large) and their types are recorded. Naturally, one quantity of interest is the number $T_N$ of trials needed until all $N$ types are detected (at least once). CCP belongs to the family of the so-called urn problems and it has been studied extensively; see, e.g., [5] and the references therein. Moreover, due to its applications in several areas of science, new variants keep arising.
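The sampling scheme just described can be simulated directly. The following is a minimal sketch, not taken from the paper: the function name `collect_all`, the seed, and the choice of ten uniform types are illustrative assumptions.

```python
import random

def collect_all(probabilities, rng):
    """Sample types with replacement until every type has been seen; return the number of trials."""
    types = range(len(probabilities))
    seen = set()
    trials = 0
    while len(seen) < len(probabilities):
        # One trial: draw a single type according to the coupon probabilities.
        seen.add(rng.choices(types, weights=probabilities)[0])
        trials += 1
    return trials

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
uniform = [1 / 10] * 10  # illustrative case: ten equally likely types
runs = [collect_all(uniform, rng) for _ in range(2000)]
print(sum(runs) / len(runs))  # sample mean; the classical uniform expectation is 10 * H_10, about 29.29
```

Any list of positive probabilities summing to one can be passed in place of `uniform`, so the same routine can be used to explore non-uniform families empirically.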
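Let $\alpha = \{a_j\}_{j=1}^{\infty}$ be a sequence of strictly positive numbers. Then, for each integer $N > 0$ one can create a probability measure $\pi_N = \{p_1, \dots, p_N\}$ on the set of types $\{1, \dots, N\}$ by taking
$$p_j = \frac{a_j}{A_N}, \qquad \text{where} \quad A_N = \sum_{j=1}^{N} a_j. \tag{1}$$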
In a recent work (see [4]) the authors asked what happens in the average when the sequence is the “union” of two subsequences one of which is constant (this corresponds to a uniform subcollection of coupons), while the other obeys some rather general law, in particular the law of the well-known Zipf family. (In [4] the authors also asked the same question when the family of coupon probabilities is the “mixing” of two constant subsequences; for an answer see [6].) Zipf, this surprising law of nature, arises in many areas of science, such as computer science, physics, biology, earth and planetary sciences, economics and finance, as well as linguistics, demography, and the social sciences (see, e.g., the highly cited article [12] of Mark Newman, where he reviewed some of the empirical evidence for the existence of power-law forms, and the recent work [10] of Locey and Lennon on the applications of power laws in biology).
In this paper we answer the above question by deriving the asymptotics of the expectation and of the second moment (up to the fifth and sixth term respectively) as $N \to \infty$, as well as the limit distribution of $T_N$ (under the appropriate normalization). Let
[TABLE]
where is a sequence of strictly positive numbers of the form
[TABLE]
The case where $p = 1$ corresponds to the standard Zipf distribution. For general positive values of $p$ we have the so-called generalized Zipf subfamily of coupons.
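To make the mixed family concrete, here is a small sketch that builds the coupon probabilities by normalizing the weights as in (1). It assumes an even number of types split equally between uniform weights equal to 1 and generalized Zipf weights $1/j^p$; the ordering of the two blocks and the name `mixed_probabilities` are illustrative choices, not the paper's notation.

```python
def mixed_probabilities(N, p):
    """Return normalized probabilities for N/2 uniform weights (1) and N/2 Zipf weights (1 / j**p)."""
    if N % 2 != 0:
        raise ValueError("this sketch assumes an even number of coupon types")
    half = N // 2
    # Assumed layout: the uniform block first, then the Zipf block.
    weights = [1.0] * half + [1.0 / j**p for j in range(1, half + 1)]
    total = sum(weights)  # normalizing constant, A_N in the notation of (1)
    return [w / total for w in weights]

probs = mixed_probabilities(100, 1.0)
print(sum(probs), min(probs))  # the rarest coupon is the last Zipf type
```

The smallest probability belongs to the last Zipf coupon, which is the type that dominates the collection time.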
Testing uniform and the standard Zipf distribution is not a new idea. We refer the reader to the highly cited articles [11] on search and replication in unstructured peer-to-peer networks, and [2] on benchmarking cloud serving systems with the Yahoo! Cloud Serving Benchmark (YCSB) framework. However, in this paper we consider the problem of the coexistence of uniform and generalized Zipf distributions in the same model. The question about the effect of the uniform–Zipf distribution on the average of the random variable $T_N$ arises naturally. As we will see, the uniform subcollection acts on the asymptotics of the expectation of $T_N$ only in the leading factor (at least up to the fifth term of its asymptotic expansion). The same holds for the second rising moment of $T_N$ up to the sixth term. In comparison with the classic version of the problem (when all coupons are uniformly distributed), or with the case where all coupons are Zipf distributed, the uniform subcollection (in the mixing case studied here) causes a significant increase in the number of trials needed for a complete set of coupons. This will be illustrated via an example at the end of the paper.
2 Main results
It is well known (see, e.g., [9]) that the expectation of $T_N$ can be expressed as
$$E[T_N] = \int_0^{\infty} \Big[1 - \prod_{j=1}^{N} \big(1 - e^{-p_j t}\big)\Big]\, dt. \tag{4}$$
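The integral representation of the expectation can be checked numerically. The sketch below assumes that (4) is the standard formula $E[T_N] = \int_0^\infty [1 - \prod_j (1 - e^{-p_j t})]\,dt$ and approximates it with a plain trapezoidal rule; the truncation point `t_max` and the step count are ad hoc choices that suit this small uniform test case and would need to grow when some probabilities are very small.

```python
import math

def expected_trials(probabilities, t_max=400.0, steps=40000):
    """Trapezoidal approximation of the integral of 1 - prod_j (1 - exp(-p_j t)) over [0, t_max]."""
    def integrand(t):
        product = 1.0
        for p in probabilities:
            product *= 1.0 - math.exp(-p * t)
        return 1.0 - product

    h = t_max / steps
    total = 0.5 * (integrand(0.0) + integrand(t_max))
    for i in range(1, steps):
        total += integrand(i * h)
    return total * h

N = 5
approx = expected_trials([1 / N] * N)
exact = N * sum(1 / k for k in range(1, N + 1))  # classical uniform answer N * H_N
print(approx, exact)
```

For uniform coupons the two printed values should agree closely, which gives some confidence in using the same routine on other probability families.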
From now on we assume that $N$ is even and for convenience we set
[TABLE]
By substituting and thanks to the binomial theorem, formula (4) (in view of (2)) yields
[TABLE]
Notice that from (1) and (2) we have
[TABLE]
The study of the quantity of (7) is an external matter. In particular, one easily gets its full asymptotic expansion via the celebrated Euler–Maclaurin summation formula, as we will shortly see in the last step of the proof of our main theorem. Let be the number of trials needed for one to collect (with replacement) all different types of coupons when the coupon probabilities are
[TABLE]
Then, (4) implies
[TABLE]
Thus, (6) yields
[TABLE]
The main results of the paper are presented in the following
Theorem 1
Let the sequence be the “union” of two subsequences, as given by (2), (3), one of which is constant (this corresponds to a uniform subcollection of coupons), while the other belongs to the generalized Zipf family of (3). Then, as $N \to \infty$ we have
[TABLE]
where $\gamma$ is, as usual, the Euler–Mascheroni constant. Regarding the second rising moment $E[T_N(T_N+1)]$ and the variance of the r.v. $T_N$ we have
[TABLE]
[TABLE]
Moreover, $T_N$ appropriately normalized converges in distribution to a standard Gumbel random variable. More precisely, as $N \to \infty$
[TABLE]
where,
[TABLE]
Proof of Theorem 1. Starting from (9) (recall that ), we focus on the quantities
[TABLE]
If we set
[TABLE]
then, in view of (3) and under the change the variables formula (15) becomes
[TABLE]
The following result is important for our analysis:
[TABLE]
uniformly in , for any fixed .
The proof is based on the method of integration by parts and is omitted. By (18), the comparison of sums and integrals, and the Taylor expansion of the logarithm we get
[TABLE]
Taking advantage of (19) and for any given we rewrite (17) as
[TABLE]
where
[TABLE]
As we will see all the information we need comes from . Starting from (23) and using (19) we get
[TABLE]
By invoking (18) and integrating by parts the above becomes
[TABLE]
Our next task is of (22). By applying the Taylor expansion of the logarithm and using the comparison of sums and integrals, as well as the result presented in formula (18) (since in this case is strictly positive), and finally, changing the variables as
[TABLE]
one arrives at
[TABLE]
Since, for , , the integral appearing in (25) yields
[TABLE]
In order to obtain the leading behavior of the integral above as it suffices to work with the integral
[TABLE]
Changing the variables as and applying the Laplace method for integrals (see, e.g., [1]) we arrive at
[TABLE]
and by invoking (25) one gets
[TABLE]
From (24) and (26) one has that is negligible compared to as . Finally, for of (21) we have
[TABLE]
From (18) and (26) one has that is negligible compared to as and, as we have seen, the same argument holds for . Hence, from (20) we get
[TABLE]
To complete our analysis, and in view of (9), one must obtain the leading term of the quantity
[TABLE]
It is not hard to check that
[TABLE]
Let us now return to (9) and the quantity . Under (3) the first five terms of the asymptotics of (as ) are known. In particular, (see [3] and [5])
[TABLE]
By invoking (28) and (29) in (9) we have
[TABLE]
Last step before the expectation. To obtain the asymptotics of $E[T_N]$ one has to investigate the asymptotics of $A_N$. By the celebrated Euler–Maclaurin summation formula (see, e.g., [1]) the full asymptotic expansion of $A_N$ is known (as $N \to \infty$). In particular, the form of this expansion depends on the behaviour of the series $\sum_{j=1}^{\infty} j^{-p}$. If $p > 1$ we have
[TABLE]
where $\zeta(\cdot)$ denotes the Riemann zeta function, while for $0 < p < 1$ we have
[TABLE]
For $p = 1$, namely the case of the standard Zipf distribution, we have
[TABLE]
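The three regimes can be illustrated numerically. The sketch below assumes that the Zipf contribution to the normalizing constant is the partial sum $\sum_{j=1}^{M} j^{-p}$ and compares it with its standard leading behaviour in each regime: $\zeta(p)$ for $p > 1$ (here $\zeta(2) = \pi^2/6$), $\ln M + \gamma$ for $p = 1$, and $M^{1-p}/(1-p)$ for $0 < p < 1$. The value of `M` and the sample exponents are arbitrary.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler–Mascheroni constant

def zipf_partial_sum(M, p):
    """Return sum_{j=1}^{M} j**(-p)."""
    return sum(j ** (-p) for j in range(1, M + 1))

M = 100_000
convergent = zipf_partial_sum(M, 2.0)   # p > 1: the series converges
harmonic = zipf_partial_sum(M, 1.0)     # p = 1: logarithmic growth
divergent = zipf_partial_sum(M, 0.5)    # 0 < p < 1: power-law growth

print(convergent, math.pi**2 / 6)
print(harmonic, math.log(M) + EULER_GAMMA)
print(divergent, M**0.5 / 0.5)
```

In every regime the partial sum is of smaller order than $M$, which is the number of uniform coupons when the types are split equally.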
Claim. The effect of the uniform subcollection on the asymptotics of the expectation of $T_N$ (at least up to the sixth term) appears only in the leading factor of (30). To wit (see (30)) it suffices to check that as $N \to \infty$
[TABLE]
The proof of (34) is immediate in all three cases given in (31)–(33). The result for the expectation of the r.v. $T_N$ now follows by invoking (34) in (30).
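Second moment, variance and distribution of $T_N$. Mimicking the derivation of the asymptotics of $E[T_N]$ it is straightforward to get the asymptotics of the second rising moment of the random variable $T_N$. We have (see, e.g., [3])
$$E\big[T_N^{(2)}\big] = 2\int_0^{\infty} \Big[1 - \prod_{j=1}^{N} \big(1 - e^{-p_j t}\big)\Big]\, t\, dt, \tag{35}$$
where we have used the notation $T_N^{(2)} := T_N(T_N + 1)$. Similarly to formula (6) we get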
[TABLE]
Likewise, similarly to (9) one has
[TABLE]
where
[TABLE]
and as in (8)
[TABLE]
Under formula (3) the first six terms of the asymptotics of (as ) are known (see [3] and [5]). Finally, one arrives at the desired result. Again, the effect of the uniform subcollection on the asymptotics of the second rising moment of the random variable $T_N$ appears only in the leading factor of the second rising moment of the random variable $T_N$.
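The link between the second rising moment and the variance can be checked on a small case. The sketch below assumes the standard integral representations $E[T_N] = \int_0^\infty f(t)\,dt$ and $E[T_N(T_N+1)] = 2\int_0^\infty t f(t)\,dt$ with $f(t) = 1 - \prod_j (1 - e^{-p_j t})$, and uses the identity $V[T_N] = E[T_N(T_N+1)] - E[T_N] - E[T_N]^2$. The function name and the quadrature parameters are illustrative.

```python
import math

def moment_integrals(probabilities, t_max=300.0, steps=60000):
    """Return trapezoidal approximations of E[T] and E[T(T+1)] from their integral representations."""
    h = t_max / steps
    first = 0.0
    second = 0.0
    for i in range(steps + 1):
        t = i * h
        product = 1.0
        for p in probabilities:
            product *= 1.0 - math.exp(-p * t)
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end-point weights
        first += weight * (1.0 - product)
        second += weight * 2.0 * t * (1.0 - product)
    return first * h, second * h

# Three uniform coupons: the exact values are E[T] = 5.5, E[T(T+1)] = 42.5, V[T] = 6.75.
mean, rising2 = moment_integrals([1 / 3] * 3)
variance = rising2 - mean - mean**2
print(mean, rising2, variance)
```

The exact values quoted in the comment follow from writing the collection time as a sum of independent geometric waiting times with success probabilities 1, 2/3 and 1/3.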
Observation. It is straightforward for one to check that the same result holds for all the rising moments of the random variable $T_N$. Having (10) and (11) it is easy to obtain leading asymptotics for the variance of $T_N$. Using the formula
$$V[T_N] = E[T_N(T_N+1)] - E[T_N] - E[T_N]^2$$
we get (12) as $N \to \infty$. The previous results drive us to normalize $T_N$ as
[TABLE]
where, and are given in (14), and by a well known theorem (see, e.g., [5]) one obtains the final result of Theorem 1 (i.e., the r.v. converges in distribution to a standard Gumbel r.v.). We remind the reader that in the classic version of the problem (namely, the case of one class of uniformly distributed coupons) the corresponding limiting theorem is due to P. Erdős and A. Rényi:
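$$\lim_{N \to \infty} P\left\{ \frac{T_N - N \ln N}{N} \le y \right\} = \exp\left(-e^{-y}\right), \qquad y \in \mathbb{R}, \tag{40}$$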
see [7], while for the case where the coupon probabilities are distributed according to the Zipf law we have the following theorem (see [3] and [5])
[TABLE]
To support the above limiting results let us consider the following
Example. Recall that the coupon probabilities satisfy (1)-(2), where one subsequence is constant and the other is the generalized Zipf sequence of (3). Let us compute the minimum number of trials, so that with probability $0.90$ we get a complete set of all different types of coupons when $N = 100$ and $p = 1$.
We have . Hence, . Assume that the answer is trials. By (13) we have
[TABLE]
Solving $\exp(-e^{-y}) = 0.90$ gives $y = -\ln(-\ln 0.90) \approx 2.2504$ for the normalized variable. Thus, with probability $0.90$ one needs at least 11,996 trials to collect all different types of coupons.
Now, let us compare our results with the classic version of the problem when all the different coupons are distributed according to the standard Zipf law. In this case we have from (41): Hence, . Assume that the answer is trials. We have
[TABLE]
Similarly, with probability $0.90$ one needs at least 2,765 trials to collect all different types of coupons.
Finally, suppose that all the different coupons are uniformly distributed. Hence, (40) yields that with probability $0.90$ at least 686 trials are needed.
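The three trial counts of the example can be reproduced with a short computation. This is a sketch under stated assumptions rather than the paper's own calculation: it takes $N = 100$, $p = 1$ and probability $0.90$, and it assumes the normalizing sequences $M^{p+1}(\ln M - \ln\ln M - \ln p)$ with scale $M^{p+1}$, $M = N/2$, for the mixed case, $A_N N^p(\ln N - \ln\ln N - \ln p)$ with scale $A_N N^p$ for the pure Zipf case, and the Erdős–Rényi normalization $N \ln N$ with scale $N$ for the uniform case. These choices are not read off the displayed formulas, but they reproduce the counts quoted above.

```python
import math

def gumbel_quantile(prob):
    """Solve exp(-exp(-y)) = prob for y (standard Gumbel quantile)."""
    return -math.log(-math.log(prob))

def trials_needed(location, scale, prob):
    """Smallest integer t with t >= location + scale * y, where y is the Gumbel quantile."""
    return math.ceil(location + scale * gumbel_quantile(prob))

N, p, prob = 100, 1.0, 0.90  # assumed values of the example
M = N // 2

# Mixed uniform/Zipf case (assumed scale M^(p+1)).
scale_mixed = M ** (p + 1)
loc_mixed = scale_mixed * (math.log(M) - math.log(math.log(M)) - math.log(p))
mixed = trials_needed(loc_mixed, scale_mixed, prob)

# Pure Zipf case (assumed scale A_N * N^p, with A_N the sum of j^(-p) for j up to N).
A_N = sum(j ** (-p) for j in range(1, N + 1))
scale_zipf = A_N * N ** p
loc_zipf = scale_zipf * (math.log(N) - math.log(math.log(N)) - math.log(p))
zipf = trials_needed(loc_zipf, scale_zipf, prob)

# Classical uniform case: location N ln N, scale N.
uniform = trials_needed(N * math.log(N), N, prob)

print(mixed, zipf, uniform)  # 11996 2765 686
```

The ordering of the three counts shows how strongly the uniform subcollection inflates the collection time relative to both pure cases.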
References
- [1] C.M. Bender and S.A. Orszag, Advanced Mathematical Methods for Scientists and Engineers I: Asymptotic Methods and Perturbation Theory, Springer-Verlag, New York, 1999.
- [2] B.F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, Benchmarking cloud serving systems with YCSB, Proc. 1st ACM Symp. Cloud Comput. (2010) pp. 143–154.
- [3] A.V. Doumas and V.G. Papanicolaou, The Coupon Collector’s Problem Revisited: Asymptotics of the Variance, Adv. Appl. Prob. 44 (1) (2012) 166–195.
- [4] A.V. Doumas and V.G. Papanicolaou, Asymptotics of the rising moments for the Coupon Collector’s Problem, Electron. J. Probab. Vol. 18 (Article no. 41) (2012) 1–15.
- [5] A.V. Doumas and V.G. Papanicolaou, The Coupon Collector’s Problem Revisited: Generalizing the Double Dixie Cup Problem of Newman and Shepp, ESAIM: Probability and Statistics, 20 (2016) 367–399 (DOI: http://dx.doi.org/10.1051/ps/2016016).
- [6] A.V. Doumas and V.G. Papanicolaou, Sampling from a Mixture of Different Groups of Coupons, arXiv:1709.04500 [math.PR], https://arxiv.org/abs/1709.04500.
- [7] P. Erdős and A. Rényi, On a classical problem of probability theory, Magyar. Tud. Akad. Mat. Kutató Int. Közl., 6 (1961), 215–220.
- [8] W. Feller, An Introduction to Probability Theory and Its Applications, Vol. I & II, John Wiley & Sons, Inc., New York, 1966.
