Minimax Risk for Missing Mass Estimation
Nikhilesh Rajaraman, Andrew Thangaraj, Ananda Theertha Suresh

TL;DR
This paper analyzes the minimax risk in missing mass estimation, providing bounds for the worst-case risk of the Good-Turing estimator and establishing a lower bound for the minimax risk, with implications for practical and theoretical applications.
Contribution
It presents the first known bounds on the minimax risk for missing mass estimation, including the worst-case risk of the Good-Turing estimator and a new lower bound.
Findings
Good-Turing estimator risk between 0.6080/n and 0.6179/n
Minimax risk lower bounded by 0.25/n
First published minimax risk bounds for missing mass estimation
Abstract
The problem of estimating the missing mass or total probability of unseen elements in a sequence of random samples is considered under the squared error loss function. The worst-case risk of the popular Good-Turing estimator is shown to be between 0.6080/n and 0.6179/n. The minimax risk is shown to be lower bounded by 0.25/n. This appears to be the first such published result on minimax risk for estimation of missing mass, which has several practical and theoretical applications.
Minimax Risk for Missing Mass Estimation
Nikhilesh Rajaraman, Andrew Thangaraj
Department of Electrical Engineering
Indian Institute of Technology Madras
Chennai 600036, India
Ananda Theertha Suresh
Google Research
New York, USA
I Introduction
Given independent samples from an unknown distribution, missing mass estimation asks for the total probability of the elements that do not appear in the sample. Missing mass estimation is a basic problem in statistics with applications in fields ranging from language modeling [1, 2] to ecology [3]. Perhaps the most widely used missing mass estimator is the Good-Turing estimator, proposed in a seminal 1953 paper by I. J. Good, who credits the idea to Alan Turing [4]. The Good-Turing estimator is used in support estimators [3], entropy estimators [5] and unseen species estimators [6]. To describe the estimator and the results, we need a modicum of nomenclature.
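Let $p$ be an underlying unknown distribution over an unknown domain $\mathcal{X}$. Let $X^n \triangleq (X_1, X_2, \ldots, X_n)$ be $n$ independent samples from $p$. For $x \in \mathcal{X}$, let $N_x(X^n)$ be the number of appearances of $x$ in $X^n$. Upon observing $X^n$, our goal is to estimate the missing mass
$$M_0(X^n) \triangleq \sum_{x \in \mathcal{X}} p_x \, I(N_x(X^n) = 0),$$
where $I(\cdot)$ denotes the indicator function. For example, if $\mathcal{X} = \{a, b, c, d\}$ and $X^3 = (b, c, b)$, then $M_0(X^3) = p_a + p_d$. The above sampling model for estimation is termed the multinomial model. We note that $1 - M_0(X^n)$ is often referred to as sample coverage in the literature [7].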
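The definition of missing mass can be made concrete with a minimal Python sketch. It assumes the distribution is known and given as a dictionary from symbols to probabilities, which is only possible in a simulation; the function name `missing_mass` and the numerical probabilities are hypothetical choices for illustration.

```python
from collections import Counter

def missing_mass(p, sample):
    """Total probability of the symbols of `p` that do not occur in `sample`."""
    counts = Counter(sample)
    return sum(prob for symbol, prob in p.items() if counts[symbol] == 0)

# Hypothetical distribution over four symbols and a sample of size 3.
p = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
sample = ["b", "c", "b"]
m0 = missing_mass(p, sample)  # symbols a and d are unseen, so this is p_a + p_d
```

In practice the distribution is unknown, so this quantity cannot be computed from the sample alone and has to be estimated.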
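An estimator for missing mass is a mapping $\hat{M}_0 : \mathcal{X}^n \to [0, 1]$. For a distribution $p$, the risk of the estimator $\hat{M}_0$ is
$$R_n(\hat{M}_0, p) \triangleq E_{X^n \sim p}\left[\left(\hat{M}_0(X^n) - M_0(X^n)\right)^2\right],$$
the worst-case risk over all distributions is
$$R_n(\hat{M}_0) \triangleq \max_{p} R_n(\hat{M}_0, p),$$
and the minimax mean squared loss, or minimax risk, is
$$R_n^* \triangleq \min_{\hat{M}_0} R_n(\hat{M}_0).$$
The goal of this paper is to characterize $R_n^*$.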
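For very small alphabets and sample sizes, the squared error risk of an estimator can be computed exactly by enumerating every possible sample. The sketch below illustrates the definitions only: the helper names are hypothetical, the enumeration is exponential in the sample size, and the worst case is taken over a finite list of candidate distributions rather than over all distributions.

```python
from itertools import product
from math import prod

def missing_mass(p, sample):
    """Total probability of the symbols of `p` that do not occur in `sample`."""
    seen = set(sample)
    return sum(prob for symbol, prob in p.items() if symbol not in seen)

def exact_risk(estimator, p, n):
    """E[(estimator(X^n) - M_0(X^n))^2], enumerating all |X|^n samples."""
    risk = 0.0
    for sample in product(p, repeat=n):
        weight = prod(p[x] for x in sample)  # probability of this sample
        err = estimator(sample) - missing_mass(p, sample)
        risk += weight * err * err
    return risk

def worst_case_risk(estimator, candidates, n):
    """Largest risk over a finite (hypothetical) list of distributions."""
    return max(exact_risk(estimator, p, n) for p in candidates)

def always_zero(sample):
    """A deliberately naive estimator that ignores the sample."""
    return 0.0

fair = {"a": 0.5, "b": 0.5}
risk_n1 = exact_risk(always_zero, fair, 1)  # missing mass is always 1/2 here
```

With one sample from a fair two-symbol distribution, exactly one symbol is unseen, so the naive estimator incurs a squared error of 1/4 on every sample.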
I-A Good-Turing estimator and previous results
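Let
$$\phi_l(X^n) \triangleq \sum_{x \in \mathcal{X}} I(N_x(X^n) = l)$$
denote the number of symbols that have appeared $l$ times in $X^n$, $1 \le l \le n$. For example, if $X^3 = (b, c, b)$, then $\phi_1(X^3) = \phi_2(X^3) = 1$ and $\phi_l(X^3) = 0$ for all $l \ge 3$. The Good-Turing estimator [4] for the missing mass is
$$M_0^{\mathrm{GT}}(X^n) \triangleq \frac{\phi_1(X^n)}{n}.$$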
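The Good-Turing estimate of the missing mass is the number of symbols seen exactly once divided by the sample size. A minimal Python sketch follows; the function names are hypothetical, and the sample is assumed to be a non-empty list of hashable symbols.

```python
from collections import Counter

def prevalences(sample):
    """Map l to the number of distinct symbols that appear exactly l times."""
    return Counter(Counter(sample).values())

def good_turing_missing_mass(sample):
    """Number of symbols seen exactly once, divided by the sample size."""
    return prevalences(sample)[1] / len(sample)

sample = ["b", "c", "b"]
phi = prevalences(sample)                    # c appears once, b appears twice
estimate = good_turing_missing_mass(sample)  # one singleton out of three samples
```

Note that the estimator uses only the sample, never the underlying distribution or the size of the domain.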
One of the first theoretical analyses of the Good-Turing estimator was in [8], where it was shown that
$$\left| E\left[ M_0^{\mathrm{GT}}(X^n) - M_0(X^n) \right] \right| \le \frac{1}{n}.$$
This shows that the bias of the Good-Turing estimator falls as $1/n$. They further showed that with probability at least $1 - \delta$,
[TABLE]
Various properties of the Good-Turing estimator and several variations of it have been analyzed for distribution estimation and compression [9, 10, 11, 12, 13, 14, 15]. Several concentration results on missing mass estimation are also known [16, 17]. Despite all this work, the risk of the Good-Turing estimator and the minimax risk of missing mass estimation have still not been conclusively established.
I-B New results
Unlike parameters of a distribution, missing mass itself is a function of the observed sample and that makes finding the exact minimax risk difficult.
We first analyze the risk of the Good-Turing estimator and show that for any distribution $p$,
[TABLE]
where $\phi_l$ is abbreviated notation for $\phi_l(X^n)$. By maximizing the RHS in the first equation above over all distributions, in Theorem 4, we show that
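$$\frac{0.6080}{n} + o\left(\frac{1}{n}\right) \le R_n(M_0^{\mathrm{GT}}) \le \frac{0.6179}{n} + o\left(\frac{1}{n}\right).$$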
We note that under the multinomial model, the numbers of occurrences of symbols are correlated, and this makes finding the worst case distribution for the Good-Turing estimator difficult.
We then prove estimator-independent information-theoretic lower bounds on $R_n^*$ using two approaches. We first compute the lower bound via the Dirichlet prior approach [18]. In Lemma 7, we show that
[TABLE]
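We then improve the constant by reducing the problem of missing mass estimation to that of distribution estimation. In particular, in Theorem 11, we show that
$$R_n^* \ge \frac{0.25}{n} + o\left(\frac{1}{n}\right).$$
Combining the lower and upper bounds, we get
$$\frac{0.25}{n} + o\left(\frac{1}{n}\right) \le R_n^* \le \frac{0.6179}{n} + o\left(\frac{1}{n}\right).$$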
Finding the exact minimax risk for the missing mass estimation problem remains an open question.
The rest of the paper is organized as follows. In Section II, we analyze the Good-Turing estimator. In Section III-A, we use the Dirichlet prior approach to obtain lower bounds, and in Section III-B we obtain lower bounds via reduction.
II Risk of Good-Turing Estimator
The analysis of [8] can be extended to characterize the risk of the Good-Turing estimator for missing mass. The squared error of the Good-Turing estimator can be written down as follows:
[TABLE]
For , . Using the notation , we get
[TABLE]
The probability can be written down as
[TABLE]
where and . The summation in (4) is first split into two cases: and . Denoting , we have, for ,
[TABLE]
For , observe that . Using the above observations, the summation in (4) simplifies to
[TABLE]
The following lemma is useful in bounding certain terms in the first summation above as a function of , independent of the unknowns and .
Lemma 1.
For , ,
[TABLE]
Proof:
Let and be a pair of independent and identical random variables with marginal distribution . Define a random variable , whose value for and, for ,
[TABLE]
We see that is a probability for , and that it takes values in in all cases. Therefore, its expectation
[TABLE]
which concludes the proof. ∎
A useful univariate version of Lemma 1 is the following.
Lemma 2.
For ,
[TABLE]
Proof:
For , define and follow the proof of Lemma 1. ∎
Using Lemma 1, observe that
[TABLE]
Therefore, the risk can be written as
[TABLE]
The summation terms above can be rewritten as follows:
[TABLE]
where follows using Lemma 2.
[TABLE]
Using the above expressions in (8), we get the following characterization of the risk.
Theorem 3.
The risk of the Good-Turing estimator under squared error loss satisfies
[TABLE]
II-A Upper bound on risk
To obtain a tight upper bound on the risk, we start with the following upper bound on one of the terms in (8):
[TABLE]
where the first step follows because for a fraction , and the second step follows because for . Using (9), (10) and (13) in (8), an upper bound for the risk of the Good-Turing estimator is
[TABLE]
where the last step follows because $x(1 - x) \le 1/4$ for a fraction $x \in [0, 1]$. The constant in (14) is not the best possible and could be improved marginally by a more careful analysis. However, a lower bound on $R_n(M_0^{\mathrm{GT}})$, obtained by picking $p$ to be a suitable uniform distribution, shows that the improvement cannot be significant.
II-B Lower bound on the Good-Turing worst-case risk
A lower bound on the worst-case risk of the Good-Turing estimator can be obtained by evaluating the risk for the uniform distribution on $\mathcal{X}$. Let $|\mathcal{X}| = cn$ and $p_x = \frac{1}{cn}$ for all $x \in \mathcal{X}$, where $c$ is a positive constant. Using (8), we get
[TABLE]
where the reasoning for the steps is as follows:
- a) replacing with .
- b) using the fact that .
The coefficient of $1/n$ in (15) can be maximized numerically to obtain a maximum value of $0.6080$, attained for a uniform distribution on roughly $1.173n$ symbols. Hence, from (14) and (15), we have:
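Theorem 4.
The worst-case risk of the Good-Turing estimator satisfies the following bounds:
$$\frac{0.6080}{n} + o\left(\frac{1}{n}\right) \le R_n(M_0^{\mathrm{GT}}) \le \frac{0.6179}{n} + o\left(\frac{1}{n}\right).$$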
Therefore, the constant in (14) is fairly tight.
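The two constants can be reproduced numerically under stated assumptions. The sketch below does not reproduce the paper's expression (15). It assumes that, for a uniform distribution on $cn$ symbols, $E[\phi_1]/n$ tends to $e^{-1/c}$ and $2E[\phi_2]/n$ tends to $e^{-1/c}/c$, and that the leading coefficient of $1/n$ in the risk is $e^{-1/c}/c + e^{-1/c}(1 - e^{-1/c})$. This form is an assumption, chosen because its maximum matches the stated value of 0.6080. The observation that $1/4 + 1/e \approx 0.6179$ is likewise an inference about where the upper constant comes from, not a statement taken from the source.

```python
import math

def uniform_coefficient(c):
    """Assumed limit of n * risk for a uniform distribution on c*n symbols."""
    q = math.exp(-1.0 / c)        # assumed limit of E[phi_1] / n
    return q / c + q * (1.0 - q)  # q / c is the assumed limit of 2 E[phi_2] / n

# Plain grid search over c in [0.5, 3.5] with step 1e-4.
grid = (0.5 + 1e-4 * i for i in range(30001))
best_c = max(grid, key=uniform_coefficient)
lower_constant = uniform_coefficient(best_c)  # about 0.6080
upper_constant = 0.25 + math.exp(-1.0)        # about 0.6179 (assumed origin)
```

Under these assumptions the maximizing value is close to $c = 1.173$, and the gap between the two constants is below 0.01.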
III Lower Bounds on the Minimax Risk
In this section, we consider lower bounds on the squared error risk of an arbitrary estimator of missing mass. The main result is that the minimax risk is lower-bounded by $c/n$ for a constant $c$. Two methods are described for finding lower bounds: the first is a Dirichlet prior approach, and the second is a reduction of the missing mass problem to a distribution estimation problem.
Both approaches provide the same order of $1/n$ for the lower bound, but the reduction approach provides a better constant. However, the Dirichlet prior approach has significant potential for further optimization for better constants, and is an interesting extension of the standard prior method to the estimation of random variables such as missing mass, which depend on both the distribution $p$ and the sample $X^n$.
III-A Lower Bounds via Prior Distributions
The first approach is to bound the minimax risk from below by the average risk obtained by averaging over a family of distributions with a prior. Let $P$ be a random variable over a family of distributions $\mathcal{P}$, having an alphabet $\mathcal{X}$. In this section, the missing mass will be denoted as $M_0(X^n, P)$ to explicitly show the dependence on the distribution $P$.
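Lemma 5.
For any missing mass estimator $\hat{M}_0$ and a random variable $P$ over a family of distributions $\mathcal{P}$,
$$R_n(\hat{M}_0) \ge E\left[\left(E[M_0(X^n, P) \mid X^n] - M_0(X^n, P)\right)^2\right].$$
Proof:
$$\begin{aligned}
R_n(\hat{M}_0) &\ge E_P \, E_{X^n \mid P}\left[\left(\hat{M}_0(X^n) - M_0(X^n, P)\right)^2\right] \\
&\overset{(a)}{=} E_{X^n} \, E_{P \mid X^n}\left[\left(\hat{M}_0(X^n) - M_0(X^n, P)\right)^2\right] \\
&\overset{(b)}{\ge} E_{X^n} \, E_{P \mid X^n}\left[\left(E[M_0(X^n, P) \mid X^n] - M_0(X^n, P)\right)^2\right],
\end{aligned}$$
where (a) follows from the law of total expectation and (b) follows from the fact that the expression in (a) is minimized when $\hat{M}_0(X^n) = E[M_0(X^n, P) \mid X^n]$. ∎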
Lemma 5 gives us a family of bounds depending on the distribution of the prior $P$. The RHS in Lemma 5 can be computed exactly for a Dirichlet prior with some analysis.
Lemma 6.
Suppose has a Dirichlet distribution , where . Then, we have
[TABLE]
where is the Beta function and .
We skip the details for want of space.
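A prior-based lower bound of the kind in Lemma 5 can also be approximated by simulation. The sketch below is not the closed form of Lemma 6. It assumes a symmetric prior Dirichlet$(\alpha, \ldots, \alpha)$ on $k$ symbols; the parameters `alpha` and `k` and the function name are hypothetical. It relies on two standard facts: the posterior of a Dirichlet prior after multinomial sampling is Dirichlet with the counts added to the parameters, and a sum of Dirichlet coordinates is Beta distributed. Conditioned on the sample, the missing mass is therefore Beta$(a, b)$ with $a = \alpha \cdot (\text{number of unseen symbols})$ and $b = \alpha \cdot (\text{number of seen symbols}) + n$, and the bound is the average of its conditional variance $ab / ((a+b)^2 (a+b+1))$.

```python
import random

def dirichlet_bayes_risk(n, k, alpha, trials=2000, seed=0):
    """Monte Carlo estimate of E[Var(M_0 | X^n)] under a symmetric Dirichlet prior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Unnormalized Dirichlet(alpha, ..., alpha) draw via independent Gammas.
        weights = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        # n i.i.d. samples from the drawn distribution.
        sample = rng.choices(range(k), weights=weights, k=n)
        unseen = k - len(set(sample))
        a = alpha * unseen            # posterior Beta parameter of the unseen block
        b = alpha * (k - unseen) + n  # posterior Beta parameter of the seen block
        s = a + b
        total += a * b / (s * s * (s + 1.0))  # variance of Beta(a, b)
    return total / trials

# Hypothetical parameters: 50 samples, 60 symbols, uniform prior on the simplex.
bound = dirichlet_bayes_risk(n=50, k=60, alpha=1.0, trials=200)
```

Since every estimator has worst-case risk at least its average risk under the prior, the returned value is a simulated lower bound on the minimax risk for these parameters, up to Monte Carlo error.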
Let and . For this choice of parameters, the expression in Lemma 6 can be bounded as
[TABLE]
where, once again, we skip the details. The coefficient of attains a maximum value of when , which results in the following bound on the minimax risk:
Lemma 7.
[TABLE]
The bound is worse than the bound obtained from distribution estimation in the next section, but it can possibly be improved by better selection of the prior.
III-B Lower bounds via Distribution Estimation
To bound the minimax risk for missing mass estimation, one approach is to reduce the problem to that of estimating a distribution. Let be the set of distributions over the set such that for all , . A known result (refer [19, 20] for instance) states that the minimax loss in estimating is . More precisely, let be an estimator for from a random sample distributed according to . Then, we have
Lemma 8.
[TABLE]
For an arbitrary positive integer , let be the set of distributions over the set , such that for any , we have and for all . We can use Lemma 8 to obtain minimax bounds in estimating for this family of distributions as well. Let be an estimator for from a random sample distributed according to . Let be the probability assigns to the symbol .
Lemma 9.
[TABLE]
Proof:
Suppose we want to estimate an unknown distribution and we have an estimator for distributions in . Then we can use to estimate as follows. Take the observed sample distributed according to , and if it is 0, keep it as it is. If it is 1, then replace it with a uniformly sampled random variable over . The result of this sampling process is a distribution in with . Thus, any estimator for distributions in can be reduced to an estimator for distributions in and
[TABLE]
and the proof follows from Lemma 8. ∎
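The resampling step in the proof of Lemma 9 can be sketched in Python. The sketch assumes that the original samples take values in $\{0, 1\}$ and that the larger alphabet is $\{0, 1, \ldots, k\}$, with each 1 replaced by an independent uniform draw from $\{1, \ldots, k\}$; the name `expand_sample` and the concrete alphabet are assumptions made for illustration.

```python
import random

def expand_sample(bits, k, rng=random):
    """Keep each 0; replace each 1 by an independent uniform draw from {1, ..., k}."""
    return [0 if b == 0 else rng.randint(1, k) for b in bits]

rng = random.Random(0)
bits = [0, 1, 1, 0, 1]
expanded = expand_sample(bits, k=4, rng=rng)  # zeros stay in place
```

The transformed sample assigns symbol 0 the same probability as before and spreads the remaining mass uniformly over the other $k$ symbols, so an estimator designed for the larger family can be run on it.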
Lemma 10.
Let . With probability at least , the missing mass satisfies
[TABLE]
Proof:
Probability of symbol [math] appearing at least once in is . Furthermore, at most distinct symbols from can appear in . Hence, with probability , the observed mass satisfies
[TABLE]
and hence follows the lemma. ∎
From Lemmas 9 and 10, we can obtain a lower bound of $\frac{0.25}{n} + o\left(\frac{1}{n}\right)$ on the minimax risk of missing mass estimation. Combining the lower bound with the upper bound on the risk of the Good-Turing estimator from Theorem 4, we have the following:
Theorem 11.
The minimax risk of missing mass estimation, denoted $R_n^*$, satisfies the following bounds:
$$\frac{0.25}{n} + o\left(\frac{1}{n}\right) \le R_n^* \le \frac{0.6179}{n} + o\left(\frac{1}{n}\right).$$
IV Summary and Future Directions
We studied the problem of missing mass estimation and showed that the minimax risk lies between $0.25/n$ and $0.6179/n$. We further showed that the worst-case risk of the Good-Turing estimator lies between $0.6080/n$ and $0.6179/n$.
Our results pose several interesting questions for future work. Two natural questions are: (1) are there priors which yield better lower bounds on the minimax risk of missing mass? and (2) are there estimators that have better risk than the Good-Turing estimator?
We finally remark that it might be interesting to see if the minimax risk results imply better concentration results for the missing mass and the Good-Turing estimator.
V Acknowledgements
The authors thank Alon Orlitsky for helpful discussions. Ananda Theertha Suresh thanks Jayadev Acharya for helpful comments.
References
- [1] W. A. Gale and G. Sampson, “Good-Turing frequency estimation without tears,” Journal of Quantitative Linguistics, vol. 2, no. 3, pp. 217–237, 1995.
- [2] S. F. Chen and J. Goodman, “An empirical study of smoothing techniques for language modeling,” in Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, ser. ACL ’96, 1996, pp. 310–318.
- [3] A. Chao and S.-M. Lee, “Estimating the number of classes via sample coverage,” Journal of the American Statistical Association, vol. 87, no. 417, pp. 210–217, 1992.
- [4] I. J. Good, “The population frequencies of species and the estimation of population parameters,” Biometrika, vol. 40, no. 3-4, pp. 237–264, 1953.
- [5] V. Q. Vu, B. Yu, and R. E. Kass, “Coverage-adjusted entropy estimation,” Statistics in Medicine, vol. 26, no. 21, pp. 4039–4060, 2007.
- [6] T.-J. Shen, A. Chao, and C.-F. Lin, “Predicting the number of new species in further taxonomic sampling,” Ecology, vol. 84, no. 3, pp. 798–804, 2003.
- [7] R. K. Colwell, A. Chao, N. J. Gotelli, S.-Y. Lin, C. X. Mao, R. L. Chazdon, and J. T. Longino, “Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages,” Journal of Plant Ecology, vol. 5, no. 1, pp. 3–21, 2012.
- [8] D. A. McAllester and R. E. Schapire, “On the convergence rate of Good-Turing estimators,” in Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, ser. COLT ’00. Morgan Kaufmann Publishers Inc., 2000, pp. 1–6.
