Consistent estimation of the missing mass for feature models
Fadhel Ayed, Marco Battiston, Federico Camerlenghi, Stefano Favaro

TL;DR
This paper investigates the challenge of estimating the unseen features in feature models, proving the non-existence of a universal consistent estimator and demonstrating the consistency of a specific estimator under heavy-tailed probabilities.
Contribution
It establishes the impossibility of universal consistent estimation of the missing mass and shows the consistency of a particular estimator for heavy-tailed feature probabilities.
Findings
No universally consistent estimator exists for the missing mass.
The estimator by Ayed et al. (2017) is strongly consistent for heavy-tailed probabilities.
Derived concentration inequalities for missing mass and feature frequency counts.
Fadhel Ayed, University of Oxford
Marco Battiston, Lancaster University
Federico Camerlenghi, University of Milano–Bicocca
Stefano Favaro, University of Torino
Abstract
Feature models are popular in machine learning and they have been recently used to solve many unsupervised learning problems. In these models every observation is endowed with a finite set of features, usually selected from an infinite collection $(F_j)_{j\geq 1}$. Every observation can display feature $F_j$ with an unknown probability $p_j$. A statistical problem inherent to these models is how to estimate, given an initial sample, the conditional expected number of hitherto unseen features that will be displayed in a future observation. This problem is usually referred to as the missing mass problem. In this work we prove that, using a suitable multiplicative loss function and without imposing any assumptions on the parameters $p_j$, there does not exist any universally consistent estimator for the missing mass. In the second part of the paper, we focus on a special class of heavy-tailed probabilities $(p_j)_{j\geq 1}$, which are common in many real applications, and we show that, within this restricted class of probabilities, the nonparametric estimator of the missing mass suggested by Ayed et al. (2017) is strongly consistent. As a byproduct, we derive concentration inequalities for the missing mass and the number of features observed with a specified frequency in a sample of size $n$.
Keywords: Feature models; missing mass; multiplicative consistency; regular variation; nonparametric estimator.
1 Introduction
Feature models generalize species sampling models by allowing every observation to belong to more than one species, now called features. In particular, every observation is endowed with a finite set of features selected from a (possibly infinite) collection of features $(F_j)_{j\geq 1}$. Every feature $F_j$ is associated with an unknown probability $p_j$, and each observation displays feature $F_j$ with probability $p_j$. We may conveniently represent each observation as a binary sequence, whose entries indicate the presence (1) or absence (0) of each feature. Feature models were first applied in ecology for modeling incidence vectors that record the presence or absence of species in traps (Colwell et al. (2012) and Chao et al. (2014)), and more recently in several fields of the biosciences, such as the study of genetic variation and protein interactions (Chu et al. (2006), Ionita-Laza et al. (2009), Ionita-Laza et al. (2010) and Zou et al. (2016)). They have also found applications in the analysis of choice behaviour arising from psychology, marketing and computer science (Görür et al. (2006)); in binary matrix factorization for modeling dyadic data to design recommender systems (Meeds et al. (2007)); in graphical models (Wood et al. (2006) and Wood & Griffiths (2007)); in cognitive psychology for the analysis of similarity judgement matrices (Navarro & Griffiths (2007)); in independent component analysis and sparse factor analysis (Knowles & Ghahramani (2007)); and in link prediction using network data (Miller et al. (2010)).
The Bernoulli product model is arguably the most popular feature model. It assumes that the $i$-th observation is a sequence $Y_i = (Y_{i,j})_{j\geq 1}$ of independent Bernoulli random variables with unknown success probabilities $(p_j)_{j\geq 1}$, and that $Y_r$ is independent of $Y_s$ for any $r \neq s$. Therefore $X_{n,j} := \sum_{1\leq i\leq n} Y_{i,j}$, namely the number of times that feature $F_j$ has been observed in a sample $(Y_1,\ldots,Y_n)$, is a Binomial random variable with parameter $(n,p_j)$ for any $j\geq 1$. Recently, the Bernoulli product model has been extensively applied to the fundamental problem of discovering genetic variation in human populations. See, e.g., Ionita-Laza et al. (2009), Zou et al. (2016) and references therein. In such a context, interest is in estimating the conditional expected number, given a sample $(Y_1,\ldots,Y_n)$, of hitherto unseen features that would be observed if an additional sample $Y_{n+1}$ was collected, namely
$$M_n(Y_1,\ldots,Y_n;(p_j)_{j\geq 1}) = E\Big[\sum_{j\geq 1} \mathbb{1}\{X_{n,j}=0\}\,\mathbb{1}\{Y_{n+1,j}=1\} \,\Big|\, Y_1,\ldots,Y_n\Big] = \sum_{j\geq 1} p_j\, \mathbb{1}\{X_{n,j}=0\}, \tag{1}$$
where $\mathbb{1}$ is the indicator function. The statistic $M_n(Y_1,\ldots,Y_n;(p_j)_{j\geq 1})$ is referred to as the missing mass, i.e. the sum of the probability masses of unobserved features in a sample of size $n$. In genetics, interest in estimating (1) is motivated by the ambitious prospect of growing databases to encompass hundreds of thousands of genomes, which makes it important to quantify the power of large sequencing projects to discover new genetic variants (Auton et al. (2015)). An accurate estimate of the missing mass provides a quantitative evaluation of the potential and limitations of these datasets, providing a roadmap for large-scale sequencing projects.
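As a rough illustration of (1), the following Python sketch simulates a Bernoulli product model and evaluates the missing mass of the simulated sample. The truncation to finitely many features, the choice of probabilities proportional to $1/j^2$, and the function names are assumptions made only for this example; they do not come from the paper.

```python
import numpy as np


def simulate_feature_counts(p, n, rng):
    """Return, for each feature j, how many of the n observations display it.

    Summing n independent Bernoulli(p_j) indicators per feature has the same
    law as a single Binomial(n, p_j) draw, which is what is sampled here.
    """
    return rng.binomial(n, p)


def missing_mass(p, counts):
    """Total probability of the features that were never observed."""
    return float(np.sum(p[counts == 0]))


# Hypothetical setup: J features with summable probabilities p_j = 0.5 / j^2.
# Truncating to finitely many features is an assumption made for simulation.
J = 100_000
p = 0.5 / np.arange(1, J + 1, dtype=float) ** 2
rng = np.random.default_rng(0)

for n in (10, 100, 1_000, 10_000):
    counts = simulate_feature_counts(p, n, rng)
    print(n, missing_mass(p, counts))
```

In this simulation the true probabilities are known, so the missing mass can be evaluated exactly; in applications it is unobservable, which is why an estimator is needed.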
Let $\hat{T}_n(Y_1,\ldots,Y_n)$ denote an arbitrary estimator of $M_n(Y_1,\ldots,Y_n;(p_j)_{j\geq 1})$. For ease of notation, in the rest of the paper we will not highlight the dependence on $(Y_1,\ldots,Y_n)$ and $(p_j)_{j\geq 1}$, and we simply write $\hat{T}_n$ and $M_n$. Motivated by the recent works of Ohannessian & Dahleh (2012), Mossel & Ohannessian (2015), Ben-Hamou et al. (2017) and Ayed et al. (2018) on the estimation of the missing mass in species sampling models, in this paper we consider the problem of consistent estimation of $M_n$ under the Bernoulli product model. The classical notion of additive consistency, involving the large-$n$ limiting behaviour of $|\hat{T}_n - M_n|$, is not suitable in the context of the estimation of $M_n$. This is because $M_n \to 0$, as $n \to +\infty$, which implies that even the trivial estimator $\hat{T}_n = 0$ is an additively consistent estimator of the missing mass for any sequence $(p_j)_{j\geq 1}$. Hence, in such a framework, one should invoke a more adequate notion of consistency, which yields more informative results. This notion of consistency is based on the limiting behaviour of the multiplicative loss function
$$L(\hat{T}_n, M_n) = \left|\frac{\hat{T}_n}{M_n} - 1\right|. \tag{2}$$
More precisely, we say that the estimator $\hat{T}_n$ is multiplicative consistent for $M_n$ if $L(\hat{T}_n, M_n) \to 0$ as $n \to +\infty$, either almost surely or in probability. The multiplicative loss function has already been used in statistics, e.g. for the estimation of small value probabilities using importance sampling (Chatterjee & Diaconis (2018)) and for the estimation of tail probabilities in extreme value theory (Beirlant & Devroye (1999)). We show that there do not exist universally consistent estimators, in the multiplicative sense, of the missing mass $M_n$. That is, under the Bernoulli product model and the loss function (2), we prove that for any estimator $\hat{T}_n$ of $M_n$ there exists at least one choice of $(p_j)_{j\geq 1}$ for which $L(\hat{T}_n, M_n)$ does not converge to $0$ in probability, as $n \to +\infty$. The proof relies on non-trivial extensions of Bayesian nonparametric ideas and techniques developed by Ayed et al. (2018) for the estimation of the missing mass in species sampling models. In particular, the key argument makes use of a generalized Indian Buffet construction (James (2017)), which allows us to prove inconsistency by exploiting properties of the posterior distribution of $(p_j)_{j\geq 1}$. Our inconsistency result is the natural counterpart for feature models of the work of Mossel & Ohannessian (2015), showing the impossibility of estimating the missing mass without imposing any structural (distributional) assumption on the $p_j$'s. We complete our study by investigating the consistency of an estimator of $M_n$ recently proposed by Ayed et al. (2017). To the best of our knowledge this is the first nonparametric estimator of $M_n$, in the sense that its derivation does not rely on any distributional assumption on the $p_j$'s. We show that the estimator of Ayed et al. (2017) is strongly consistent, in the multiplicative sense, under the assumption that the tail of $(p_j)_{j\geq 1}$ decays to zero as a regularly varying function (Bingham et al. (1987)). The proof relies on novel concentration inequalities for $M_n$, as well as for related statistics, which are of independent interest.
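To see concretely why the additive criterion is uninformative while the multiplicative one is not, consider the trivial estimator that always returns zero. The display below is a worked example added for illustration; $T_n$ denotes this trivial estimator and $M_n$ the missing mass.

```latex
T_n \equiv 0:
\qquad |T_n - M_n| = M_n \longrightarrow 0 ,
\qquad \text{whereas} \qquad
\left| \frac{T_n}{M_n} - 1 \right| = 1
\quad \text{whenever } M_n > 0 .
```

The additive error vanishes although the estimator carries no information, while the multiplicative loss stays equal to one.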
The paper is structured as follows. In Section 2 we prove that for the Bernoulli product model there do not exist universally consistent estimators, in the multiplicative sense, of the missing mass $M_n$. Section 3 introduces some exponential tail bounds for $M_n$, as well as for related statistics, which are then applied in Section 4 to show that the estimator of $M_n$ in Ayed et al. (2017) is consistent under the assumption of regularly varying probabilities $p_j$'s.
2 Non-existence of universally consistent estimators of the missing mass
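Consider the Bernoulli product model described in the Introduction, where $Y_{i,j}$ indicates whether the $i$-th observation displays feature $F_j$. Without loss of generality, we assume that each feature is labeled by a value in $[0,1]$ and therefore $(F_j)_{j\geq 1}$ is a sequence of distinct points in $[0,1]$. Furthermore, the probabilities $(p_j)_{j\geq 1}$ are assumed to be summable, i.e. $\sum_{j\geq 1} p_j < +\infty$; this condition is needed in order to guarantee that every observation will display only a finite number of features almost surely. Indeed, $\sum_{j\geq 1} p_j < +\infty$ is equivalent to $E[\sum_{j\geq 1} Y_{i,j}] < +\infty$, which in turn implies $\sum_{j\geq 1} Y_{i,j} < +\infty$ almost surely, by the Tonelli-Fubini theorem. The two unknown sequences $(p_j)_{j\geq 1}$ and $(F_j)_{j\geq 1}$ can be uniquely encoded in a finite measure on $[0,1]$, $p = \sum_{j\geq 1} p_j \delta_{F_j}$, with all masses smaller than one. We can therefore consider as parameter space the set
$$\mathcal{P} := \Big\{ p = \sum_{j\geq 1} p_j \delta_{F_j} \,:\, F_j \in [0,1],\ p_j \in [0,1),\ \sum_{j\geq 1} p_j < +\infty \Big\}. \tag{3}$$
Recall that $X_{n,j}$ denotes the number of times that feature $F_j$ has been observed in the sample $(Y_1,\ldots,Y_n)$, that is $X_{n,j}$ is a Binomial random variable with parameter $(n,p_j)$. For a fixed $n$, an estimator of the missing mass is a measurable map $\hat{T}_n$ whose argument is the observed sample $(Y_1,\ldots,Y_n)$. We say that the estimator $\hat{T}_n$ is multiplicative consistent under the parameter space $\mathcal{P}$ if for every $p \in \mathcal{P}$ and every $\epsilon > 0$,
$$\lim_{n\to+\infty} P_p\left( \left| \frac{\hat{T}_n}{M_n} - 1 \right| > \epsilon \right) = 0, \tag{4}$$
where $P_p$ denotes the law of the observations under a feature allocation model of parameter $p$. Theorem 2.1 shows that there are no universally multiplicative consistent estimators of $M_n$ for the class $\mathcal{P}$. This means that for any estimator $\hat{T}_n$ of the missing mass, there exists at least one element $p \in \mathcal{P}$ for which $|\hat{T}_n/M_n - 1|$ does not converge to $0$ in probability, as $n \to +\infty$.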
Theorem 2.1
Under the feature allocation model, there are no universally consistent estimators, i.e. there are no estimators satisfying (4). In particular, for every estimator , it is possible to find an element such that for any
[TABLE]
for some strictly positive constant .
2.1 Proof of Theorem 2.1
In order to prove Theorem 2.1, it is enough to show that for every estimator and every ,
[TABLE]
and therefore there exists a for which is not consistent.
First, let us notice that, for every ,
[TABLE]
Indeed, if , then
[TABLE]
and, from the lower bound of (8), . Because , it follows that . This last inequality together with (8) leads to . Considering the complements of the two events, it follows that
[TABLE]
and, as a consequence, , proving (7). From now on, we will denote and prove that
[TABLE]
for some strictly positive constant .
The main idea of the proof is in the following formula and works as follows: we lower bound the supremum over in (7) by an average with respect to a (carefully chosen) prior for ; we swap the conditional distribution of and the marginal of with the conditional of and the marginal of ; we lower bound the event probability with respect to the posterior of given . Formally,
[TABLE]
where we have applied reverse Fatou’s lemma to take the outside the expectation. In (10), denotes the expectation with respect to the prior for , the expectation with respect to the marginal distribution of and the probability under the posterior of given .
Our choice of the nonparametric prior for is based on completely random measures (see Daley and Vere-Jones (2008)) and the generalized Indian Buffet process prior of James (2017). In particular, a prior for can be defined through a completely random measure on , where is a Poisson Point Process on , by setting . We select to be a completely random measure with Lévy intensity . The distribution of is completely characterized by its Laplace functional, defined as follows,
[TABLE]
for any measurable function . See also Kingman (1993).
Theorem 3.1 of James (2017) provides a distributional equality for the posterior of given . Denoting by the distinct features observed in , we have the following distributional equality
[TABLE]
where the ’s are non-negative random jumps and is an independent completely random measure with updated Lévy intensity .
Defining , from (11) we have that, for any Borel set in , the missing mass satisfies
[TABLE]
showing that the posterior distribution of the missing mass is equal in distribution to the random variable . Moreover, it is worth introducing the random variable
[TABLE]
whose distribution can be computed exactly and turns out to be a Gamma random variable of parameters . Indeed, from the Laplace functional, for every we have
[TABLE]
which is the characteristic function of a Gamma random variable.
We now have all the necessary ingredients to prove the lower bound (10). Fix . First note that the reverse triangle inequality entails
[TABLE]
which implies
[TABLE]
indeed, thanks to (13), the two events together
[TABLE]
imply that
[TABLE]
where the last inequality follows from the fact that . Hence, from (14), we have that
[TABLE]
which may be plugged into (10) to obtain
[TABLE]
We are going to lower bound separately the two terms on the r.h.s. of (15). With regard to the first term, let us observe that the elementary inequality , for , implies that for all
[TABLE]
Summing over ,
[TABLE]
and therefore,
[TABLE]
As a simple consequence of the last inequality, for any , the event implies the validity of and therefore we can upper bound the first term in (15) as follows
[TABLE]
where we have used the fact that the posterior distribution of is .
Let us now consider the second term on the r.h.s. of (15). Using again the fact that is Gamma distributed and , we have
[TABLE]
it is now easy to see that the function is strictly positive, continuous and admits a global minimum on at the point , therefore
[TABLE]
Using the two bounds (16) and (17) in (15), for any we get
[TABLE]
which completes the proof.
3 Concentration inequalities for feature models
In this section we will establish exponential tail bounds for the missing mass $M_n$ and the statistic $K_{n,r}$ defined by
$$K_{n,r} = \sum_{j\geq 1} \mathbb{1}\{X_{n,j} = r\},$$
which counts the number of features observed with frequency $r$ in the sample $(Y_1,\ldots,Y_n)$; here $X_{n,j}$ is the number of observations displaying feature $F_j$. The statistic $K_{n,r}$ is of interest in different applications of feature allocation models and its analysis will be important for the study of the estimator of the missing mass considered in Section 4, which involves $K_{n,1}$. The tail bounds we present in this section are valid in full generality, i.e. without any assumptions on the probability masses $(p_j)_{j\geq 1}$. In Section 4, we will use these results to prove consistency results under the assumption of regularly varying heavy tails $(p_j)_{j\geq 1}$.
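The frequency count just described can be computed directly from a binary incidence matrix. The sketch below is a minimal illustration; the toy matrix and the function name `frequency_count` are hypothetical and not taken from the paper.

```python
import numpy as np


def frequency_count(counts, r):
    """Number of features whose observed count equals r (intended for r >= 1)."""
    return int(np.count_nonzero(np.asarray(counts) == r))


# Toy incidence matrix: rows are observations, columns are features.
Y = np.array([
    [1, 0, 1, 0, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
])
counts = Y.sum(axis=0)               # per-feature counts: [3, 1, 1, 0, 1]
print(frequency_count(counts, 1))    # 3 features were seen exactly once
print(frequency_count(counts, 3))    # 1 feature was seen in all three rows
```

Only frequencies $r \geq 1$ are meaningful here, since with infinitely many features the number of unseen ones is not finite.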
In order to derive the concentration inequalities for we will use Chernoff bounds, which require suitable bounds on the log-Laplace transform. First, let us recall some definitions from Boucheron et al. (2013) and Ben-Hamou et al. (2017).
Definition 3.1
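Let $Z$ be a real-valued random variable defined on some probability space, then:
- i. $Z$ is sub-Gaussian on the right tail (resp. on the left tail) with variance factor $v$ if for any $\lambda \geq 0$ (resp. $\lambda \leq 0$)
$$\log E\, e^{\lambda (Z - E Z)} \leq \frac{v \lambda^2}{2}; \tag{18}$$
- ii. $Z$ is sub-Gamma on the right tail with variance factor $v$ and scale parameter $c$ if
$$\log E\, e^{\lambda (Z - E Z)} \leq \frac{\lambda^2 v}{2(1 - c\lambda)} \quad \text{for any } 0 \leq \lambda < 1/c; \tag{19}$$
- iii. $Z$ is sub-Gamma on the left tail with variance factor $v$ and scale parameter $c$ if $-Z$ is sub-Gamma on the right tail with variance factor $v$ and scale parameter $c$;
- iv. $Z$ is sub-Poisson with variance factor $v$ if for all $\lambda \in \mathbb{R}$
$$\log E\, e^{\lambda (Z - E Z)} \leq v\, \phi(\lambda), \tag{20}$$
where $\phi(\lambda) = e^{\lambda} - 1 - \lambda$.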
Note that a sub-Gaussian random variable is also sub-Gamma for any choice of the scale parameter $c > 0$, but in general the converse is not true. As we will see in the sequel, the bounds on the log-Laplace transform (18)–(19) imply exponential tail bounds by means of the Chernoff inequality. See Boucheron et al. (2013) for the details.
The following proposition shows that the missing mass is sub-Gaussian on the left tail and sub-Gamma on the right one.
Proposition 3.1
Let . On the left tail, the random variable is sub-Gaussian with variance factor , i.e. for any it holds
[TABLE]
On the right tail, the random variable is sub-Gamma with variance factor and scale parameter , i.e. for any one has
[TABLE]
Proof.
We first focus on the proof of (21). Let , exploiting the independence of the random variables ’s and the elementary inequality , valid for any , we obtain
[TABLE]
We observe that, being , one has:
[TABLE]
hence (21) has been proven.
We now concentrate on the proof of (22), arguing exactly as before we obtain that
[TABLE]
where we have used the infinite series representation for the exponential function. Fixing the useful notation
[TABLE]
and observing that , for any , we get
[TABLE]
for any . Proceeding along similar lines as in (Gnedin et al., 2007, Lemma 1), it is not difficult to see that
[TABLE]
which entails , for any . The last inequality can be used to provide an upper bound for the r.h.s. of (23) as follows
[TABLE]
and (22) has been now proved. ∎
As already mentioned at the beginning of this section, the sub-Gaussian and sub-Gamma bounds obtained in Proposition 3.1 imply useful exponential tail bounds for (see Boucheron et al. (2013)). More specifically we have that:
Corollary 3.1
For any and , the following hold
[TABLE]
Proof.
The two inequalities follow by the Chernoff bound and the log-Laplace bound proved in Proposition 3.1. This is a standard argument, see Boucheron et al. (2013) for details. ∎
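The standard argument referred to in this proof can be sketched as follows for the sub-Gaussian left tail. Here $Z$, $v$ and $x > 0$ are generic symbols chosen for this sketch and are not the specific quantities appearing in Corollary 3.1.

```latex
% Assumption: \log E e^{\lambda (Z - E Z)} \le v \lambda^2 / 2 for all \lambda \le 0.
\begin{aligned}
P(Z - E Z \le -x)
  &= P\!\left( e^{-\lambda (Z - E Z)} \ge e^{\lambda x} \right)
     && \text{for any } \lambda > 0 \\
  &\le e^{-\lambda x}\, E\, e^{-\lambda (Z - E Z)}
     && \text{(Markov's inequality)} \\
  &\le \exp\!\left( -\lambda x + \tfrac{v \lambda^2}{2} \right)
     && \text{(log-Laplace bound at } -\lambda \le 0\text{)} .
\end{aligned}
% The right-hand side is minimised at \lambda = x / v, which gives
P(Z - E Z \le -x) \le \exp\!\left( -\frac{x^2}{2 v} \right) .
```

The sub-Gamma case is handled in the same way: optimising over $\lambda \in (0, 1/c)$ yields the classical bound $P(Z - EZ > \sqrt{2vt} + ct) \leq e^{-t}$ for $t > 0$, as in Boucheron et al. (2013).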
Proceeding along similar lines as before, we show that $K_{n,r}$ is a sub-Poisson random variable. This result is implicitly proved in the Supplementary material of Ayed et al. (2017), but for the sake of completeness we also report it here.
Proposition 3.2
For any and , the random variable is sub-Poisson with variance factor . Indeed, for any the following bound holds true
[TABLE]
where .
Proof.
Exploiting the independence of the random variables ’s, for any we can write:
[TABLE]
where we have used the inequality , for any . ∎
The previous proposition and the Chernoff bounds imply an exponential tail bound for , indeed one can prove that
Corollary 3.2
For any , and the following holds true
[TABLE]
Corollaries 3.1 and 3.2 provide us with concentration inequalities for the missing mass $M_n$ and the statistic $K_{n,r}$, respectively, around their means. These results have been derived without any assumption on the probabilities $(p_j)_{j\geq 1}$ and hold for all elements of the parameter space (3). In the next section, we will focus on the class of regularly varying probabilities and, after recalling the nonparametric estimator proposed by Ayed et al. (2017), we will prove that this estimator is consistent within such a subset of the parameter space.
4 A consistent estimator for regularly varying feature probabilities
Ayed et al. (2017) have introduced a nonparametric estimator of the missing mass, defined as follows
$$\hat{M}_n = \frac{K_{n,1}}{n}. \tag{26}$$
Namely, $\hat{M}_n$ is the number of features having frequency one divided by the sample size $n$. Such an estimator is attractive both from a theoretical and a computational standpoint. On the one hand, it admits two different interpretations: as a jackknife estimator in the sense of Quenouille (1956) and as a nonparametric empirical Bayes estimator in the same spirit as Efron and Morris (1973). On the other hand, it is easy to implement. See Ayed et al. (2017) for details. Here we want to study the consistency of (26). We have seen that, without assumptions on the feature probabilities, no estimator of the missing mass is universally consistent (Theorem 2.1); hence we study the consistency of (26) under the ubiquitous assumption of heavy-tailed probabilities $(p_j)_{j\geq 1}$. We rely on the theory of regular variation of Karamata (1930, 1933) (see also Karlin (1967)) to define a suitable class of heavy-tailed $(p_j)_{j\geq 1}$, and we show that, within this class, $\hat{M}_n$ is multiplicative consistent.
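The estimator described above is simple enough to try out numerically. The following Python sketch compares it with the true missing mass in a simulated Bernoulli product model; the truncation to finitely many features, the power-law probabilities proportional to $1/j^2$, and the function names are assumptions made only for this illustration.

```python
import numpy as np


def good_turing_missing_mass(counts, n):
    """Number of features observed exactly once, divided by the sample size."""
    return np.count_nonzero(counts == 1) / n


def missing_mass(p, counts):
    """True missing mass: total probability of the unobserved features."""
    return float(np.sum(p[counts == 0]))


# Hypothetical heavy-tailed setup: p_j = 0.5 / j^2, truncated to J features.
J = 200_000
p = 0.5 / np.arange(1, J + 1, dtype=float) ** 2
rng = np.random.default_rng(0)

for n in (10**3, 10**4, 10**5, 10**6):
    counts = rng.binomial(n, p)      # per-feature counts, Binomial(n, p_j)
    est = good_turing_missing_mass(counts, n)
    true = missing_mass(p, counts)
    print(f"n={n:>8d}  estimate={est:.3e}  missing mass={true:.3e}  "
          f"ratio={est / true:.3f}")
```

Under these assumptions the printed ratio should move closer to one as the sample size grows, which is the behaviour that multiplicative consistency describes; a single simulated run is of course only suggestive.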
We use the limiting notation $f \sim g$ to mean $f/g \to 1$; we further write $f \lesssim g$ if there exists a fixed constant $C > 0$ such that $f \leq C g$. Then, similarly to Karlin (1967), we give the following
Definition 4.1
Let $p \in \mathcal{P}$ and define the measure $\bar{\nu}(x) := \sum_{j\geq 1} \mathbb{1}\{p_j \geq x\}$, which is the cumulative count of all features having no less than a certain probability mass. We say that $p$ is regularly varying with regular variation index $\alpha \in (0,1)$ if $\bar{\nu}(x) \sim x^{-\alpha} \ell(1/x)$ as $x \downarrow 0$, where $\ell$ is a slowly varying function, that is $\ell(cx)/\ell(x) \to 1$ as $x \to +\infty$ for all $c > 0$.
Let us remark that if we denote by $(p_{[j]})_{j\geq 1}$ the sorted probabilities in decreasing order, Definition 4.1 is equivalent to
$$p_{[j]} \sim j^{-1/\alpha}\, \ell^{*}(j)$$
as $j \to +\infty$, where $\ell^{*}$ is another slowly varying function. For simplicity, the relation between $\ell$ and $\ell^{*}$ is skipped here; interested readers can refer to Lemma 22 and Proposition 23 of Gnedin et al. (2007). Definition 4.1 is in the same spirit as Karlin (1967), but for our purposes here we consider the case $\sum_{j\geq 1} p_j < +\infty$, while in Karlin (1967) the $p_j$'s satisfy the more restrictive condition $\sum_{j\geq 1} p_j = 1$. The next theorem is similar to a result proved by Karlin (1967) and provides the first order asymptotic of $K_{n,r}$.
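A simple family satisfying the definition is given by pure power-law probabilities. The computation below is a worked example added for illustration; the constants $c$ and $\alpha$ are hypothetical, and the count of features with probability at least $x$ is written out explicitly.

```latex
% Illustrative choice (not from the paper): p_j = c\, j^{-1/\alpha},
% with 0 < c < 1 and 0 < \alpha < 1, so that \sum_j p_j < \infty.
\#\{ j \ge 1 : p_j \ge x \}
  = \#\{ j \ge 1 : j \le (c/x)^{\alpha} \}
  = \lfloor (c/x)^{\alpha} \rfloor
  \sim c^{\alpha}\, x^{-\alpha}
  \qquad \text{as } x \downarrow 0 .
```

In this example the regular variation index is $\alpha$ and the slowly varying function is the constant $c^{\alpha}$; the probabilities are already sorted in decreasing order and decay like $j^{-1/\alpha}$.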
Theorem 4.1
Let be regularly varying with . If denotes the Gamma function, then as , .
Proof.
It is worth recalling the notation already used in Section 3
[TABLE]
roughly speaking can be considered an asymptotic approximation of . Indeed, in order to prove the theorem, we first show that as , and then we prove that . In order to prove the former asymptotic equivalence it is worth noticing that (Gnedin et al., 2007, Proposition 13) applies also for the feature setting under regularly varying heavy tails, indeed the measure defined by is such that
[TABLE]
Since is the Laplace transform of multiplied by a suitable quantity, we can apply Tauberian theorems to connect the asymptotic behaviour of the cumulative distribution function of given in (27) to that of . In particular, from Tauberian theorems (see Feller (1971)), we obtain
[TABLE]
as . As a byproduct of (28), we get . Finally to show , we can easily observe that (Gnedin et al., 2007, Lemma 1) applies in this setting as well, hence there exists a constant such that
[TABLE]
as . From (29), along with , we obtain
[TABLE]
in other words we have shown that as .
∎
We are now ready to prove that is multiplicative consistent, when the feature probabilities are regularly varying. In the proof we will use the concentration inequalities of Section 3 along with Theorem 4.1 to tune the concentration inequalities under the assumption of regular variation.
Proposition 4.1
Let be regularly varying with index . Let be the nonparametric estimator of the missing mass in a sample of size , then is strongly multiplicative consistent, i.e. .
Proof.
In order to prove the multiplicative consistency we first show that and that . As for the former convergence, we can use the concentration inequality (25) given in Corollary 3.2 when , which, for any , gives
[TABLE]
When is fixed, we can use the asymptotic in Theorem 4.1 to say that
[TABLE]
which implies that for any , by the first Borel-Cantelli lemma, hence .
Analogously we may use Corollary 3.1 to prove the almost sure convergence to of the ratio . Indeed, for any , we have
[TABLE]
By observing that , the previous upper bound boils down to
[TABLE]
Now, using again Theorem 4.1, it is not difficult to see that for any fixed
[TABLE]
then, by the first Borel-Cantelli lemma, we get , as well.
By the previous results the consistency of easily follows, indeed
[TABLE]
since all the ratios on the r.h.s. converge to almost surely. ∎
References
- [Auton et al. (2015)] Auton, A. et al. (2015). A global reference for human genetic variation. Nature, 526, 68–74.
- [Ayed et al. (2017)] Ayed, F., Battiston, M., Camerlenghi, F. & Favaro, S. (2018). A Good-Turing estimator for feature allocation models. Submitted.
- [Ayed et al. (2018)] Ayed, F., Battiston, M., Camerlenghi, F. & Favaro, S. (2018). On consistent estimation of the missing mass. Preprint arXiv:1806.09712.
- [Ben-Hamou et al. (2017)] Ben-Hamou, A., Boucheron, S. & Ohannessian, M.I. (2017). Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications. Bernoulli, 23, 249–287.
- [Beirlant & Devroye (1999)] Beirlant, J. & Devroye, L. (1999). On the impossibility of estimating densities in the extreme tail. Statistics and Probability Letters, 43, 57–64.
- [Bingham et al. (1987)] Bingham, N.H., Goldie, C.M. & Teugels, J.L. (1987). Regular Variation. Cambridge University Press.
- [Boucheron et al. (2013)] Boucheron, S., Lugosi, G. & Massart, P. (2013). Concentration inequalities. Oxford University Press.
- [Chao et al. (2014)] Chao, A., Gotelli, N.J., Hsieh, T.C., Sander, E.L., Ma, K.H., Colwell, R.K. & Ellison, A.M. (2014). Rarefaction and extrapolation with Hill numbers: a framework for sampling and estimation in species diversity studies. Ecological Monographs, 84, 45–67.
