A truncation model for estimating Species Richness
Fran\c{c}ois Koladjo, Mesrob I. Ohannessian, \'Elisabeth Gassiat

TL;DR
This paper introduces a semiparametric truncation model for estimating species richness, incorporating an unknown threshold to distinguish rare from abundant counts, and demonstrates its efficiency and relation to existing estimators.
Contribution
It proposes a novel semiparametric truncation model with an unknown threshold for species richness estimation, including new estimators with proven asymptotic efficiency.
Findings
The proposed estimators are asymptotically efficient.
The model recovers Chao's lower bound estimator as a special case.
Simulation results show competitive performance compared to existing methods.
Abstract
We propose a truncation model for abundance distribution in the species richness estimation. This model is inherently semiparametric and incorporates an unknown truncation threshold between rare and abundant counts observations. Using the conditional likelihood, we derive a class of estimators for the parameters in the model by a stepwise maximisation. The species richness estimator is given by the integer maximising the binomial likelihood when all other parameters in the model are know. Under regularity conditions, we show that the estimators of the model parameters are asymptotically efficient. We recover the Chaos lower bound estimator of species richeness when the model is a unicomponent Poissons model. So, it is an element of our class of estimators. In a simulation study, we show the performances of the proposed method and compare it to some others.
| Mean | Inf () | Sup () | Mean | Inf | Sup | Mean | Inf | Sup | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mean | Inf | Sup | Mean | Inf | Sup | Mean | Inf | Sup | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Est | Mean | rMAE | rMSE | Mean | rMAE | rMSE | Mean | rMAE | rMSE | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | ||||||||||||
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsCensus and Population Estimation · Data-Driven Disease Surveillance · Bayesian Methods and Mixture Models
A truncation model for estimating Species Richness
François Koladjo1, Mesrob I. Ohannessian2, Elisabeth Gassiat3 Corresponding author. Email address: [email protected]
Abstract
We propose a truncation model for abundance distribution in the species richness estimation. This model is inherently semiparametric and incorporates an unknown truncation threshold between rare and abundant counts observations. Using the conditional likelihood, we derive a class of estimators for the parameters in the model by a stepwise maximisation. The species richness estimator is given by the integer maximising the binomial likelihood when all other parameters in the model are know. Under regularity conditions, we show that the estimators of the model parameters are asymptotically efficient. We recover the Chaos lower bound estimator of species richeness when the model is a unicomponent Poissons model. So, it is an element of our class of estimators. In a simulation study, we show the performances of the proposed method and compare it to some others.
1Université de Parakou, ENSPD BP 55 Tchaourou, Bénin
2Toyota Technological Institute at Chicago
3Laboratoire de Mathématiques d’Orsay, Univ. Paris-Sud, CNRS, Université Paris-Saclay, 91405 Orsay, France.
1 Introduction
We consider the “species richness” problem, also known as the problem of estimating the number of species, which arises when a sample of individuals is taken from a population with classes or species. The usual data set is a series of observed counts with being the total number of distinct species observed in the sample and is the parameter to be estimated. Estimating using such abundance data is an old problem that has been tackled in several ways, both by parametric models, including Bayesian models (Bunge & Barger, (2008), Barger & Bunge, (2008)), and by nonparametric models (Wang, (2010)). Due to their flexibility to account for heterogeneity, the nonparametric approaches are those predominantly considered in the last two decades. This setting contains among others the Chao-type estimators developed by Chao and collaborators (see for example Chao & Lee, (1992), Chao & Bunge, (2002), Chao & Jost, (2012)), and the likelihood-based nonparametric estimators of which one can cite Norris & Pollock, (1996, 1998)
Many of these methods, although theoretically founded on a single model, perform the common practice of truncating the data into abundant and rare species. One then assumes that the number of abundant species is adequately represented by the number of distinct such species, whereas the same number leads to an underestimate for the rare species and thus necessitates a correction. Such truncation is generally justified on the basis of avoiding instability. This, however, forces even initially nonparametric models to become effectively parametric, while losing the original hypothesis and the accompanying theoretical guarantees. This motivated us to study this heuristic in a more rigorous light. In particular, we make the following contributions:
- •
We give an explicit semiparametric model to represent this truncation practice, where the abundant species are represented by an arbitrary abundance distribution whose support is offset away from the rare range. We partially motivate this as arising from the commonly used Poisson mixtures as being inappropriate for modeling more abundant species.
- •
We show that the practice of pure truncation as described above is justified only when the abundant and rare species have abundance distributions whose supports are disjoint. In this case truncation leads to an efficient estimation of the number of species.
- •
In general, although pure truncation is not efficient, accounting for the support overlap leads to a hybrid truncation that is a semiparametric procedure which is efficient. We show this by using standard single-parameter families to derive a local minimax bound and a matching (asymptotically) efficient estimator. Coincidentally, we show that this framework recovers several previously suggested estimators as special cases.
- •
When the abundance threshold is not known, neither pure truncation nor the hybrid approach can be used directly. For this reason, the proper offset should be obtained from data. We present a model selection approach to resolve this problem. Our experiments show that this approach adapts to the true unknown offset, in the sense that the resulting estimator achieves (almost) the same asymptotic performance as knowing the offset.
- •
We illustrate this estimator on both synthetic and real data, showing that our more refined analysis leads to practical improvement.
2 Model and Estimator
2.1 Problem Statement
Assume that species exist in nature and that each is represented by individuals in a sample. We call the abundance of species in the sample. A classical statistical model of the abundances is to assume that they are independent random variables identically distributed according to a distribution for and where is an index within a class of abundance distributions. One of the more common choices of abundance distribution classes are Poisson mixtures indexed by a mixing distribution on :
[TABLE]
Of course, we do not get to access non-observed species, i.e. species for which . If we let denote the number of distinct observed species, i.e. , and if we re-index and relabel those species as , then it is easy to show that these observed abundances are independent and identically distributed according to the zero-truncated distribution:
[TABLE]
The central problem of this paper is that of estimating the number of species with a functional of the abundances of the observed species. In other words, needs to complement the number of observed species with an estimate of the number of non-observed species.
As outlined in the introduction, a long line of research has addressed this problem. But we focus here in particular on a sequence of influential papers (Chao & Lee, (1992), Chao & Yang, (1993)), the methodology of which continues to be used in more recent papers such as Chao & Bunge, (2002) and Wang & Lindsay, (2005). In theory, these results are within the current framework but, in practice, the estimation is done as follows. The data is divided into rare and abundant components according to an abundance threshold . Although their estimators are derived and analyzed under the general model, the theoretical estimators are fed with only those abundances such that , to yield an estimator of the number of rare species . For the abundant species, they use the trivial estimator:
[TABLE]
The estimate for the total number of species is then simply the sum of both:
[TABLE]
What is the justification behind such truncation? This paper strives to answer this question and to give a more principled model of this common practice, thus leading to a more transparent methodology.
2.2 Truncation Model
To motivate the reason behind truncation, note that the justification often given in this line of work (Chao & Lee, (1992), Chao & Yang, (1993), Chao & Bunge, (2002), Wang & Lindsay, (2005)) is that including the abundant species into the estimator may cause instabilities. We can interpret this as the abundance sampling model, and in particular the Poisson mixture model, as not being a good model for abundant species. In this section, we first give some informal insight as to why this may be the case. We then proceed to present an explicit model to handle this rare-abundant dichotomy.
The abundance model can be traced back to a simple sampling model where individuals are drawn independently and identically (with replacement) from a population, where the frequency of species is . If such individuals are drawn, let denote their species. In this model, the abundance of species has therefore a binomial distribution of parameters and :
[TABLE]
If the species are not labeled a priori, which corresponds to a random permutation among the species, then the distribution of a particular abundance becomes a mixture of binomial distributions, with mixture weights at . Note that these abundances are not independent as in the abundance model, but are rather exchangeable. Notwithstanding this fact, we can see that the abundance model of Equation (1) effectively replaces this binomial mixture with a Poisson mixture, which cannot be accurate for abundant species.
The source of the instability is due to the fact that a Poisson distribution with a large mean places much more mass near [math] compared to a corresponding binomial. More precisely, if the model substitutes a binomial mixture with a Poisson mixture, then when an estimator places a mixing mass at a higher abundance, it contributes more to than a binomial would. This is then interpreted as evidence of more unseen species than the reality, and thus is overestimated. This is indeed what is observed with such estimators: with larger values of the truncation , the estimate of tends to increase (see for example the last three columns of Table 2, page 949, and the last two columns of Table 13, page 956, in Wang & Lindsay, (2005)). That said, simply truncating the data is not a theoretically sound approach since the resulting samples no longer follow the hypothesized model. For example Poisson distributions place a positive mass, even if small, beyond any threshold. There is therefore a need to rigorously model rare species, say with mixtures of Poisson distributions, while capturing the possibility that there may be abundant species that have much less influence on our inference about the rare species.
In this paper, we propose the following semiparametric alternative. Let and let be the family of discrete distributions supported on . We assume that abundances follow a distribution that belongs to a model , as follows. For define:
[TABLE]
with , where is a parametric model that represents mostly rare species (e.g. we may think of a finite mixture of Poisson distributions, and more generally we ask for , where is an appropriate subset of for some ), and where is a nonparametric component that represents abundant species.
Using Equation (2) and the fact that the nonparametric component vanishes at , this model induces the zero-truncated version as follows:
[TABLE]
We leave the choice of open, except for certain identifiability and smoothness assumptions that we later spell out in detail. Thus is not necessarily a Poisson mixture. The choice of a parametric model for is justified by the fact that even originally nonparametric models are effectively reduced to parametric classes under the constraint of identifiability from a small (truncated) support.
It is now clear that our model in Equation (5) makes explicit the notion that rare and abundant species may coexist. This allows us to bypass heuristics and suggest estimators with provable performance guarantees. In particular, we may harness the basic theory of semiparametric models to establish the efficiency of likelihood-based estimators, and suggest potential model selection mechanism for the choice of . Furthermore, as we make no further assumptions beyond adopting a parametric form for the rare component and dislocating the support of the abundant species away from zero, we have a model that can go beyond a simple justification of truncation. For example, one may think of as a nonparametric corruption to the data, rather than a legitimate measurement of abundant species, and our analysis and methodology still goes through unaffected.
2.3 Estimator of the Number of Species
The estimator that we propose for falls under the category of maximum likelihood (MLE)-type M-estimators. In this section we derive and define the estimator, and in the next section we study some of its asymptotic properties.
Let , be the empirical counts of the abundances, which are sufficient statistics for computing likelihoods. The combined likelihood of and the rest of the model parameters given the samples can then be written as follows:
[TABLE]
It is interesting to note that this likelihood may be decomposed into the product of two likelihoods. The first is the likelihood of given the rest of the model parameters and the number of distinct samples . This has a binomial form and we denote it by . The second is the likelihood of the rest of the model parameters, given the samples, and we denote it by . After substituting by its expression in , in particular letting , and noting that , we can write these likelihoods respectively as follows:
[TABLE]
and
[TABLE]
suggesting two methods to undertake the maximum likelihood estimation from . Note that some of the earliest works to suggest such a decomposition were Sanathanan, (1972, 1977) (see also Mao & Lindsay, (2003, 2007) for a more recent treatment).
The first method is to maximize directly the likelihood over all of The estimator of obtained from this method is typically called the unconditional maximum likelihood estimator. For example, some nonparametric models with unconditional estimation methods are proposed in Norris & Pollock, (1996), Böhning & Schün, (2005). The second method to obtain a maximum likelihood estimator of is to first maximize the likelihood from the zero-truncated model to derive the estimators of and and then to maximize the binomial likelihood in the parameter given that and are known. This method is known as the conditional maximum likelihood method for estimating .
We consider here only the conditional maximum likelihood method. Before we proceed with the estimation of , , and , note that maximizing the binomial likelihood in Equation (7) gives us the form of our estimator:
[TABLE]
The final expression for the estimator therefore consists of estimating by and by , in a manner that we shortly outline, and then substituting in Equation (9) to obtain . Of course, is an integer parameter, and we could then take the integer part of the resulting estimate. That said, in what follows we allow ourselves to accept non-integer estimates.
We now proceed to estimate the parameters. We observe first that since plays no role in the expression for , we can treat it as a nuisance parameter. The next observation is that to maximize , we can successively fix some parameters while we maximize over others. Because is mostly a nuisance parameter, we maximize the likelihood when and are fixed without further constraining to be a proper distribution. This approach gives us, at each support point , the following pseudo-estimator, as a function of and :
[TABLE]
where denotes the number of species with abundance no greater than .
The reason we call a pseudo-estimator is that it may put negative mass at some of its support points as it is not constrained to be nonnegative. This occurs for example at the non-observed support points of , that is for a support point such that . Despite this fact, the estimators for and that follow from this choice of are not sensitive to its impropriety.
Replacing by its pseudo-estimator in the expression for leads to an objective function for and which may now be maximized in . This leads to an MLE-type estimator of , still as a function of :
[TABLE]
Note that is always non-negative. However, for particular values of , , and , it could be larger than . If this occurs in practice, we simply constrain it to to obtain a valid probability. The consistency result in the next section shows that this is not a concern, asymptotically.
The last step is to find a proper estimator of . Consider the following simplifying notation. For a fixed , let denote the truncated version of the density defined as
[TABLE]
By replacing and by their estimators in the conditional likelihood , we can show that we obtain (up to factors that do not depend on ) the following truncated likelihood:
[TABLE]
The estimator is then simply a maximizer of Equation (13). We can thus see that is an MLE of the truncated density , based on the first abundance counts. This completes our estimator construction. Indeed, to estimate , we first compute directly from the samples by maximizing Equation (13), we then calculate using Equation (11), and lastly we substitute both to obtain using Equation (9).
We conclude by noting that all the derivations we performed were based on the premise that a value of was given. As and depend on , in what follows either we make this explicit by writing and respectively or keep it implicit when the notation gets encumbered. Similarly we write . We also sometimes use the notation instead of to make it explicit that the estimator of depends on .
2.4 Relationship to Other Estimators
Despite the fact that is estimated by truncating the model to the abundance values between and , our estimator differs from the traditional truncation with conditional MLE-type estimators often described in the literature, as overviewed in the introduction. To be precise, assume the same parametric rare-species model is used for , and recall that in these classical estimators the data is truncated and the conditional MLE is solved using the zero-truncated version of to obtain , and then the rare-species count is estimated by:
[TABLE]
The abundant species are then assumed to be represented exactly by what is seen:
[TABLE]
The combined estimator is therefore:
[TABLE]
The following proposition identifies the condition under which our estimator is equivalent to this classical estimator.
Proposition 1
If is supported on then the two estimators and are equivalent.
Proposition 1 means that if the parametric part and the nuisance parameter in the model are supported on disjoint sets, then one can split the data set into rare-species data and abundant-species data . In this context, inference on rare species is not affected by the estimation of the nuisance parameter and thus throwing away high-abundance data is justified. On the other hand when does extend over all integers, then one should not ignore any part of the data, and instead one should perform a hybrid truncation, as suggested by in order to obtain efficient estimators.
Thus far we have considered the general context for any eligible distribution some particular cases enable us to make simple and concrete connections between and other popular estimators that come close to falling within our framework. In particular, Chao, in Chao, (1984), suggests the following popular estimator
[TABLE]
The following proposition shows that our estimator , for and corresponding to a pure Poisson distribution, is equivalent to Chao’s . As such, we can interpret our estimator as a generalization of Chao’s, where is no longer restricted to and where may be more general than a pure Poisson distribution.
Proposition 2
*Assume that and for all Then *
It is worth noting that Zelterman in Zelterman, (1988) explicitly considers the pure Poisson model with access only to the first two counts and , and suggested as an estimator for , showing certain robustness properties under heterogeneity in the true model (see Zelterman, (1988) for more details). It is indeed straightforward to verify that for and a pure Poisson model for , we have that our estimator maximizing the truncated likelihood of Equation (13) corresponds to that of Zelterman: .
An effective proof of Proposition 2 is given in the work of Böhning et al in Böhning *et al. *, (2013) within an alternative framework: using the conditional expectation of . In the Appendix, we propose a simpler proof that is more in line with our framework, by plugging-in , which as noted is the correct conditional ML estimate, into our expression for
We end by noting that if the abundant species are not taken into account, we would have the estimator of given by . This estimator is in the spirit of pure-truncation, and would clearly deviate from (the denominator has no factor). Since we establish the latter to be consistent within our model, then it follows that the former is not (see also Böhning & van der Heijden, (2009) for a more quantitative comparison between Chao’s and Zelterman’s estimators of .)
2.5 Choice of via Model Selection
To end the discussion of our estimator of , we stress once again that depends on the integer truncation parameter , which delimits the zone of influence of the abundant species through the support of the nuisance parameter . When is not known, we need a procedure to estimate this parameter. This is effectively a model selection problem, which we now address using the Goldenshluger-Lepski (G-L) method as inspiration. The G-L method was introduced in Goldenshluger & Lepski, (2011) in the context of bandwidth selection for kernel density estimation. In the current paper we use it heuristically, without formal proofs. Experimental evidence, however, suggests that the method is very effective.
The principle of the method is as follows. As our estimator is of the form , we focus on the problem of estimating . Let us drop the argument from the notation, to make the exposition clearer. Assume that we have a known upper bound on the largest value could take, and let be the least that enables the necessary identifiability assumptions. If we relax the requirement that is positive on its support, we have successively smaller nested models as varies from to . Each of these models has a corresponding version of our estimator, that we denote by . The (squared) bias of each model is . The variance of each model is . The mean squared error risk decomposes as usual into the sum of bias and variance, . Now observe the following:
- •
For , the consistency result of Theorem 1 tells us we are asymptotically unbiased.
- •
For Theorem 2 shows that we are efficient and therefore asymptotically we have the least variance.
- •
For , the estimator becomes inefficient and the variance may be higher. Intuitively, this is because less of the data is used to estimate when the truncation is stricter.
- •
For , Theorem 1 tells us that we may have a non-vanishing bias. However, the variance itself may be lower simply because more -corrupted data is used to converge to an incorrect value of .
The inevitable bias-variance tradeoff thus manifests itself in this framework, and the best compromise in terms of risk will be achieved at the correct model class . If accurate proxies and are available, then we may empirically select a model near , by minimizing
[TABLE]
The bootstrap method is one effective way for estimating . In its simplest version, bootstrap consists of resampling points from the data and computing an estimator from the resampled data. Then this is repeated a number of times, say , and the variance is estimated as:
[TABLE]
While the resampling process of the bootstrap is good at quantifying the relative (to ) variability of the resampled estimators, it offers no absolute reference point, crucial for estimating the bias. Luckily, as we have argued, the larger model classes have small bias and can themselves be used as a reference point. The Goldenshluger-Lepski method suggests the following method to obtain a bias proxy:
[TABLE]
where stand for the non-negative part. The justification and behavior for this bias proxy needs to be rigorously established, as is done for kernel width selection in Goldenshluger & Lepski, (2011). For our heuristic use, we provide simply the intuition behind it. This formula can be interpreted by noticing that the maximum of over is indeed approximately the bias since, as we described, the smaller models are (asymptotically) unbiased. But because we only have access to instead of , and since the smaller models have higher variance, we place a conservative confidence bound on the end using in order not to overestimate the bias.
Equations (14) (the selection of ), (15) (the bootstrap variance proxy), and (16) (the bias proxy) completely specify a heuristic model selection procedure for estimating the integer truncation parameter .
3 Analysis of the Estimator
3.1 The semiparametric framework
We now analyze the convergence and optimality of our estimator in the context of efficient estimation, when the model contains nuisance parameters. We do so particularly in order to handle the nonparametric component within our semiparametric model. In the absence of such nuisance parameters, efficiency may be defined in terms of attaining the Cramér-Rao bound. In regular parametric models, the Cramer-Rao bound is the variance of the score function, itself (often) defined as the derivative of the log-likelihood, and efficient estimators are at first order empirical means of the score function. The nuisance parameters, however, can lead to unavoidable loss in the accuracy of any estimator. The notion of efficiency can then be extended by assessing new lower bounds to the variance of the parameters of interest. We provide the details in Section A.4 below, and describe here what is useful to state our results.
One can define a set of score functions relatively to the nonparametric part of the model, built using one dimensional submodels (see Section A.4 for details). Then, if is the usual score function (given by the partial derivative and gradient with respect to and respectively of the log-likelihood in the full model), the efficient score function related to is then defined component-wise as where is the orthogonal projection onto the closure of the linear space spanned by The efficient score functions play the same role for efficient estimators (if they exist) as the ordinary score functions for the maximum likelihood estimators in a parametric model with no nuisance parameter. Namely, they lead to the best asymptotic variance for any estimator. The corresponding efficient Fisher information is a matrix whose components are the variances and covariances of the various components of the vector of efficient score functions.
As such, this leads to what we shall give as formal definition of the properties of consistency and efficiency:
Definition 1
As , an estimator sequence is:
- •
Consistent*, if in probability.*
- •
Efficient* (asymptotically), if*
[TABLE]
Note that the typical asymptotics for estimator sequences rely on increasing sample size. The sample size in our problem is , as the samples consist of the positive (observed) abundances . Thus, the sample size is a random quantity. Despite this, it is clear that as , we also have that in probability, and we therefore think of the two asymptotic notions interchangeably.
One of the challenges is that in many models the efficient score is not amenable to be used in the same way as the ordinary score because the orthogonal projection might not be available in closed form. In Proposition 3 (stated and proved in Section A.4), we show that such a closed form can be obtained in our model, and give the expressions that ensue for the efficient score functions for estimating the parameters and in the model
3.2 Consistency and Efficiency
In what follows, when the true model lies within the hypothesized class , we refer to the true parameters by , , and , and to the true truncation by . We first list some regularity assumptions that we have recourse to throughout.
Assumptions
[Compactness] is a compact subset of . 2. 2.
[Identifiability] The parameter is identifiable from the truncated density , as defined by Equation (12). 3. 3.
[Continuity] For all in is a continuous function of , and for all in and .
Let us now move to the main results of this section, the consistency and efficiency of and whenever and some further properties that give more insight into these estimators. We begin with the consistency result stated below as Theorem 1.
Theorem 1
Under Assumptions 1-3, as tends to infinity, the following results hold:
If then and converge in probability to and respectively;
If then converges in probability to the set of maximizers of .
The results in Theorem 1 are remarkable since they ensure the consistency of and for a fixed , as long as it is smaller than or equal to its true value and identifiability holds. If, however, one chooses greater than then the proposed estimators may not be consistent. (This leads to the challenge of choosing via model selection when is unknown, as described in Section 2.5). We now complement this consistency result with efficiency properties.
Theorem 2
Consider Assumptions 1-3, and assume further that is , that , and that the efficient Fisher information is non-singular at . Then, is asymptotically efficient at
Remark 1
The estimators and have the following properties:
* depends on the observations no greater than and on the cardinality of those that are greater than *
* depends only on the observations no greater than *
These follow either from direct inspection or from the proof of Lemma 2 in the Appendix. In particular, solves the efficient score equation (41) which depends only on abundances no greater than Now, from equation (40), for a given estimator of the estimator depends on the greater than only through their cardinal and the property follows.
Theorem 2, asserts the efficiency of and and through them of the corresponding estimator of the total number of species . Remark 1 also sheds light on the fact that the latter depends only on: (1) the threshold (2) the number of observed species and (3) on the abundances of rare species (those that are not greater than ). In other words, as in the case of pure truncation, the abundant species contribute only through their cardinality. That said, distinguishes itself by using this cardinality to estimate how to weigh appropriately the respective contributions of both the rare and abundant species, using the parameter .
4 Simulations and Experiments
To illustrate the impact of truncation on our ability to estimate the number of species, we give some numerical simulations and experiments. To make our theoretical work concrete and results easily reproducible, we consider simple parametric families. In particular we look at a single Poisson distribution and a Gamma-Poisson mixture, which gives rise to the negative binomial distribution. In Section 4.1, we perform synthetic experiments for both, and use this to illustrate the heuristic method of selecting the best truncation. In Section 4.2, we consider real data in the form of literary texts, and confine ourselves to the negative binomial model. In order to be able to compare to a known ground truth, we adapt our number of species framework to the very related observational richness problem, and show that the choice of truncation has a significant impact on estimation accuracy.
4.1 Number of Species Simulations
4.1.1 Algorithms to compute
As we take to be a parametric family, many of the EM-style MLE algorithms for parameter estimation in such frameworks can be adapted to zero- to -truncated versions of the distributions. This is all that’s needed since, for a fixed value of computing is an optimization problem that amounts to solving the efficient score equations in (38), and by Theorem 1 this is equivalent to solving equation (41). For example, when is a Poisson distribution, it is not difficult to check that the truncated MLE (41) becomes exactly
[TABLE]
leading to the fixed point equation in which stands for the cumulative distribution function of the Poisson model with parameter This is equivalent to moment-matching and the solution could be found numerically by performing a bisection search, for example. Similar parameter searches can be performed for the truncated negative binomial distribution that we consider in this section. In a more complex model where is a finite mixture of Poisson distributions, that is when for all we can derive an EM algorithm for the truncated MLE similarly to the classical Poisson mixture. We do not elaborate this further, except to mention that each EM iteration entails the solution of fixed point equations, as in (17), for each Poisson component.
Design
To investigate the performance of the new estimator and compare it to other existing estimators, we conducted a set of experiments with synthetic data.
In the first set of these experiments, we take the abundances of rare species to be distributed according to a single Poisson distribution with parameter and the nuisance distribution (of abundant species) is the uniform distribution on The resulting distribution has density with and the aforementioned uniform distribution. Now, for any fixed we generate a sample of size from the Bernoulli model with parameter then generate the corresponding counts observations according to the Poisson or uniform model. The parameters and are fixed equal and respectively whereas ranges over The observed zero-truncated counts are used to compute our new estimator and some other existing estimators with which will be compared.
To show that the results extend to other parametric families, we perform a limited set second experiments, where we take the abundance of rare species to be distributed according to a Gamma-Poisson mixture, which leads naturally to the negative binomial distribution. In particular, in this case is two-dimensional, consisting of real parameters and , and in Equation (1) we have . This results in being the negative binomial distribution with parameters and . We fix and take to vary over the range . Larger values of are needed to learn this model even in the absence of nonparametric noise. We consider the range of and we generate a sample of size from the Bernoulli model with parameter then generate the corresponding counts observations according to the negative binomial or uniform model. That is, the observational model is as before, .
Risk approximation using G-L method
We use the simulations as an opportunity to illustrate the G-L method and show the quality of the risk estimation by the proposed proxy in the selection rule. As displayed in Figure 1, the proxy provides a good approximation of the true risk when is estimated by a bootstrap procedure as in Equation (15). Note that in this numerical example we calculate the risk, bias proxy, and variance proxy for instead of . The approximation is remarkably accurate especially in the region where the estimator is asymptotically unbiased that is for Its remains satisfactory, but not overly so, for some greater than . This indicates that the bootstrap procedure is a good choice to estimate Note that Figure 1 corresponds to the results of simulations of the single Poisson model with parameters and We obtain similar results for all other parameter choices.
Performances of
We focus on the performance of by calculating its Monte-Carlo mean and the renormalized standard error based on samples. We also investigate the bootstrap-based confidence interval for by providing the estimated non-coverage probabilities
[TABLE]
and
[TABLE]
where is the bootstrap-based confidence interval using the estimated model from the Monte-Carlo sample. For the single Poisson model, the results are summarized in Table 1. It is clear that the renormalized decreases when grows and increases as becomes larger. As the small values of characterize small abundances and that a high value of means that there is a large number of rare species in community according to the simulated model the observed variation of suggests that a high number of rare species will be estimated with larger variance. We can also notice that the decreases with in all simulated configurations showing the accuracy of the method when becomes larger. As the large values of describe the asymptotic regime of the estimators and we believe that the observed accuracy is related to the asymptotic efficiency of those estimators which improves the variance and then the mean square error MSE of as will be seen later. Table 2 summarizes the results for the Gamma-Poisson mixture model, with very comparable observations. Note that both sets of experiments show that we cannot rely on bootstrap confidence intervals as true intervals for the estimator. While the bootstrap is adequate in estimating the variability of the estimator, it does not accurately convey its location. It exhibits a clear skew to smaller values, which could be explained by the fact that resampling from the base distribution reduces the number of distinct observations. Therefore more principled methods are needed to go beyond point estimates in species richness estimation. One such avenue is through the use of concentration inequalities, Ben Hamou *et al. *, (2017).
Comparison with other estimators
We end the simulations by comparing the proposed estimator of the number of species to other existing one in literature. We focus entirely on the single Poisson model, which represents the ground truth assumption of many of theses estimators. We consider the Chao’s estimator defined as lower bound for and proposed in Chao, (1984), the coverage based estimator proposed in Chao & Lee, (1992) by Chao and Lee, the estimator of using the expected proportion of duplicate species in the sample (by Chao and Bunge in Chao & Bunge, (2002)), the nonparametric MLE of using a penalized likelihood (by Wang and Lindsay in Wang & Lindsay, (2005)) and an extension of Chao’s estimator proposed by Lanutheang and Böhning in Lanumteang & Böhning, (2011). The criteria used for this comparison (Mean, rMAE: relative Mean Absolute Error and rMSE: relative Mean Square Error) are computed and presented in Table 3. The six estimators display a good performance in all simulated configurations and seems to better estimate than all other methods. This is quantified in Table 3, by the remarkably small value of as compared to the others. This shows that, despite our results being about the asymptotic efficiency of and , we can expect finite-sample improvements for the estimator , when is moderately large. Also note that all six estimators become less reliable for very small value of or large value of explaining thus the common difficulty for these approaches to better approximate in the case of a large number of rares species, which touches upon the inherent problems of unidentifiability Mao & Lindsay, (2007).
4.2 Observational richness in text data
Rather than estimating the absolute number of species, an important extension of the species richness problem is concerned with estimating the number of distinct species to be observed in a sample larger than the current sample of individuals. Indeed, the abundance data is ostensibly obtained by performing a sampling of individuals. If the said sample is enlarged, then how do the new abundances relate to the original ones? In the words of Fisher Fisher *et al. *, (1943), in a pure Poisson abundance model: “Obviously, [the parameter ] will be proportional to the size of the sample taken […]”. This is most easily seen in the individual sampling model of Equation (3): when the binomial size parameter is changed from to , the parameters of the corresponding Poisson mixture are changed from to .
Generally in a Poisson mixture model, therefore, a factor increase in the sample size is equivalent to a dilation of the mixture distribution. Let denote the expected number of distinct symbols in the enlarged sample, and thus . The observational richness estimation problem can thus be concretely stated as the problem of estimating , based on .
One application of the observational richness problem is to forecast the vocabulary of an author, from a portion of their text. This was popularized in the work of Efron & Thisted, (1976), who applied this methodology to the complete works of William Shakespeare. The problem goes back to the work of Good & Toulmin, (1956), who approached it from an empirical Bayesian perspective, without any specific parametrization. The earlier work of Fisher *et al. *, (1943) also implicitly addressed the same problem.
Here, we restrict ourselves to the context of a parametric Poisson mixture abundance model for , that is as in Equation (1), with appropriately parametrized by . We require the family of such densities to be closed under dilation, as in for all and , there exists , such that for all measurable subsets , . Furthermore, we assume that for fixed , the transformation is continuous in the sense that if a sequence then the sequence . Note that for discrete mixtures, the scaling simply shifts the supports by , and for continuous mixtures it expands and scales the density by , and the requirement in either case is for the resulting density to remain an element of the parametric family.
As we focus primarily on text data, the Gamma-Poisson mixture family is very well-suited. Recall that in this case is two-dimensional, consisting of real parameters and , and , and the corresponding negative binomial distribution with parameters and . To dilate the Gamma distribution, it is easy to see that one simply scales . This corresponds to a transformation of the negative binomial parameter .
This paper’s framework applies to this problem as follows. If the rare abundances are well modeled by a Gamma-Poisson mixture while the abundant ones are not, then our framework allows us to efficiently learn the parameters and . By continuity, for fixed we also have an efficient estimator of . Since is assumed to stay constant, we then have
[TABLE]
We could therefore use our estimates and to evaluate and thus to estimate as follows:
[TABLE]
The data we look at is French playwrite Molière’s Tartuffe play, which we gradually observe a portion of and try to estimate the number of distinct vocabulary words. Thus, the scale is the ratio of the total text size to the size of the observed text, varying from [math] to . For this problem, we illustrate for various choices of and also the Goldenshluger-Lepski selected in Figure 2. Note how quickly the result becomes an accurate estimate of the vocabulary. But most importantly, note how sub-optimal choices of the truncation can adversely affect the performance of the estimator.
5 Conclusion
In this paper, we revisited the species richness estimation problem and studied a commonly followed practice of truncating the data into rare and abundant species. We proposed a semiparametric framework to model such a truncation as a parametric component well-suited to model rare species and a nonparametric nuisance component to cover the abundant species in an agnostic manner. We showed that asymptotic efficiency in this framework requires handling the truncation more delicately. This is in particular true if the rare species model has a significant overlap with the abundant species. Finally, we proposed a heuristic method to learn a good truncation threshold from data.
Several possible avenues of investigation may be proposed. We already mentioned the importance of going beyond point estimates. One would also like to relax the assumption that the abundant species are truly located entirely away from zero. In particular, it is important to handle the situation when such a dichotomy arises from an underlying binomial mixture model. Some recent approaches to species richness have successfully used Chebyshev polynomials as a fitting model, see for example Orlitsky *et al. *, (2016), and one would like to understand the relationship between such fits and mixtures of Poissons. Finally, one would hope that a truncation threshold that automatically conforms to the underlying model could make the most of the available data and thus give a fundamental theoretical edge, perhaps in the form of adaptive rates.
Appendix A Proofs
A.1 Proof of Proposition 1
For any fixed the classical conditional MLE satisfies
[TABLE]
Let us consider now the new conditional MLE proposed in this work.
[TABLE]
But using equation (11), we have
[TABLE]
from which we get
[TABLE]
Now, if is supported on then equals in the last expression which finally gives .
A.2 Proof of Proposition 2
For equals and being the Poisson distribution with parameter it is not difficult to see that Then,
[TABLE]
The last equality holds by replacing and by and respectively.
A.3 Proof of Theorem 1
To prove we use the fact that is the maximum likelihood estimator in the model with density . Note that maximizing the likelihood of equation (13) amounts to maximizing in the criterion which as tends to infinity converges almost surely to when . Moreover, we have
[TABLE]
On the right hand side of this inequality, converges almost surely, thus in probability, to zero. Also, is bounded since , for all and all We conclude that
[TABLE]
It is easy to see that
[TABLE]
attains uniquely its maximum equals zero at since the true model is identifiable, as assumed. We then obtain
[TABLE]
where is the euclidean distance between and . As maximizes we have o This, together with the condition in equation (18) and the above convergence in probability, entails that converges in probability to as tends to infinity. This result holds from Theorem in van der Vaart, (1998).
To end part of the theorem, recall Equation (11):
[TABLE]
We then observe from the law of large numbers that as tends to infinity, converges almost surely to when Recall that we assume to be continuous in for each . Thus using the continuous map theorem and the convergence in probability of to , we find that and converge in probability to and respectively when . We finally obtain the convergence in probability of to
[TABLE]
using once again the continuous map theorem. This ends the proof of part of Theorem 1.
Similar arguments to what we have given here can be used to prove part of the Theorem, namely that if , then converges in probability to the set of maximizers of , in the sense that the probability of falling in an -dilation of this set tends to as .
A.4 Efficient score functions and efficient Fisher information
We now build up some notation.Let denote the set of measurable functions defined on the support of by
[TABLE]
For a given in a real number and a vector of dimension let us define and This parametrization of and defines a path (a one-dimensional sub-model) in the model To simplify the notation, we let stand for and for . Recall the definition of score functions.
Definition 2
A differentiable path is a map from a neighborhood of [math] to with such that, for some measurable real valued function one has
[TABLE]
The one-dimensional sub-model is then said to be differentiable in quadratic mean at with score function .
A more useful way to determine the score function of a model such as is to take the derivative with respect to of the log-likelihood at that is
[TABLE]
We will use a dot-notation to indicate differentiation with respect to a parameter. Recall first the parametric score function and the parametric vector score function which are the partial derivative and gradient with respect to and respectively of the log-likelihood in the full model. We have respectively
[TABLE]
with the gradient function of the density
In the model defined in Equation (5), a straightforward calculation shows that the score function of the one-dimensional sub-model is such that
[TABLE]
where and are the scaling scalar and -dimensional vector of the parametrizations and respectively, and where denotes the usual inner product.
Now, we recall briefly the notions of tangent set and efficient score function for the model considered here. The maximal tangent set to the model at is the set of all score functions of a one-dimensional sub-model. We denote it , and in our case it is given by
[TABLE]
Consider again the path related to the model , but now with the parameters and fixed, then the tangent set at for the nonparametric part of the model in (5) is denoted and given by
[TABLE]
The efficient score function related to a given component of the parameter vector is then defined component-wise as where is the orthogonal projection onto the closure of the linear space spanned by
The expressions of the efficient score functions are given in the following proposition, the coefficients of the efficient Fisher information matrix are displayed in the proof of this proposition.
Proposition 3
The efficient score functions for estimating the parameters and are given for by
[TABLE]
and
[TABLE]
respectively. The efficient Fisher information is a matrix of order with coefficients given by equations (31)-(33).
Proof
Recall from (24) the definition of the tangent set of the nonparametric part of the one-dimensional sub-model:
[TABLE]
and let denote the closure of the linear space spanned by in With this notation, recall that the efficient score function related to a given component of the parameter vector is defined component-wise as where is the orthogonal projection onto in .
To reduce clutter, let refer to a particular component . We first give a closed form expression of the orthogonal projection. In particular, we have that:
[TABLE]
where is a constant depending on as follow
[TABLE]
To see this, first observe that for every score function in the model the projection is an element of the subspace so that it must be a linear combination (or a limit thereof) of elements of the form for some . Since the latter all vanish on the set , so does .
Next, let be any -integrable function that is orthogonal to the space , that is:
[TABLE]
In particular, note that such an is orthogonal to elements of itself. These, once again, have the form for some . By design, let us choose such that and for in the support of and elsewhere. It is easy to verify that such a choice does indeed lie within . On the other hand, the orthogonality of and in implies that:
[TABLE]
or equivalently
[TABLE]
As is strictly positive over its support, this implies that Thus all such must be constant on the support of .
Now let us specialize to the components of the efficient score function, by writing them as . Since we have thus determined that vanishes on and is constant over , we have therefore established the expression of the projection as in Equation (27) as claimed. To obtain the expression of the constant in Equation (28), we can once again use the fact that is a linear combination of for , in addition to the fact that for all such , to write:
[TABLE]
Now, we can easily compute the efficient score functions using:
[TABLE]
using the expressions of and in Equation (22), we explicitly get and . We start with . We have:
[TABLE]
We then determine from Equation (28),
[TABLE]
and we finally obtain as
[TABLE]
Moving on to , from Equation (22) and using the fact that , we have:
[TABLE]
Then
[TABLE]
and
[TABLE]
The efficient Fisher information matrix has coefficients defined as
[TABLE]
for all Recall that when we write , we are referring to a vector of score functions, whereas stands for the coordinate of . The computation of these coefficients leads to
[TABLE]
[TABLE]
[TABLE]
with the partial derivative of with respect to the coordinate of
A.5 Proof of Theorem 2
We first state and prove two lemmas that will be used for the proof of Theorem 2.
As usual, let be a component of the parameters vector , denote by the true value of (if it exists), and let be a closed neighborhood of . We denote by the subset of defined by
[TABLE]
Lemma 1
Let be a consistent estimator of If is twice continuously differentiable for every and Assumptions - hold, then is a Donsker class with square integrable envelope that contains with probability that tends to one.
Proof
We adapt the method used in Example from van der Vaart, (1998). Recall that a -bracket is a subset of such that The bracketing number is the minimum number of -brackets needed to cover and the bracketing entropy is the logarithm of this quantity. To show that is Donsker, we establish the sufficient condition that the square root entropy integral
[TABLE]
is finite. (See Theorem in van der Vaart, (1998).)
We begin by establishing continuity properties of the parametric efficient score functions. From the differentiability of in and the expressions given in Proposition 3, it is evident that is always a differentiable function of . Let us denote these derivatives by . For and we can respectively compute these as
[TABLE]
and
[TABLE]
[TABLE]
By inspection, we find that is always continuous itself, and that is also continuous provided that is in and for all , as assumed. These conditions also imply that have a finite -norm and that these functions are Lipschitz-continuous on . We thus have a non-negative bounded such that
[TABLE]
Now, from this Lipschitz condition, it follows that if then This means that we need as many -balls (a ball with radius ) to cover as we need -brackets to cover Since the number of -balls needed to cover is such that
[TABLE]
with a constant depending only on , it follows that the bracketing number is
[TABLE]
Thus the bracketing entropy is of order smaller than , whose square root is integrable near [math]. This establishes the sufficient condition of Equation (35), and thus is indeed Donsker.
To complete the other claims of the proof, note that for all in has a finite -norm and that for some , for all . The boundedness of is obtained from the expression of and The constant function is a square integrable envelope for We use the continuity of the map and consistency of to show that for all . This proves that contains with probability that tends to one and the lemma holds.
The result in Lemma 1 holds for and for all and thus also for their union We conclude that is a Donsker class with square integrable envelope that contains with probability that tends to one.
Lemma 2
* and solve the efficient score equations:*
[TABLE]
Proof
Note that the efficient score equation leads to equality
[TABLE]
whose solution is given by
[TABLE]
Likewise, if one sets to zero all the partial derivatives of the logarithm of the likelihood in (13), one has
[TABLE]
This equality is equivalent to with replaced by in The zero notation here refers to the null vector of .
We now prove the asymptotic efficiency of the estimators. Note that all results in this proof are stated under the restriction when necessary.
By Lemma 2, and are such that
[TABLE]
As and are free of and that this is also true for the plug-in estimators and , it is not difficult to verify that
[TABLE]
The asymptotic efficiency of follows from Theorem in van der Vaart, (1998). As assumptions, this theorem needs the assertions of Theorem 1 (consistency) and Lemma 1 (Donsker property), in addition to the following two convergence properties pertaining to the “plug-in” score functions. In particular, we need to show that our estimators satisfy:
[TABLE]
and
[TABLE]
where stands for , stands for the parametric plug-in , and and are the stacked vectors of components, and respectively.
To establish Equations (44) and (45), we can use for each parameter the continuity properties of per component, as in the proof of Lemma 1. In particular, note first that for all , we have that is constant. Therefore, for each parameter we need only to account for the convergence of for , all of which happen (in probability), by continuity.
It follows that for each , converges to [math] in probability, and since we have only finitely many distinct values, the convergence is uniform for all . Equation (44) is thus immediate.
On the other hand, in probability for each . By finiteness, it follows that , and consequently . By using once again the fact that is constant beyond , the convergence reduces again to finitely many convergences, and thus . We can therefore write:
[TABLE]
which completes the proof of Equation (45) and the theorem.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Barger & Bunge, (2008) Barger, Kathryn, & Bunge, John. 2008. Bayesian Estimation of the Number of Species using Noninformative Priors. Biometrical Journal , 50 (6), 1064–1076.
- 2Ben Hamou et al. , (2017) Ben Hamou, Anna, Boucheron, Stephane, & Ohannessian, Mesrob I. 2017. Concentration Inequalities in the Infinite Urn Scheme for Occupancy Counts and the Missing Mass, with Applications. Bernoulli .
- 3Böhning & Schün, (2005) Böhning, Dankmar, & Schün, Deter. 2005. Nonparametric maximum likelihood estimation of population size based on the couting distribution. Journal of Royal Statistical Society , 54 (4), 721–737.
- 4Böhning & van der Heijden, (2009) Böhning, Dankmar, & van der Heijden, Peter G. M. 2009. A covariate adjustment for zero-truncated approaches to estimating the size of hidden and elusive populations. The Annals of Applied Statistics , 3 (2), 595–610.
- 5Böhning et al. , (2013) Böhning, Dankmar, Vidal-Diez, Alberto, Lerdsuwansri, Rattana, Viwatwongkasem, Chukiat, & Arnol, Mark. 2013. A generalization of Chao’s estimator for covariate information. Biometrics , 69 (4), 1033–1042.
- 6Bunge & Barger, (2008) Bunge, John, & Barger, Kathryn. 2008. Parametric models for estimating the number of classes. Biometrical Journal , 50 (6), 971–982.
- 7Chao & Bunge, (2002) Chao, A., & Bunge, J. 2002. Estimating the number of species in a stochastis abundance model. Biometrics , 58 (September), 531–539.
- 8Chao, (1984) Chao, Anne. 1984. Nonparametric estimation of the number of classes in a population. Scand J Statist , 11 , 265–270.
