A truncation model for estimating Species Richness

François Koladjo; Mesrob I. Ohannessian; Élisabeth Gassiat

arXiv:1705.07509·stat.ME·May 23, 2017

TL;DR

This paper introduces a semiparametric truncation model for estimating species richness, incorporating an unknown threshold to distinguish rare from abundant counts, and demonstrates its efficiency and relation to existing estimators.

Contribution

It proposes a novel semiparametric truncation model with an unknown threshold for species richness estimation, including new estimators with proven asymptotic efficiency.

Findings

1. The proposed estimators are asymptotically efficient.

2. The model recovers Chao's lower bound estimator as a special case.

3. Simulation results show competitive performance compared to existing methods.

Abstract

We propose a truncation model for the abundance distribution in species richness estimation. This model is inherently semiparametric and incorporates an unknown truncation threshold between rare and abundant count observations. Using the conditional likelihood, we derive a class of estimators for the parameters in the model by stepwise maximisation. The species richness estimator is given by the integer maximising the binomial likelihood when all other parameters in the model are known. Under regularity conditions, we show that the estimators of the model parameters are asymptotically efficient. We recover Chao's lower bound estimator of species richness when the model is a single-component Poisson model, so Chao's estimator is an element of our class of estimators. In a simulation study, we show the performance of the proposed method and compare it to several others.

Tables

Table 1: Performance of $\widehat{N}_{\widehat{\tau}}$ for single Poisson distributions. Inf and Sup are given in percentage (%).

		$\theta = 0.6$				$\theta = 1$				$\theta = 1.5$
$q$	$N$	Mean	$\mathrm{Se}/N$	Inf (%)	Sup (%)	Mean	$\mathrm{Se}/N$	Inf (%)	Sup (%)	Mean	$\mathrm{Se}/N$	Inf (%)	Sup (%)
$0.4$	$200$	$192$	$0.116$	$1.5$	$26.3$	$200$	$0.058$	$2.5$	$7.2$	$199$	$0.036$	$2.2$	$11.7$
	$1000$	$1005$	$0.043$	$2.9$	$3.5$	$1001$	$0.024$	$3.6$	$4.6$	$1000$	$0.014$	$3.1$	$4.2$
	$5000$	$5003$	$0.018$	$3.0$	$3.4$	$4999$	$0.011$	$3.0$	$6.6$	$5001$	$0.006$	$3.3$	$3.7$
	$10000$	$10002$	$0.013$	$3.5$	$4.4$	$10002$	$0.007$	$3.3$	$4.3$	$10002$	$0.005$	$3.4$	$4.6$
$0.6$	$200$	$199$	$0.133$	$1.8$	$11.7$	$199$	$0.073$	$3.1$	$9.1$	$198$	$0.042$	$2.0$	$12.7$
	$1000$	$1003$	$0.055$	$3.3$	$5.0$	$1001$	$0.030$	$3.9$	$4.1$	$1000$	$0.017$	$2.9$	$2.7$
	$5000$	$5003$	$0.023$	$4.1$	$3.5$	$5001$	$0.013$	$3.5$	$2.8$	$5000$	$0.008$	$2.7$	$4.3$
	$10000$	$10009$	$0.017$	$4.3$	$3.7$	$10005$	$0.009$	$4.0$	$3.9$	$9999$	$0.006$	$3.3$	$4.0$
$0.8$	$200$	$192$	$0.160$	$2.5$	$15.5$	$195$	$0.079$	$1.5$	$13.3$	$196$	$0.048$	$1.1$	$17.0$
	$1000$	$1005$	$0.063$	$4.2$	$5.0$	$1002$	$0.034$	$3.7$	$4.7$	$999$	$0.021$	$3.5$	$6.5$
	$5000$	$5017$	$0.027$	$5.2$	$3.6$	$5000$	$0.015$	$3.9$	$4.0$	$4997$	$0.009$	$3.1$	$4.1$
	$10000$	$10001$	$0.019$	$2.9$	$4.6$	$9999$	$0.011$	$3.3$	$4.6$	$9998$	$0.006$	$3.2$	$4.4$

Table 2: Performance of $\widehat{N}_{\widehat{\tau}}$ for Gamma-Poisson mixtures ($p=0.8$). Inf and Sup are given in percentage (%).

		$r = 0.5$				$r = 1$				$r = 2$
$q$	$N$	Mean	$\mathrm{Se}/N$	Inf (%)	Sup (%)	Mean	$\mathrm{Se}/N$	Inf (%)	Sup (%)	Mean	$\mathrm{Se}/N$	Inf (%)	Sup (%)
$0.4$	$10{,}000$	$9{,}854$	$0.038$	$1.3$	$10.4$	$9{,}955$	$0.012$	$2.2$	$14.6$	$9{,}998$	$0.003$	$2.9$	$9.0$
	$20{,}000$	$19{,}413$	$0.020$	$0.0$	$24.0$	$19{,}867$	$0.008$	$1.1$	$25.0$	$19{,}981$	$0.002$	$1.3$	$15.5$
	$50{,}000$	$48{,}359$	$0.013$	$0.0$	$64.0$	$49{,}561$	$0.005$	$0.0$	$50.0$	$49{,}933$	$0.001$	$0.0$	$39.0$
$0.6$	$10{,}000$	$9{,}618$	$0.042$	$0.4$	$25.0$	$9{,}883$	$0.015$	$0.8$	$16.3$	$9{,}986$	$0.003$	$1.3$	$13.3$
	$20{,}000$	$19{,}222$	$0.035$	$0.4$	$35.4$	$19{,}823$	$0.011$	$1.1$	$27.0$	$19{,}964$	$0.002$	$1.3$	$27.9$
	$50{,}000$	$47{,}792$	$0.018$	$0.0$	$71.0$	$49{,}319$	$0.005$	$0.0$	$72.0$	$49{,}885$	$0.002$	$0.4$	$53.2$
$0.8$	$10{,}000$	$9{,}561$	$0.053$	$0.7$	$23.1$	$9{,}843$	$0.016$	$0.3$	$27.0$	$9{,}973$	$0.004$	$0.7$	$23.3$
	$20{,}000$	$18{,}770$	$0.031$	$0.0$	$49.0$	$19{,}623$	$0.011$	$0.1$	$50.7$	$19{,}968$	$0.003$	$3.2$	$21.4$
	$50{,}000$	$46{,}812$	$0.019$	$0.0$	$86.0$	$49{,}128$	$0.006$	$0.0$	$76.0$	$49{,}816$	$0.002$	$0.0$	$80.0$

Table 3: Comparison of $\widehat{N}_{\widehat{\tau}}$ with five other estimators of $N$ using $1000$ Monte Carlo samples. $\widehat{N}_{Ch_{0}}$: Chao's estimator as a lower bound on $N$ (Chao, 1984); $\widehat{N}_{CL}$: the coverage-based estimator of $N$ by Chao and Lee (Chao & Lee, 1992); $\widehat{N}_{CB}$: estimator of $N$ using the expected proportion of duplicate species in the sample (Chao & Bunge, 2002); $\widehat{N}_{WL_{0}}$: nonparametric MLE of $N$ using a penalized likelihood (Wang & Lindsay, 2005); and $\widehat{N}_{LB}$: an extension of Chao's estimator proposed by Lanumteang and Böhning (Lanumteang & Böhning, 2011).

		$\theta = 0.6$			$\theta = 1$			$\theta = 1.5$
$q$	Est	Mean	rMAE	rMSE	Mean	rMAE	rMSE	Mean	rMAE	rMSE
$0.4$	${\hat{N}}_{\hat{τ}}$	$1005$	$0.034$	$0.185$	$1001$	$0.019$	$0.058$	$1000$	$0.011$	$0.019$
	${\hat{N}}_{C h_{0}}$	$1010$	$0.045$	$0.341$	$1002$	$0.026$	$0.108$	$1001$	$0.016$	$0.041$
	${\hat{N}}_{C L}$	$1015$	$0.040$	$0.286$	$1007$	$0.023$	$0.084$	$1004$	$0.013$	$0.029$
	${\hat{N}}_{C B}$	$1054$	$0.132$	$23.651$	$1004$	$0.035$	$0.227$	$1002$	$0.018$	$0.051$
	${\hat{N}}_{W L_{0}}$	$1041$	$0.058$	$0.731$	$1024$	$0.035$	$0.292$	$1017$	$0.023$	$0.146$
	${\hat{N}}_{L B}$	$1026$	$0.092$	$1.717$	$1022$	$0.046$	$0.482$	$1014$	$0.028$	$0.162$
$0.6$	${\hat{N}}_{\hat{τ}}$	$1003$	$0.043$	$0.298$	$1001$	$0.024$	$0.088$	$1000$	$0.014$	$0.029$
	${\hat{N}}_{C h_{0}}$	$1007$	$0.056$	$0.522$	$1003$	$0.031$	$0.160$	$1002$	$0.020$	$0.065$
	${\hat{N}}_{C L}$	$1015$	$0.051$	$0.434$	$1008$	$0.027$	$0.125$	$1005$	$0.017$	$0.045$
	${\hat{N}}_{C B}$	$1037$	$0.119$	$3.956$	$1005$	$0.043$	$0.315$	$1002$	$0.022$	$0.080$
	${\hat{N}}_{W L_{0}}$	$1044$	$0.072$	$1.122$	$1034$	$0.047$	$0.510$	$1025$	$0.032$	$0.288$
	${\hat{N}}_{L B}$	$1045$	$0.113$	$2.789$	$1031$	$0.057$	$0.704$	$1018$	$0.034$	$0.250$
$0.8$	${\hat{N}}_{\hat{τ}}$	$1005$	$0.051$	$0.401$	$1002$	$0.027$	$0.118$	$999$	$0.017$	$0.045$
	${\hat{N}}_{C h_{0}}$	$1009$	$0.062$	$0.621$	$1006$	$0.037$	$0.218$	$1003$	$0.023$	$0.088$
	${\hat{N}}_{C L}$	$1020$	$0.058$	$0.553$	$1011$	$0.032$	$0.169$	$1006$	$0.020$	$0.065$
	${\hat{N}}_{C B}$	$1038$	$0.128$	$3.719$	$1007$	$0.051$	$0.433$	$1003$	$0.026$	$0.111$
	${\hat{N}}_{W L_{0}}$	$1062$	$0.088$	$1.550$	$1046$	$0.060$	$0.835$	$1031$	$0.040$	$0.405$
	${\hat{N}}_{L B}$	$1059$	$0.126$	$3.452$	$1041$	$0.069$	$1.054$	$1019$	$0.038$	$0.294$

Equations

$$f_{\nu}(x)=\int\frac{\lambda^{x}e^{-\lambda}}{x!}\,d\nu(\lambda),\quad\text{for }x\in\mathbb{N}.$$

$$f_{\nu}^{+}(x)=\frac{f_{\nu}(x)}{1-f_{\nu}(0)},\quad\text{for }x\in\mathbb{N}_{+}.$$

$$\widehat{N}_{\mathrm{abundant}}=\sum_{i=1}^{D}\mathbf{1}\{X_{i}^{+}>\tau\}.$$

$$\widehat{N}=\widehat{N}_{\mathrm{rare}}+\widehat{N}_{\mathrm{abundant}}.$$

$$X_{i}=\sum_{j=1}^{m}\mathbf{1}\{Y_{j}=i\}\sim\mathrm{binomial}(m,p_{i}).$$

$$\mathcal{P}=\left\{f_{(q,\theta,F)}(x)=qR_{\theta}(x)+(1-q)F(x)\right\}$$

$$\mathcal{P}^{+}=\left\{f_{(q,\theta,F)}^{+}(x)=\frac{f_{(q,\theta,F)}(x)}{1-qR_{\theta}(0)},\ f_{(q,\theta,F)}\in\mathcal{P}\right\}.$$

$$L(N,f\mid(n_{x})_{x\geq 1})=\frac{N!}{(N-D)!\prod_{x}n_{x}!}\,f(0)^{N-D}\prod_{x\geq 1}f(x)^{n_{x}},$$

$$L_{b}(N\mid D,q,\theta)=\frac{N!}{D!\,(N-D)!}\,[qR_{\theta}(0)]^{N-D}\,[1-qR_{\theta}(0)]^{D},$$

$$L^{+}(q,\theta,F\mid(n_{x})_{x\geq 1})=\frac{D!}{\prod_{x\geq 1}n_{x}!}\prod_{x\geq 1}\Big[\underbrace{\frac{qR_{\theta}(x)+(1-q)F(x)}{1-qR_{\theta}(0)}}_{f^{+}(x)}\Big]^{n_{x}},$$

$$\widehat{N}(q,\theta)=\frac{D}{1-qR_{\theta}(0)}.$$

$$\widehat{F}(q,\theta)(x)=\frac{1-q\sum_{k=0}^{\tau}R_{\theta}(k)}{(1-q)(D-D_{\tau})}\,n_{x}-\frac{q}{1-q}R_{\theta}(x),$$

$$\widehat{q}(\theta)=\frac{1}{R_{\theta}(0)+\frac{D}{D_{\tau}}\sum_{k=1}^{\tau}R_{\theta}(k)}.$$

$$S_{\theta}^{\tau}(x)=\frac{R_{\theta}(x)}{\sum_{k=1}^{\tau}R_{\theta}(k)}\quad\text{for }1\leq x\leq\tau.$$

$$\prod_{x=1}^{\tau}\{S_{\theta}^{\tau}(x)\}^{n_{x}}.$$

$$\widehat{N}_{\mathrm{rare}}=\frac{D_{\tau}}{1-R_{\widehat{\theta}}(0)}.$$

$$\widehat{N}_{\mathrm{abundant}}=D-D_{\tau}.$$

$$\widehat{N}_{\mathrm{classical}}$$

$$\widehat{N}_{\mathrm{Chao}}=D+\frac{n_{1}^{2}}{2n_{2}}.$$

$$\widehat{\tau}=\arg\min_{\tau}\left(\mathrm{bias}_{\tau}+\mathrm{var}_{\tau}\right).$$

$$\mathrm{var}_{\tau}=\frac{1}{M}\sum_{j=1}^{M}(P_{\tau,j}-P_{\tau})^{2}.$$

$$\mathrm{bias}_{\tau}=\max_{\tau^{\prime}\leq\tau}\left[(P_{\tau^{\prime}}-P_{\tau})^{2}-\mathrm{var}_{\tau^{\prime}}\right]_{+},$$

$$\sqrt{D}\,(T_{D}-(q,\theta))=\frac{1}{\sqrt{D}}\sum_{i=1}^{D}I_{(q,\theta)}^{-1}\,\ell_{(q,\theta)}(X_{i}^{+})+o_{P}(1).$$

$$\bar{X}^{\tau}-\sum_{x=1}^{\tau}x\,S_{\theta}^{\tau}(x)=0,\quad\text{with }\bar{X}^{\tau}=\frac{1}{D_{\tau}}\sum_{i=1}^{D_{\tau}}x_{i}$$

$$\mathrm{Inf}=\frac{1}{1000}\sum_{j=1}^{1000}\mathbf{1}_{[N<N_{\mathrm{inf}}^{(j)}]}$$

$$\mathrm{Sup}=\frac{1}{1000}\sum_{j=1}^{1000}\mathbf{1}_{[N>N_{\mathrm{sup}}^{(j)}]},$$

$$N=\frac{E[D]}{1-qR_{\theta}(0)}=\frac{E^{\gamma}[D]}{1-qR_{\theta^{\gamma}}(0)}.$$

$$E^{\gamma}[D]_{\tau}:=\frac{1-\widehat{q}_{\tau}R_{\widehat{\theta}_{\tau}^{\gamma}}(0)}{1-\widehat{q}_{\tau}R_{\widehat{\theta}_{\tau}}(0)}\,D.$$

$$\widehat{N}_{\mathrm{classical}}=D_{a}+\frac{D_{\tau}}{1-R_{\widehat{\theta}_{\tau}}(0)}=D+\frac{D_{\tau}R_{\widehat{\theta}_{\tau}}(0)}{1-R_{\widehat{\theta}_{\tau}}(0)}.$$

$$\widehat{N}_{\tau}$$

Taxonomy

Topics: Census and Population Estimation · Data-Driven Disease Surveillance · Bayesian Methods and Mixture Models

Full text

François Koladjo$^{1}$, Mesrob I. Ohannessian$^{2}$, Élisabeth Gassiat$^{3}$

Corresponding author. Email address: [email protected]

$^{1}$Université de Parakou, ENSPD BP 55 Tchaourou, Bénin

$^{2}$Toyota Technological Institute at Chicago

$^{3}$Laboratoire de Mathématiques d’Orsay, Univ. Paris-Sud, CNRS, Université Paris-Saclay, 91405 Orsay, France.

1 Introduction

We consider the “species richness” problem, also known as the problem of estimating the number of species, which arises when a sample of individuals is taken from a population with $N$ classes or species. The usual data set is a series of observed counts $X^{+}_{1},\dots,X^{+}_{D}$, where $D\leq N$ is the total number of distinct species observed in the sample and $N$ is the parameter to be estimated. Estimating $N$ from such abundance data is an old problem that has been tackled in several ways, both by parametric models, including Bayesian models (Bunge & Barger, 2008; Barger & Bunge, 2008), and by nonparametric models (Wang, 2010). Because of their flexibility in accounting for heterogeneity, nonparametric approaches have been the ones predominantly considered over the last two decades. They include, among others, the Chao-type estimators developed by Chao and collaborators (see for example Chao & Lee, 1992; Chao & Bunge, 2002; Chao & Jost, 2012) and likelihood-based nonparametric estimators such as those of Norris & Pollock (1996, 1998).

Many of these methods, although theoretically founded on a single model, perform the common practice of truncating the data into abundant and rare species. One then assumes that the number of abundant species is adequately represented by the number of distinct such species, whereas the same number leads to an underestimate for the rare species and thus necessitates a correction. Such truncation is generally justified on the basis of avoiding instability. This, however, forces even initially nonparametric models to become effectively parametric, while losing the original hypothesis and the accompanying theoretical guarantees. This motivated us to study this heuristic in a more rigorous light. In particular, we make the following contributions:

• We give an explicit semiparametric model to represent this truncation practice, where the abundant species are represented by an arbitrary abundance distribution whose support is offset away from the rare range. We partially motivate this model by arguing that the commonly used Poisson mixtures are inappropriate for modeling the more abundant species.

• We show that the practice of pure truncation as described above is justified only when the abundant and rare species have abundance distributions whose supports are disjoint. In this case truncation leads to an efficient estimation of the number of species.

• In general, although pure truncation is not efficient, accounting for the support overlap leads to a hybrid truncation, a semiparametric procedure that is efficient. We show this by using standard single-parameter families to derive a local minimax bound and a matching (asymptotically) efficient estimator. Coincidentally, we show that this framework recovers several previously suggested estimators as special cases.

• When the abundance threshold is not known, neither pure truncation nor the hybrid approach can be used directly. For this reason, the proper offset should be obtained from data. We present a model selection approach to resolve this problem. Our experiments show that this approach adapts to the true unknown offset, in the sense that the resulting estimator achieves (almost) the same asymptotic performance as an estimator that knows the offset.

• We illustrate this estimator on both synthetic and real data, showing that our more refined analysis leads to practical improvement.

2 Model and Estimator

2.1 Problem Statement

Assume that $N$ species exist in nature and that species $i$ is represented by $X_{i}$ individuals in a sample, for $i=1,\cdots,N$. We call $X_{i}$ the abundance of species $i$ in the sample. A classical statistical model assumes that the abundances are independent and identically distributed according to a distribution $f_{\nu}(x)$, $x\in{\mathbb{N}}$, where $\nu$ indexes a class of abundance distributions. One of the more common choices is the class of Poisson mixtures indexed by a mixing distribution $\nu$ on $\mathbb{R}_{+}$:

$$f_{\nu}(x)=\int\frac{\lambda^{x}e^{-\lambda}}{x!}\,d\nu(\lambda),\quad\text{for }x\in\mathbb{N}.\tag{1}$$

Of course, we do not get to access non-observed species, i.e. species for which $X_{i}=0$ . If we let $D$ denote the number of distinct observed species, i.e. $D=\sum_{i=1}^{N}\mathbf{1}\{X_{i}>0\}$ , and if we re-index and relabel those species as $X^{+}_{1},\cdots,X^{+}_{D}$ , then it is easy to show that these observed abundances are independent and identically distributed according to the zero-truncated distribution:

$$f_{\nu}^{+}(x)=\frac{f_{\nu}(x)}{1-f_{\nu}(0)},\quad\text{for }x\in\mathbb{N}_{+}.\tag{2}$$

The central problem of this paper is that of estimating the number of species $N$ with a functional $\widehat{N}(X^{+}_{1},\cdots,X^{+}_{D})$ of the abundances of the observed species. In other words, $\widehat{N}$ needs to complement the number of observed species $D$ with an estimate of the number of non-observed species.
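The sampling scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mixing distribution $\nu$ is taken to be a two-point distribution (the rates `0.5` and `3.0` and their weights are arbitrary choices, not values from the paper), and the helper names `sample_poisson` and `simulate_abundances` are hypothetical.

```python
import math
import random
from collections import Counter

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def simulate_abundances(n_species, rates, weights, rng):
    """Draw X_1..X_N i.i.d. from a finite Poisson mixture: nu puts mass weights[j] at rates[j]."""
    lams = rng.choices(rates, weights=weights, k=n_species)
    return [sample_poisson(lam, rng) for lam in lams]

rng = random.Random(0)
N = 1000                                  # true number of species (unknown in practice)
rates, weights = [0.5, 3.0], [0.7, 0.3]   # illustrative two-point mixing distribution
X = simulate_abundances(N, rates, weights, rng)

observed = [x for x in X if x > 0]        # X_1^+, ..., X_D^+: all the analyst gets to see
D = len(observed)                         # number of distinct observed species
n = Counter(observed)                     # frequency counts n_x = #{i : X_i^+ = x}

f0 = sum(w * math.exp(-lam) for lam, w in zip(rates, weights))  # f_nu(0)
print(f"D = {D}, expected N * (1 - f_nu(0)) = {N * (1 - f0):.1f}")
```

The unseen species are exactly those with $X_{i}=0$, so $D$ concentrates around $N(1-f_{\nu}(0))$; any estimator $\widehat{N}$ has to recover the missing factor from `observed` alone.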

As outlined in the introduction, a long line of research has addressed this problem. We focus here in particular on a sequence of influential papers (Chao & Lee, 1992; Chao & Yang, 1993), the methodology of which continues to be used in more recent papers such as Chao & Bunge (2002) and Wang & Lindsay (2005). In theory, these results fall within the current framework but, in practice, the estimation is done as follows. The data are divided into rare and abundant components according to an abundance threshold $\tau$. Although the estimators are derived and analyzed under the general model, they are fed with only those abundances such that $X^{+}_{i}\leq\tau$, to yield an estimator of the number of rare species $\widehat{N}_{\mathrm{rare}}$. For the abundant species, they use the trivial estimator:

$$\widehat{N}_{\mathrm{abundant}}=\sum_{i=1}^{D}\mathbf{1}\{X_{i}^{+}>\tau\}.$$

The estimate for the total number of species is then simply the sum of both:

$$\widehat{N}=\widehat{N}_{\mathrm{rare}}+\widehat{N}_{\mathrm{abundant}}.$$

What is the justification behind such truncation? This paper strives to answer this question and to give a more principled model of this common practice, thus leading to a more transparent methodology.

2.2 Truncation Model

To motivate truncation, note that the justification often given in this line of work (Chao & Lee, 1992; Chao & Yang, 1993; Chao & Bunge, 2002; Wang & Lindsay, 2005) is that including the abundant species in the estimator may cause instabilities. We interpret this as an indication that the abundance sampling model, and in particular the Poisson mixture model, is not a good model for abundant species. In this section, we first give some informal insight as to why this may be the case. We then proceed to present an explicit model to handle this rare-abundant dichotomy.

The abundance model can be traced back to a simple sampling model where individuals are drawn independently and identically (with replacement) from a population, where the frequency of species $i$ is $p_{i}$ . If $m$ such individuals are drawn, let $Y_{1},\cdots,Y_{m}$ denote their species. In this model, the abundance of species $i$ therefore has a binomial distribution with parameters $m$ and $p_{i}$ :

$$X_{i}=\sum_{j=1}^{m}\mathbf{1}\{Y_{j}=i\}\sim\mathrm{binomial}(m,p_{i}).$$

If the species are not labeled a priori, which corresponds to a random permutation among the $N$ species, then the distribution of a particular abundance becomes a mixture of binomial distributions, with mixture weights at $(m,p_{i})_{i=1,\cdots,N}$ . Note that these abundances are not independent as in the abundance model, but are rather exchangeable. Notwithstanding this fact, we can see that the abundance model of Equation (1) effectively replaces this binomial mixture with a Poisson mixture, which cannot be accurate for abundant species.

The instability stems from the fact that a Poisson distribution with a large mean places much more mass near $0$ than the corresponding binomial. More precisely, if the model substitutes a binomial mixture with a Poisson mixture, then when an estimator places a mixing mass at a higher abundance, it contributes more to $f_{\nu}(0)$ than a binomial would. This is then interpreted as evidence of more unseen species than there really are, and thus $N$ is overestimated. This is indeed what is observed with such estimators: with larger values of the truncation $\tau$ , the estimate of $N$ tends to increase (see for example the last three columns of Table 2, page 949, and the last two columns of Table 13, page 956, in Wang & Lindsay, 2005). That said, simply truncating the data is not a theoretically sound approach since the resulting samples no longer follow the hypothesized model. For example, Poisson distributions place a positive mass, even if small, beyond any threshold. There is therefore a need to rigorously model rare species, say with mixtures of Poisson distributions, while capturing the possibility that there may be abundant species that have much less influence on our inference about the rare species.
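The gap between the two zero masses is easy to see numerically. The sketch below compares $(1-p)^{m}$, the probability that a binomial$(m,p)$ abundance is zero, with $e^{-mp}$, the zero mass of the Poisson with the same mean; the sample size `m = 50` and the frequencies `p` are arbitrary illustrative values.

```python
import math

def zero_mass_binomial(m, p):
    """P(X = 0) when X ~ binomial(m, p)."""
    return (1.0 - p) ** m

def zero_mass_poisson(m, p):
    """P(X = 0) when X ~ Poisson with the matching mean m * p."""
    return math.exp(-m * p)

m = 50
ratios = []
for p in (0.01, 0.05, 0.2, 0.5):
    b = zero_mass_binomial(m, p)
    po = zero_mass_poisson(m, p)
    ratios.append(po / b)
    print(f"p={p:<5} mean={m * p:>5.1f}  binomial={b:.3e}  poisson={po:.3e}  ratio={po / b:.3g}")
```

Since $\log(1-p)\leq -p$, the Poisson zero mass is never smaller than the binomial one, and the ratio grows with the species frequency $p$, which is consistent with the claim that the mismatch matters mostly for abundant species.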

In this paper, we propose the following semiparametric alternative. Let $\tau\in\mathbb{N}_{+}$ and let $\mathcal{F}_{\tau}$ be the family of discrete distributions supported on $\{\tau+1,\tau+2,...\}$ . We assume that abundances follow a distribution $f$ that belongs to a model $\mathcal{P}$ , as follows. For $\alpha>0$ define:

$$\mathcal{P}=\left\{f_{(q,\theta,F)}(x)=qR_{\theta}(x)+(1-q)F(x)\right\}$$

with $q\in[\alpha,1)$ , where $R_{\theta}$ is a parametric model that represents mostly rare species (e.g. we may think of a finite mixture of Poisson distributions, and more generally we ask for $\theta\in\varTheta$ , where $\varTheta$ is an appropriate subset of $\mathbb{R}^{k}$ for some $k\in\mathbb{N}_{+}$ ), and where $F\in\mathcal{F}_{\tau}$ is a nonparametric component that represents abundant species.

Using Equation (2) and the fact that the nonparametric component vanishes at $x=0$ , this model induces the zero-truncated version as follows:

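$$\mathcal{P}^{+}=\left\{f_{(q,\theta,F)}^{+}(x)=\frac{f_{(q,\theta,F)}(x)}{1-qR_{\theta}(0)},\ f_{(q,\theta,F)}\in\mathcal{P}\right\}.\tag{5}$$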

We leave the choice of $R_{\theta}$ open, except for certain identifiability and smoothness assumptions that we later spell out in detail. Thus $R_{\theta}$ is not necessarily a Poisson mixture. The choice of a parametric model for $R_{\theta}$ is justified by the fact that even originally nonparametric models are effectively reduced to parametric classes under the constraint of identifiability from a small (truncated) support.

It is now clear that our model in Equation (5) makes explicit the notion that rare and abundant species may coexist. This allows us to bypass heuristics and suggest estimators with provable performance guarantees. In particular, we may harness the basic theory of semiparametric models to establish the efficiency of likelihood-based estimators, and suggest a potential model selection mechanism for the choice of $\tau$ . Furthermore, as we make no further assumptions beyond adopting a parametric form for the rare component and dislocating the support of the abundant species away from zero, we have a model that can go beyond a simple justification of truncation. For example, one may think of $F$ as a nonparametric corruption of the data, rather than a legitimate measurement of abundant species, and our analysis and methodology still go through unaffected.

2.3 Estimator of the Number of Species

The estimator that we propose for $N$ falls under the category of maximum likelihood (MLE)-type M-estimators. In this section we derive and define the estimator, and in the next section we study some of its asymptotic properties.

Let $n_{x}=\sum_{i=1}^{D}\mathbf{1}\{X_{i}^{+}=x\},$ $x\geq 1$ , be the empirical counts of the abundances, which are sufficient statistics for computing likelihoods. The combined likelihood of $N$ and the rest of the model parameters given the samples can then be written as follows:

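$$L(N,f\mid(n_{x})_{x\geq 1})=\frac{N!}{(N-D)!\prod_{x}n_{x}!}\,f(0)^{N-D}\prod_{x\geq 1}f(x)^{n_{x}}.\tag{6}$$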

It is interesting to note that this likelihood may be decomposed into the product of two likelihoods. The first is the likelihood of $N$ given the rest of the model parameters and the number of distinct samples $D$ . This has a binomial form and we denote it by $L_{b}$ . The second is the likelihood of the rest of the model parameters, given the samples, and we denote it by $L^{+}$ . After substituting $f(x)$ by its expression in $\mathcal{P}$ , in particular letting $f(0)=qR_{\theta}(0)$ , and noting that $D=\sum_{x}n_{x}$ , we can write these likelihoods respectively as follows:

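$$L_{b}(N\mid D,q,\theta)=\frac{N!}{D!\,(N-D)!}\,[qR_{\theta}(0)]^{N-D}\,[1-qR_{\theta}(0)]^{D},\tag{7}$$

and

$$L^{+}(q,\theta,F\mid(n_{x})_{x\geq 1})=\frac{D!}{\prod_{x\geq 1}n_{x}!}\prod_{x\geq 1}\Big[\underbrace{\frac{qR_{\theta}(x)+(1-q)F(x)}{1-qR_{\theta}(0)}}_{f^{+}(x)}\Big]^{n_{x}},\tag{8}$$

suggesting two methods to undertake the maximum likelihood estimation from $L$ . Note that some of the earliest works to suggest such a decomposition were those of Sanathanan (1972, 1977); see also Mao & Lindsay (2003, 2007) for a more recent treatment.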
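The factorisation $L=L_{b}\times L^{+}$ can be checked numerically on a toy example. In the sketch below the distribution `f`, the counts `n`, and the value `N = 40` are made up for illustration, and `f` is treated as a generic probability mass function with `f[0]` playing the role of $qR_{\theta}(0)$; the function names are hypothetical.

```python
import math

def log_full(N, f, n):
    """Log of the combined likelihood L(N, f | (n_x))."""
    D = sum(n.values())
    out = math.lgamma(N + 1) - math.lgamma(N - D + 1)
    out -= sum(math.lgamma(c + 1) for c in n.values())
    out += (N - D) * math.log(f[0])
    out += sum(c * math.log(f[x]) for x, c in n.items())
    return out

def log_binomial(N, D, f0):
    """Log of L_b: binomial likelihood of N given D, with unseen probability f0."""
    return (math.lgamma(N + 1) - math.lgamma(D + 1) - math.lgamma(N - D + 1)
            + (N - D) * math.log(f0) + D * math.log(1.0 - f0))

def log_conditional(f, n):
    """Log of L^+: multinomial likelihood of the counts under the zero-truncated f."""
    D = sum(n.values())
    out = math.lgamma(D + 1) - sum(math.lgamma(c + 1) for c in n.values())
    out += sum(c * math.log(f[x] / (1.0 - f[0])) for x, c in n.items())
    return out

f = {0: 0.30, 1: 0.25, 2: 0.20, 3: 0.15, 4: 0.10}   # toy abundance distribution
n = {1: 12, 2: 7, 3: 4, 4: 2}                       # toy frequency counts, D = 25
N = 40

lhs = log_full(N, f, n)
rhs = log_binomial(N, sum(n.values()), f[0]) + log_conditional(f, n)
print(lhs, rhs)
```

The two printed values agree up to floating-point error, because $\sum_{x}n_{x}=D$ lets the factor $[1-f(0)]^{D}$ move freely between the two terms.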

The first method is to maximize the likelihood $L$ directly over all of $(N,q,\theta,F).$ The estimator of $N$ obtained from this method is typically called the unconditional maximum likelihood estimator. For example, some nonparametric models with unconditional estimation methods are proposed in Norris & Pollock (1996) and Böhning & Schön (2005). The second method to obtain a maximum likelihood estimator of $N$ is to first maximize the likelihood $L^{+}$ from the zero-truncated model $\mathcal{P}^{+}$ to derive the estimators of $q,$ $\theta$ and $F,$ and then to maximize the binomial likelihood $L_{b}$ in the parameter $N$ , treating $q$ and $\theta$ as known. This method is known as the conditional maximum likelihood method for estimating $N$ .

We consider here only the conditional maximum likelihood method. Before we proceed with the estimation of $\theta$ , $q$ , and $F$ , note that maximizing the binomial likelihood $L_{b}$ in Equation (7) gives us the form of our estimator:

$$\widehat{N}(q,\theta)=\frac{D}{1-qR_{\theta}(0)}.\tag{9}$$

The final expression for the estimator therefore consists of estimating $\theta$ by $\widehat{\theta}$ and $q$ by $\widehat{q}$ , in a manner that we shortly outline, and then substituting in Equation (9) to obtain $\widehat{N}=\widehat{N}(\widehat{q},\widehat{\theta})$ . Of course, $N$ is an integer parameter, and we could then take the integer part of the resulting estimate. That said, in what follows we allow ourselves to accept non-integer estimates.
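To see how the real-valued expression relates to the integer maximiser of $L_{b}$, one can search over integers directly. The sketch below is an illustration only: `D = 137` and `p0 = 0.37` (standing for $qR_{\theta}(0)$) are arbitrary values, and the search range is a hypothetical choice that comfortably contains the maximiser.

```python
import math

def log_lb(N, D, p0):
    """log L_b(N | D) where p0 = q * R_theta(0) is the probability that a species is unseen."""
    return (math.lgamma(N + 1) - math.lgamma(D + 1) - math.lgamma(N - D + 1)
            + (N - D) * math.log(p0) + D * math.log(1.0 - p0))

def integer_mle(D, p0, search_factor=10):
    """Brute-force integer maximiser of L_b over N in {D, ..., search_factor * D}."""
    candidates = range(D, search_factor * D + 1)
    return max(candidates, key=lambda N: log_lb(N, D, p0))

D, p0 = 137, 0.37
closed_form = D / (1.0 - p0)      # real-valued expression D / (1 - q R_theta(0))
N_int = integer_mle(D, p0)
print(closed_form, N_int)
```

The likelihood ratio $L_{b}(N)/L_{b}(N-1)=N\,qR_{\theta}(0)/(N-D)$ is at least $1$ exactly when $N\leq D/(1-qR_{\theta}(0))$, so the integer maximiser is the integer part of the real-valued expression, which is what the search returns here.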

We now proceed to estimate the parameters. We observe first that since $F$ plays no role in the expression for $\widehat{N}$ , we can treat it as a nuisance parameter. The next observation is that to maximize $L^{+}$ , we can successively fix some parameters while we maximize over others. Because $F$ is mostly a nuisance parameter, we maximize the likelihood $L^{+}$ when $q$ and $\theta$ are fixed without further constraining $F$ to be a proper distribution. This approach gives us, at each support point $x$ , the following pseudo-estimator, as a function of $\theta$ and $q$ :

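$$\widehat{F}(q,\theta)(x)=\frac{1-q\sum_{k=0}^{\tau}R_{\theta}(k)}{(1-q)(D-D_{\tau})}\,n_{x}-\frac{q}{1-q}R_{\theta}(x),\tag{10}$$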

where $D_{\tau}=\sum_{x=1}^{\tau}n_{x}$ denotes the number of species with abundance no greater than $\tau$ .

The reason we call $\widehat{F}(q,\theta)$ a pseudo-estimator is that it may put negative mass at some of its support points as it is not constrained to be nonnegative. This occurs for example at the non-observed support points of $F$ , that is for a support point $x$ such that $n_{x}=0$ . Despite this fact, the estimators for $\theta$ and $q$ that follow from this choice of $\widehat{F}$ are not sensitive to its impropriety.

Replacing $F$ by its pseudo-estimator in the expression for $L^{+}$ leads to an objective function for $q$ and $\theta$ which may now be maximized in $q$ . This leads to an MLE-type estimator of $q$ , still as a function of $\theta$ :

$$\widehat{q}(\theta)=\frac{1}{R_{\theta}(0)+\frac{D}{D_{\tau}}\sum_{k=1}^{\tau}R_{\theta}(k)}.\tag{11}$$

Note that $\widehat{q}(\theta)$ is always non-negative. However, for particular values of $\theta$ , $D$ , and $D_{\tau}$ , it could be larger than $1$ . If this occurs in practice, we simply constrain it to $1$ to obtain a valid probability. The consistency result in the next section shows that this is not a concern, asymptotically.
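As a sanity check on the closed form for $\widehat{q}(\theta)$, the sketch below compares it with a grid search over a profile objective. Two assumptions are made here and are not taken from the source: $R_{\theta}$ is a pure Poisson distribution, and the profile objective $D_{\tau}\log q-D\log(1-qR_{\theta}(0))+(D-D_{\tau})\log\big(1-q\sum_{k=0}^{\tau}R_{\theta}(k)\big)$ is our own rewriting of $\log L^{+}$ after substituting the pseudo-estimator of $F$, with terms free of $q$ dropped. The numbers `theta`, `tau`, `D`, `d_tau` are arbitrary.

```python
import math

def poisson_pmf(x, theta):
    """Poisson(theta) probability mass at x, playing the role of R_theta(x)."""
    return math.exp(-theta + x * math.log(theta) - math.lgamma(x + 1))

def q_hat(theta, tau, D, d_tau):
    """Closed-form maximiser of the profile objective in q, as a function of theta."""
    r0 = poisson_pmf(0, theta)
    rare_mass = sum(poisson_pmf(k, theta) for k in range(1, tau + 1))
    return 1.0 / (r0 + (D / d_tau) * rare_mass)

def profile_loglik(q, theta, tau, D, d_tau):
    """Profile log-likelihood in q (up to q-free constants); assumed form, see lead-in."""
    r0 = poisson_pmf(0, theta)
    a = sum(poisson_pmf(k, theta) for k in range(0, tau + 1))
    return (d_tau * math.log(q) - D * math.log(1.0 - q * r0)
            + (D - d_tau) * math.log(1.0 - q * a))

theta, tau, D, d_tau = 1.0, 3, 100, 80
q_star = q_hat(theta, tau, D, d_tau)
grid = [i / 1000 for i in range(1, 1000)]
best_on_grid = max(grid, key=lambda q: profile_loglik(q, theta, tau, D, d_tau))
print(q_star, best_on_grid)
```

With these values the closed form falls strictly inside $(0,1)$ and the grid maximiser lands on the neighbouring grid point, so no clamping to $1$ is needed in this example.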

The last step is to find a proper estimator of $\theta$ . Consider the following simplifying notation. For a fixed $\tau$ , let $S_{\theta}^{\tau}$ denote the truncated version of the density $R_{\theta}$ defined as

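$$S_{\theta}^{\tau}(x)=\frac{R_{\theta}(x)}{\sum_{k=1}^{\tau}R_{\theta}(k)}\quad\text{for }1\leq x\leq\tau.\tag{12}$$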

By replacing $F$ and $q$ by their estimators in the conditional likelihood $L^{+}$ , we can show that we obtain (up to factors that do not depend on $\theta$ ) the following truncated likelihood:

$$\prod_{x=1}^{\tau}\{S_{\theta}^{\tau}(x)\}^{n_{x}}.\tag{13}$$

The estimator $\widehat{\theta}$ is then simply a maximizer of Equation (13). We can thus see that $\widehat{\theta}$ is an MLE of the truncated density $S_{\theta}^{\tau}$ , based on the first $\tau$ abundance counts. This completes our estimator construction. Indeed, to estimate $N$ , we first compute $\widehat{\theta}$ directly from the samples by maximizing Equation (13), we then calculate $\widehat{q}(\widehat{\theta})$ using Equation (11), and lastly we substitute both to obtain $\widehat{N}(\widehat{q}(\widehat{\theta}),\widehat{\theta})$ using Equation (9).

We conclude by noting that all the derivations we performed were based on the premise that a value of $\tau$ was given. As $\widehat{q}$ and $\widehat{\theta}$ depend on $\tau$ , in what follows either we make this explicit by writing $\widehat{q}_{\tau}$ and $\widehat{\theta}_{\tau}$ respectively or keep it implicit when the notation gets encumbered. Similarly we write $\widehat{N}_{\tau}$ . We also sometimes use the notation $\widehat{q}(\widehat{\theta})$ instead of $\widehat{q}$ to make it explicit that the estimator of $q$ depends on $\widehat{\theta}$ .
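The three steps of the construction can be sketched end to end for one concrete choice of the rare component. The code below assumes that $R_{\theta}$ is a pure Poisson distribution, in which case setting the derivative of the log of the truncated likelihood to zero reduces to matching the mean of $S_{\theta}^{\tau}$ to the empirical mean of the counts no greater than $\tau$; that equation is solved by bisection. The frequency counts and the function names are hypothetical.

```python
import math

def poisson_pmf(x, theta):
    """Poisson(theta) probability mass at x, used here as R_theta(x)."""
    return math.exp(-theta + x * math.log(theta) - math.lgamma(x + 1))

def truncated_mean(theta, tau):
    """Mean of S_theta^tau, the Poisson pmf renormalised on {1, ..., tau}."""
    w = [poisson_pmf(x, theta) for x in range(1, tau + 1)]
    return sum(x * wx for x, wx in zip(range(1, tau + 1), w)) / sum(w)

def fit_theta(counts, tau, lo=1e-8, hi=50.0, iters=200):
    """Step 1: maximise the truncated likelihood by solving its score equation."""
    d_tau = sum(counts.get(x, 0) for x in range(1, tau + 1))
    xbar = sum(x * counts.get(x, 0) for x in range(1, tau + 1)) / d_tau
    if not truncated_mean(lo, tau) < xbar < truncated_mean(hi, tau):
        raise ValueError("no interior maximiser in the search interval")
    for _ in range(iters):                      # truncated_mean is increasing in theta
        mid = 0.5 * (lo + hi)
        if truncated_mean(mid, tau) < xbar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def estimate_richness(counts, tau):
    """Return (theta_hat, q_hat, N_hat) for frequency counts {x: n_x} and threshold tau."""
    D = sum(counts.values())
    d_tau = sum(counts.get(x, 0) for x in range(1, tau + 1))
    theta = fit_theta(counts, tau)
    r0 = poisson_pmf(0, theta)
    rare_mass = sum(poisson_pmf(k, theta) for k in range(1, tau + 1))
    q = min(1.0, 1.0 / (r0 + (D / d_tau) * rare_mass))   # Step 2, constrained to 1 if needed
    n_hat = D / (1.0 - q * r0)                           # Step 3
    return theta, q, n_hat

counts = {1: 60, 2: 30, 3: 12, 5: 4, 9: 3, 20: 1}        # made-up frequency counts
theta_hat, q_hat, n_hat = estimate_richness(counts, tau=3)
print(theta_hat, q_hat, n_hat)
```

Only the counts $n_{1},\dots,n_{\tau}$ enter the estimation of $\theta$, while the abundant counts influence $\widehat{N}_{\tau}$ through $D$ and the ratio $D/D_{\tau}$ in $\widehat{q}$. A different parametric family for $R_{\theta}$ would need its own maximisation routine in place of `fit_theta`.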

2.4 Relationship to Other Estimators

Despite the fact that $\theta$ is estimated by truncating the model to the abundance values between $1$ and $\tau$ , our estimator differs from the traditional truncation with conditional MLE-type estimators often described in the literature, as overviewed in the introduction. To be precise, assume the same parametric rare-species model is used for $R_{\theta}$ , and recall that in these classical estimators the data is truncated and the conditional MLE is solved using the zero-truncated version of $R_{\theta}$ to obtain $\widehat{\theta}$ , and then the rare-species count is estimated by:

$$\widehat{N}_{\mathrm{rare}}=\frac{D_{\tau}}{1-R_{\widehat{\theta}}(0)}.$$

The abundant species are then assumed to be represented exactly by what is seen:

$$\widehat{N}_{\mathrm{abundant}}=D-D_{\tau}.$$

The combined estimator is therefore:

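$$\widehat{N}_{\mathrm{classical}}=\widehat{N}_{\mathrm{rare}}+\widehat{N}_{\mathrm{abundant}}=D-D_{\tau}+\frac{D_{\tau}}{1-R_{\widehat{\theta}}(0)}.$$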

The following proposition identifies the condition under which our estimator is equivalent to this classical estimator.

Proposition 1

If $R_{\theta}$ is supported on $\{0,\ldots,\tau\},$ then the two estimators $\widehat{N}_{\tau}$ and $\widehat{N}_{\mathrm{classical}}$ are equivalent.

Proposition 1 means that if the parametric part $R_{\theta}$ and the nuisance parameter in the model $\mathcal{P}^{+}$ are supported on disjoint sets, then one can split the data set into rare-species data $(X_{i}\leq\tau)$ and abundant-species data $(X_{i}>\tau)$ . In this context, inference on rare species is not affected by the estimation of the nuisance parameter $F$ , and thus throwing away high-abundance data is justified. On the other hand, when $R_{\theta}$ does extend over all integers, one should not ignore any part of the data, and should instead perform a hybrid truncation, as suggested by $\widehat{N}_{\tau}$ , in order to obtain efficient estimators.
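A short calculation, given here as our own sketch rather than as the paper's proof, indicates why Proposition 1 holds. Substituting Equation (11) into Equation (9) gives a closed form for $\widehat{N}_{\tau}$; when $R_{\theta}$ is supported on $\{0,\ldots,\tau\}$ we have $\sum_{k=1}^{\tau}R_{\theta}(k)=1-R_{\theta}(0)$, which is used at the step marked $(*)$. In that case $\widehat{q}\leq 1$ automatically since $D\geq D_{\tau}$, and $S_{\theta}^{\tau}$ coincides with the zero-truncated version of $R_{\theta}$, so both procedures use the same $\widehat{\theta}$.

```latex
\begin{aligned}
\widehat{N}_{\tau}
  &= \frac{D}{1-\widehat{q}(\widehat{\theta})\,R_{\widehat{\theta}}(0)}
   = D+\frac{D_{\tau}\,R_{\widehat{\theta}}(0)}{\sum_{k=1}^{\tau}R_{\widehat{\theta}}(k)} \\
  &\overset{(*)}{=} D+\frac{D_{\tau}\,R_{\widehat{\theta}}(0)}{1-R_{\widehat{\theta}}(0)}
   = (D-D_{\tau})+\frac{D_{\tau}}{1-R_{\widehat{\theta}}(0)}
   = \widehat{N}_{\mathrm{abundant}}+\widehat{N}_{\mathrm{rare}}
   = \widehat{N}_{\mathrm{classical}}.
\end{aligned}
```

When $R_{\theta}$ has mass beyond $\tau$, the denominator $\sum_{k=1}^{\tau}R_{\widehat{\theta}}(k)$ is strictly smaller than $1-R_{\widehat{\theta}}(0)$, and the two estimators generally differ.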

Thus far we have considered the general context of any eligible distribution $R_{\theta}$. Some particular cases enable us to make simple and concrete connections between $\widehat{N}_{\tau}$ and other popular estimators that come close to falling within our framework. In particular, Chao (1984) suggests the following popular estimator

$$\widehat{N}_{\mathrm{Chao}}=D+\frac{n_{1}^{2}}{2n_{2}}.$$

The following proposition shows that our estimator $\widehat{N}_{\tau}$ , for $\tau=2$ and $R_{\theta}$ corresponding to a pure Poisson distribution, is equivalent to Chao’s $\widehat{N}_{\mathrm{Chao}}$ . As such, we can interpret our estimator as a generalization of Chao’s, where $\tau$ is no longer restricted to $2$ and where $R_{\theta}$ may be more general than a pure Poisson distribution.

Proposition 2

Assume that $\tau=2$ and $R_{\theta}(x)=\theta^{x}e^{-\theta}/x!$ for all $x\geq 0$. Then $\widehat{N}_{\tau}=\widehat{N}_{\mathrm{Chao}}$.

It is worth noting that Zelterman (1988) explicitly considers the pure Poisson model with access only to the first two counts $n_{1}\neq 0$ and $n_{2}$, and suggests $\widehat{\theta}_{\mathrm{Zelterman}}=2n_{2}/n_{1}$ as an estimator for $\theta$, showing certain robustness properties under heterogeneity in the true model (see Zelterman (1988) for more details). It is indeed straightforward to verify that, for $\tau=2$ and a pure Poisson model for $R_{\theta}$, our estimator maximizing the truncated likelihood of Equation (13) coincides with that of Zelterman: $\widehat{\theta}_{\tau}=\widehat{\theta}_{\mathrm{Zelterman}}$.

An effective proof of Proposition 2 is given by Böhning et al. (2013) within an alternative framework, using the conditional expectation of $f_{0}$. In the Appendix, we propose a simpler proof that is more in line with our framework, obtained by plugging $\widehat{\theta}_{\mathrm{Zelterman}}$, which as noted is the correct conditional ML estimate, into our expression for $\widehat{N}_{\tau}$.

We end by noting that if the abundant species are not taken into account, we would have the estimator of $N$ given by $\widehat{N}_{\mathrm{Zelterman}}=D/[1-\exp(-\widehat{\theta}_{\mathrm{Zelterman}})]$. This estimator is in the spirit of pure truncation, and clearly deviates from $\widehat{N}_{\tau}$ (the denominator has no $q$ factor). Since we establish the latter to be consistent within our model, it follows that the former is not (see also Böhning & van der Heijden (2009) for a more quantitative comparison between Chao's and Zelterman's estimators of $N$).
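The $\tau=2$ pure Poisson case can be checked numerically with the short Python sketch below. It uses Chao's estimator in its usual form $D+n_{1}^{2}/(2n_{2})$, and it assumes that $\widehat{q}$ has the form implied by the limit of $D/D_{\tau}$ in the proof of Theorem 1, since Equation (11) is not reproduced here. The function names and counts are hypothetical.

```python
from math import exp

def theta_zelterman(n1, n2):
    """Zelterman's estimate of the Poisson parameter from the first two counts."""
    return 2.0 * n2 / n1

def n_chao(D, n1, n2):
    """Chao's lower-bound estimator."""
    return D + n1**2 / (2.0 * n2)

def n_hat_tau2_poisson(D, n1, n2):
    """Hybrid estimator for tau = 2 with a pure Poisson R_theta (assumed form of q_hat)."""
    theta = theta_zelterman(n1, n2)
    r0 = exp(-theta)               # R_theta(0)
    r1 = theta * r0                # R_theta(1)
    r2 = theta**2 * r0 / 2.0       # R_theta(2)
    d_tau = n1 + n2                # species with abundance 1 or 2
    q = d_tau / (D * (r1 + r2) + d_tau * r0)
    return D / (1.0 - q * r0)

def n_zelterman(D, n1, n2):
    """Pure-truncation estimator: no q factor in the denominator."""
    return D / (1.0 - exp(-theta_zelterman(n1, n2)))

D, n1, n2 = 120, 60, 30  # toy counts, not taken from the paper
print(n_chao(D, n1, n2), n_hat_tau2_poisson(D, n1, n2), n_zelterman(D, n1, n2))
```

For these toy counts the first two values agree up to rounding error, while the pure-truncation value differs, which is the distinction drawn in the text.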

2.5 Choice of $\tau$ via Model Selection

To end the discussion of our estimator of $N$ , we stress once again that $\widehat{N}_{\tau}$ depends on the integer truncation parameter $\tau$ , which delimits the zone of influence of the abundant species through the support of the nuisance parameter $F$ . When $\tau$ is not known, we need a procedure to estimate this parameter. This is effectively a model selection problem, which we now address using the Goldenshluger-Lepski (G-L) method as inspiration. The G-L method was introduced in Goldenshluger & Lepski, (2011) in the context of bandwidth selection for kernel density estimation. In the current paper we use it heuristically, without formal proofs. Experimental evidence, however, suggests that the method is very effective.

The principle of the method is as follows. As our estimator is of the form $\widehat{N}=D/(1-\widehat{P}(0))$ , we focus on the problem of estimating $P(0)=qR_{\theta}(0)$ . Let us drop the $(0)$ argument from the notation, to make the exposition clearer. Assume that we have a known upper bound $\tau_{\max}$ on the largest value $\tau$ could take, and let $\tau_{\min}$ be the least $\tau$ that enables the necessary identifiability assumptions. If we relax the requirement that $F$ is positive on its support, we have successively smaller nested models as $\tau$ varies from $\tau_{\min}$ to $\tau_{\max}$ . Each of these models has a corresponding version of our estimator, that we denote by $\widehat{P}_{\tau}$ . The (squared) bias of each model is $\mathsf{bias}_{\tau}=(\mathbf{E}[\widehat{P}_{\tau}]-P)^{2}$ . The variance of each model is $\mathsf{var}_{\tau}=\mathbf{E}[(\widehat{P}_{\tau}-\mathbf{E}[\widehat{P}_{\tau}])^{2}]$ . The mean squared error risk decomposes as usual into the sum of bias and variance, $\mathsf{risk}_{\tau}=\mathbf{E}[(\widehat{P}_{\tau}-P)^{2}]=\mathsf{bias}_{\tau}+\mathsf{var}_{\tau}$ . Now observe the following:

• For $\tau\leq\tau_{0}$, the consistency result of Theorem 1 tells us we are asymptotically unbiased.

• For $\tau=\tau_{0}$, Theorem 2 shows that we are efficient and therefore asymptotically we have the least variance.

• For $\tau<\tau_{0}$, the estimator becomes inefficient and the variance may be higher. Intuitively, this is because less of the data is used to estimate $\theta$ when the truncation is stricter.

• For $\tau>\tau_{0}$, Theorem 1 tells us that we may have a non-vanishing bias. However, the variance itself may be lower simply because more $F$-corrupted data is used to converge to an incorrect value of $\theta$.

The inevitable bias-variance tradeoff thus manifests itself in this framework, and the best compromise in terms of risk will be achieved at the correct model class $\tau_{0}$ . If accurate proxies $\widehat{\mathsf{bias}}_{\tau}$ and $\widehat{\mathsf{var}}_{\tau}$ are available, then we may empirically select a model $\widehat{\tau}$ near $\tau_{0}$ , by minimizing

[TABLE]

The bootstrap method is one effective way of estimating $\mathsf{var}_{\tau}$. In its simplest version, the bootstrap consists of resampling $D$ points from the data and computing an estimator $\widetilde{P}_{\tau}$ from the resampled data. This is then repeated a number of times, say $j=1,\cdots,M$, and the variance is estimated as:

[TABLE]

While the resampling process of the bootstrap is good at quantifying the relative (to $\widehat{P}_{\tau}$ ) variability of the resampled estimators, it offers no absolute reference point, crucial for estimating the bias. Luckily, as we have argued, the larger model classes have small bias and can themselves be used as a reference point. The Goldenshluger-Lepski method suggests the following method to obtain a bias proxy:

[TABLE]

where $[\cdot]_{+}$ stands for the non-negative part. The justification and behavior of this bias proxy need to be rigorously established, as is done for kernel width selection in Goldenshluger & Lepski (2011). For our heuristic use, we simply provide the intuition behind it. This formula can be interpreted by noticing that the maximum of $(\mathbf{E}[\widehat{P}_{\tau^{\prime}}]-\mathbf{E}[\widehat{P}_{\tau}])^{2}$ over $\tau^{\prime}\leq\tau$ is indeed approximately the bias since, as we described, the models with smaller $\tau^{\prime}$ are (asymptotically) unbiased. But because we only have access to $(\widehat{P}_{\tau^{\prime}}-\widehat{P}_{\tau})^{2}$ instead of $(\mathbf{E}[\widehat{P}_{\tau^{\prime}}]-\mathbf{E}[\widehat{P}_{\tau}])^{2}$, and since the models with smaller $\tau^{\prime}$ have higher variance, we place a conservative confidence bound on the $\tau^{\prime}$ end using $\widehat{\mathsf{var}}_{\tau^{\prime}}$ in order not to overestimate the bias.

Equations (14) (the selection of $\widehat{\tau}$ ), (15) (the bootstrap variance proxy), and (16) (the bias proxy) completely specify a heuristic model selection procedure for estimating the integer truncation parameter $\tau$ .
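The selection procedure can be sketched in Python as follows. Since Equations (14) to (16) are not reproduced here, the sketch assumes the forms suggested by the surrounding description: a bootstrap variance centered at the original estimate, the bias proxy $\max_{\tau^{\prime}\leq\tau}[(\widehat{P}_{\tau^{\prime}}-\widehat{P}_{\tau})^{2}-\widehat{\mathsf{var}}_{\tau^{\prime}}]_{+}$ with unit constants, and selection by minimizing the sum of the two proxies. The names `bootstrap_var`, `gl_bias_proxy`, and `gl_select`, and the numbers used, are hypothetical.

```python
import random

def bootstrap_var(data, estimator, n_boot=200, seed=0):
    """Variance proxy: mean squared deviation of resampled estimates from the original one."""
    rng = random.Random(seed)
    base = estimator(data)
    n = len(data)
    total = 0.0
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]  # sample with replacement
        total += (estimator(resample) - base) ** 2
    return total / n_boot

def gl_bias_proxy(p_hat, var_hat, tau):
    """Assumed bias proxy: max over tau' <= tau of [(P_tau' - P_tau)^2 - var_tau']_+."""
    return max(
        max((p_hat[t] - p_hat[tau]) ** 2 - var_hat[t], 0.0)
        for t in p_hat
        if t <= tau
    )

def gl_select(p_hat, var_hat):
    """Return the tau minimizing bias proxy + variance proxy, with the full criterion."""
    crit = {tau: gl_bias_proxy(p_hat, var_hat, tau) + var_hat[tau] for tau in p_hat}
    return min(crit, key=crit.get), crit

# Made-up estimates of P(0) and variance proxies for tau = 2..5 (illustration only):
# the estimate is stable up to tau = 4 and jumps at tau = 5, mimicking tau_0 = 4.
p_hat = {2: 0.30, 3: 0.31, 4: 0.30, 5: 0.50}
var_hat = {2: 0.01, 3: 0.005, 4: 0.002, 5: 0.001}
tau_selected, criterion = gl_select(p_hat, var_hat)
print(tau_selected, criterion)
```

In practice `var_hat[tau]` would come from `bootstrap_var` applied to the observed abundances with the estimator $\widehat{P}_{\tau}$ for each candidate $\tau$. The dictionaries above only stand in for that step.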

3 Analysis of the Estimator

3.1 The semiparametric framework

We now analyze the convergence and optimality of our estimator in the context of efficient estimation, when the model contains nuisance parameters. We do so particularly in order to handle the nonparametric component $F$ within our semiparametric model. In the absence of such nuisance parameters, efficiency may be defined in terms of attaining the Cramér-Rao bound. In regular parametric models, the Cramér-Rao bound is the inverse of the variance of the score function, itself (often) defined as the derivative of the log-likelihood, and efficient estimators are at first order empirical means of the score function. The nuisance parameters, however, can lead to unavoidable loss in the accuracy of any estimator. The notion of efficiency can then be extended by assessing new lower bounds to the variance of the parameters of interest. We provide the details in Section A.4 below, and describe here what is useful to state our results.

One can define a set $\dot{\mathcal{P}}_{F}^{+}$ of score functions relative to the nonparametric part of the model, built using one-dimensional submodels (see Section A.4 for details). Then, if $\dot{\ell}_{(q,\theta)}$ is the usual score function (given by the partial derivative and gradient with respect to $q$ and $\theta$ respectively of the log-likelihood in the full model), the efficient score function related to $(q,\theta)$ is defined component-wise as $\widetilde{\ell}_{(q,\theta)}=\dot{\ell}_{(q,\theta)}-\varPi_{F}\dot{\ell}_{(q,\theta)}$, where $\varPi_{F}$ is the orthogonal projection onto the closure of the linear space spanned by $\dot{\mathcal{P}}_{F}^{+}$. The efficient score functions play the same role for efficient estimators (if they exist) as the ordinary score functions for the maximum likelihood estimators in a parametric model with no nuisance parameter. Namely, they lead to the best asymptotic variance for any estimator. The corresponding efficient Fisher information $\widetilde{I}_{(q,\theta)}$ is a matrix whose components are the variances and covariances of the various components of the vector of efficient score functions.

This leads to the following formal definition of the properties of consistency and efficiency:

Definition 1

As $N\to\infty$ , an estimator sequence $T_{D}=(\widehat{q},\widehat{\theta})$ is:

• Consistent, if $T_{D}\to(q,\theta)$ in probability.

• Efficient (asymptotically), if

[TABLE]

Note that the typical asymptotics for estimator sequences rely on increasing sample size. The sample size in our problem is $D$ , as the samples consist of the positive (observed) abundances $X_{1}^{+},\cdots,X_{D}^{+}$ . Thus, the sample size is a random quantity. Despite this, it is clear that as $N\to\infty$ , we also have that $D\to\infty$ in probability, and we therefore think of the two asymptotic notions interchangeably.

One of the challenges is that in many models the efficient score is not amenable to be used in the same way as the ordinary score because the orthogonal projection $\varPi_{F}$ might not be available in closed form. In Proposition 3 (stated and proved in Section A.4), we show that such a closed form can be obtained in our model, and give the expressions that ensue for the efficient score functions for estimating the parameters $\theta$ and $q$ in the model $\mathcal{P}^{+}.$

3.2 Consistency and Efficiency

In what follows, when the true model lies within the hypothesized class $\mathcal{P}$ , we refer to the true parameters by $\theta_{0}$ , $q_{0}$ , and $F_{0}$ , and to the true truncation by $\tau_{0}$ . We first list some regularity assumptions that we have recourse to throughout.

Assumptions

1. [Compactness] $\varTheta$ is a compact subset of $\mathbb{R}^{k}$.

2. [Identifiability] The parameter $\theta$ is identifiable from the truncated density $S_{\theta}^{\tau}$, as defined by Equation (12).

3. [Continuity] For all $x$ in $\{1,\ldots,\tau\}$, $\theta\mapsto R_{\theta}(x)$ is a continuous function of $\theta$, and $R_{\theta}(x)\geq\delta>0$ for all $\theta$ in $\varTheta$ and $x\leq\tau$.

Let us now move to the main results of this section, the consistency and efficiency of $\widehat{q}_{\tau}$ and $\widehat{\theta}_{\tau}$ whenever $\tau\leq\tau_{0},$ and some further properties that give more insight into these estimators. We begin with the consistency result stated below as Theorem 1.

Theorem 1

Under Assumptions 1-3, as $N$ tends to infinity, the following results hold:

$(i)$ If $\tau\leq\tau_{0}$, then $\widehat{\theta}_{\tau}$ and $\widehat{q}_{\tau}$ converge in probability to $\theta_{0}$ and $q_{0}$ respectively;

$(ii)$ If $\tau>\tau_{0}$, then $\widehat{\theta}_{\tau}$ converges in probability to the set of maximizers of $M^{\tau}(\theta)=\sum_{x=1}^{\tau}f^{+}(x)\log S_{\theta}^{\tau}(x)$.

The results in Theorem 1 are remarkable since they ensure the consistency of $\widehat{\theta}_{\tau}$ and $\widehat{q}_{\tau}$ for a fixed $\tau$ , as long as it is smaller than or equal to its true value $\tau_{0}$ and identifiability holds. If, however, one chooses $\tau$ greater than $\tau_{0},$ then the proposed estimators may not be consistent. (This leads to the challenge of choosing $\tau$ via model selection when $\tau_{0}$ is unknown, as described in Section 2.5). We now complement this consistency result with efficiency properties.

Theorem 2

Consider Assumptions 1-3, and assume further that $\theta\mapsto R_{\theta}(x)$ is $\mathcal{C}^{2}$ , that $\theta_{0}\in\varTheta^{\circ}$ , and that the efficient Fisher information is non-singular at $(q_{0},\theta_{0})$ . Then, $(\widehat{q}_{\tau_{0}},\widehat{\theta}_{\tau_{0}})$ is asymptotically efficient at $(q_{0},\theta_{0}).$

Remark 1

The estimators $\widehat{q}_{\tau}$ and $\widehat{\theta}_{\tau}$ have the following properties:

$(i)$ $\widehat{q}_{\tau}$ depends on the observations $x_{i}$ no greater than $\tau$ and on the cardinality of those $x_{i}$ that are greater than $\tau$.

$(ii)$ $\widehat{\theta}_{\tau}$ depends only on the observations $x_{i}$ no greater than $\tau$.

These follow either from direct inspection or from the proof of Lemma 2 in the Appendix. In particular, $\widehat{\theta}$ solves the efficient score equation (41), which depends only on abundances $x_{i}$ no greater than $\tau$. Now, from equation (40), for a given estimator $\widehat{\theta}$ of $\theta$, the estimator $\widehat{q}(\widehat{\theta})$ depends on the $x_{i}$ greater than $\tau$ only through their cardinality $D-D_{\tau}$, and the property follows.

Theorem 2 asserts the efficiency of $\widehat{\theta}_{\tau}$ and $\widehat{q}_{\tau}$, and through them of the corresponding estimator of the total number of species $\widehat{N}_{\tau}$. Remark 1 also sheds light on the fact that the latter depends only on: (1) the threshold $\tau$, (2) the number of observed species $D$, and (3) the abundances of rare species (those that are not greater than $\tau$). In other words, as in the case of pure truncation, the abundant species contribute only through their cardinality. That said, $\widehat{N}_{\tau}$ distinguishes itself by using this cardinality to estimate how to weigh appropriately the respective contributions of both the rare and abundant species, using the parameter $q$.

4 Simulations and Experiments

To illustrate the impact of truncation on our ability to estimate the number of species, we give some numerical simulations and experiments. To make our theoretical work concrete and results easily reproducible, we consider simple parametric families. In particular we look at a single Poisson distribution and a Gamma-Poisson mixture, which gives rise to the negative binomial distribution. In Section 4.1, we perform synthetic experiments for both, and use this to illustrate the heuristic method of selecting the best truncation. In Section 4.2, we consider real data in the form of literary texts, and confine ourselves to the negative binomial model. In order to be able to compare to a known ground truth, we adapt our number of species framework to the very related observational richness problem, and show that the choice of truncation has a significant impact on estimation accuracy.

4.1 Number of Species Simulations

4.1.1 Algorithms to compute $\widehat{\theta}_{\tau}$

As we take $R_{\theta}$ to be a parametric family, many of the EM-style MLE algorithms for parameter estimation in such frameworks can be adapted to zero- to $\tau$-truncated versions of the distributions. This is all that is needed since, for a fixed value of $\tau$, computing $\widehat{\theta}_{\tau}$ is an optimization problem that amounts to solving the efficient score equations in (38), and by Theorem 1 this is equivalent to solving equation (41). For example, when $R_{\theta}$ is a Poisson distribution, it is not difficult to check that the truncated MLE (41) becomes exactly

[TABLE]

leading to the fixed point equation $\theta=\bar{X}^{\tau}\frac{\mathbf{P}_{\theta}(\tau)-\exp(-\theta)}{\mathbf{P}_{\theta}(\tau-1)}$, in which $\mathbf{P}_{\theta}$ stands for the cumulative distribution function of the Poisson model with parameter $\theta$. This is equivalent to moment-matching, and the solution $\widehat{\theta}_{\tau}$ can be found numerically by performing a bisection search, for example. Similar parameter searches can be performed for the truncated negative binomial distribution that we consider in this section. In a more complex model where $R_{\theta}$ is a finite mixture of Poisson distributions, that is, when $R_{\theta}(x)=\sum_{j=1}^{J}\pi_{j}R_{\theta_{j}}(x)$ for all $x\geq 0$, we can derive an EM algorithm for the truncated MLE similarly to the classical Poisson mixture. We do not elaborate on this further, except to mention that each EM iteration entails the solution of fixed point equations, as in (17), for each Poisson component.
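The bisection search mentioned above can be sketched in Python as follows. The fixed point equation is equivalent to matching the mean of a Poisson variable conditioned on $1\leq X\leq\tau$, namely $\theta\,\mathbf{P}_{\theta}(\tau-1)/(\mathbf{P}_{\theta}(\tau)-e^{-\theta})$, to $\bar{X}^{\tau}$. The sketch assumes that $\bar{X}^{\tau}$ denotes the average of the observed abundances no greater than $\tau$. The function names and the toy data are hypothetical.

```python
def truncated_poisson_mean(theta, tau):
    """Mean of Poisson(theta) conditioned on 1 <= X <= tau (the factor exp(-theta) cancels)."""
    term = theta  # unnormalized mass theta^x / x! at x = 1
    mass, first_moment = 0.0, 0.0
    for x in range(1, tau + 1):
        mass += term
        first_moment += x * term
        term *= theta / (x + 1)
    return first_moment / mass

def theta_hat_tau(counts, tau, n_iter=200):
    """Solve truncated_poisson_mean(theta, tau) = mean of counts in {1..tau} by bisection."""
    rare = [x for x in counts if 1 <= x <= tau]
    xbar = sum(rare) / len(rare)
    if not 1.0 < xbar < tau:
        raise ValueError("the truncated sample mean must lie strictly between 1 and tau")
    lo, hi = 1e-12, 1.0
    while truncated_poisson_mean(hi, tau) < xbar:  # expand until the root is bracketed
        hi *= 2.0
    for _ in range(n_iter):  # the truncated mean is increasing in theta
        mid = 0.5 * (lo + hi)
        if truncated_poisson_mean(mid, tau) < xbar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy abundances: 60 singletons, 30 doubletons, 30 abundant species.
counts = [1] * 60 + [2] * 30 + [15] * 30
theta_2 = theta_hat_tau(counts, tau=2)
print(theta_2)
```

For $\tau=2$ the solution reduces to $2n_{2}/n_{1}$, which gives a simple check of the routine on the toy data.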

Design

To investigate the performance of the new estimator and compare it to other existing estimators, we conducted a set of experiments with synthetic data.

In the first set of these experiments, we take the abundances of rare species to be distributed according to a single Poisson distribution with parameter $\theta$, and the nuisance distribution (of abundant species) to be the uniform distribution on $\{\tau^{*},\ldots,\tau_{\max}\}$. The resulting distribution has density $qR_{\theta}(x)+(1-q)U(x)$ with $0<q<1$ and $U$ the aforementioned uniform distribution. Now, for any fixed $N\in\{200,1000,5000,10000\}$, we generate a sample of size $N$ from the Bernoulli model with parameter $q\in\{0.4,0.6,0.8\}$, then generate the corresponding count observations according to the Poisson or uniform model. The parameters $\tau^{*}$ and $\tau_{\max}$ are fixed at $10$ and $40$ respectively, whereas $\theta$ ranges over $\{0.6,1,1.5\}$. The observed zero-truncated counts are used to compute our new estimator $\widehat{N}_{\widehat{\tau}}$ and some other existing estimators with which $\widehat{N}_{\widehat{\tau}}$ will be compared.
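The data-generating step of this first design can be sketched in Python using only the standard library. The Poisson sampler below uses Knuth's multiplication method, which is a choice made for this illustration rather than something stated in the paper. The function names and the seed are hypothetical.

```python
import math
import random

def sample_poisson(theta, rng):
    """Knuth's multiplication method; adequate for the small theta values used here."""
    limit, k, prod = math.exp(-theta), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_observed_counts(N, q, theta, tau_star=10, tau_max=40, seed=0):
    """Draw N abundances from q * Poisson(theta) + (1 - q) * Uniform{tau_star..tau_max},
    and keep only the positive ones, as unseen species are not observed."""
    rng = random.Random(seed)
    observed = []
    for _ in range(N):
        if rng.random() < q:                    # rare species
            x = sample_poisson(theta, rng)
        else:                                   # abundant species
            x = rng.randint(tau_star, tau_max)  # both endpoints included
        if x > 0:
            observed.append(x)
    return observed

counts = simulate_observed_counts(N=5000, q=0.6, theta=1.0, seed=1)
D = len(counts)                                 # number of observed species
D_10 = sum(1 for x in counts if x <= 10)        # D_tau for tau = 10
print(D, D_10)
```

The resulting list plays the role of $X_{1}^{+},\cdots,X_{D}^{+}$, from which $D$, $D_{\tau}$, and the counts $n_{x}$ can be tabulated for each candidate $\tau$.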

To show that the results extend to other parametric families, we perform a limited second set of experiments, where we take the abundance of rare species to be distributed according to a Gamma-Poisson mixture, which leads naturally to the negative binomial distribution. In particular, in this case $\theta$ is two-dimensional, consisting of real parameters $r>0$ and $s>0$, and in Equation (1) we have $\nu_{\theta}=\Gamma(r,s)$. This results in $R_{\theta}$ being the negative binomial distribution with parameters $r$ and $p=1/(1+s)$. We fix $p=0.8$ and take $r$ to vary over the range $\{0.5,1,2\}$. Larger values of $N$ are needed to learn this model even in the absence of nonparametric noise. We consider the range $N\in\{10000,20000,50000\}$, and we generate a sample of size $N$ from the Bernoulli model with parameter $q\in\{0.4,0.6,0.8\}$, then generate the corresponding count observations according to the negative binomial or uniform model. That is, the observational model is as before, $qR_{\theta}(x)+(1-q)U(x)$.

Risk approximation using G-L method

We use the simulations as an opportunity to illustrate the G-L method and show the quality of the risk estimation by the proposed proxy in the selection rule. As displayed in Figure 1, the proxy $\widehat{\mathsf{bias}}_{\tau}+\widehat{\mathsf{var}}_{\tau}$ provides a good approximation of the true risk when $\widehat{\mathsf{var}}_{\tau}$ is computed by a bootstrap procedure as in Equation (15). Note that in this numerical example we calculate the risk, bias proxy, and variance proxy for $N$ instead of $P(0)$. The approximation is remarkably accurate, especially in the region where the estimator $\widehat{N}_{\tau}$ is asymptotically unbiased (that is, for $\tau\leq\tau^{*}$). It remains satisfactory, but not overly so, for some $\tau$ greater than $\tau^{*}$. This indicates that the bootstrap procedure is a good choice to estimate $\mathsf{var}_{\tau}$. Note that Figure 1 corresponds to the results of simulations of the single Poisson model with parameters $q=0.6$, $\theta=1$, and $N=1000$. We obtain similar results for all other parameter choices.

Performances of $\widehat{N}_{\widehat{\tau}}$

We focus on the performance of $\widehat{N}_{\widehat{\tau}}$ by calculating its Monte-Carlo mean and the renormalized standard error $(\frac{S_{e}}{N})$ based on $1000$ samples. We also investigate the bootstrap-based confidence interval for $N$ by providing the estimated non-coverage probabilities

[TABLE]

and

[TABLE]

where $I^{(j)}=[N_{inf}^{(j)},N_{sup}^{(j)}]$ is the bootstrap-based confidence interval using the estimated model from the $j^{th}$ Monte-Carlo sample. For the single Poisson model, the results are summarized in Table 1. It is clear that the renormalized $S_{e}$ decreases when $\theta$ grows and increases as $q$ becomes larger. Since small values of $\theta$ characterize small abundances and a high value of $q$ means that there is a large number of rare species in the community (according to the simulated model), the observed variation of $S_{e}$ suggests that a high number of rare species will be estimated with larger variance. We can also notice that $S_{e}$ decreases with $N$ in all simulated configurations, showing the accuracy of the method when $N$ becomes larger. As large values of $N$ describe the asymptotic regime of the estimators $\widehat{\theta}_{\tau}$ and $\widehat{q}_{\tau}$, we believe that the observed accuracy is related to the asymptotic efficiency of those estimators, which improves the variance and hence the mean square error (MSE) of $\widehat{N}_{\widehat{\tau}}$, as will be seen later. Table 2 summarizes the results for the Gamma-Poisson mixture model, with very comparable observations. Note that both sets of experiments show that we cannot rely on bootstrap confidence intervals as true intervals for the estimator. While the bootstrap is adequate in estimating the variability of the estimator, it does not accurately convey its location. It exhibits a clear skew to smaller values, which could be explained by the fact that resampling from the base distribution reduces the number of distinct observations. Therefore more principled methods are needed to go beyond point estimates in species richness estimation. One such avenue is through the use of concentration inequalities; see Ben Hamou et al. (2017).

Comparison with other estimators

We end the simulations by comparing the proposed estimator of the number of species to other existing ones in the literature. We focus entirely on the single Poisson model, which represents the ground truth assumption of many of these estimators. We consider Chao's estimator $\widehat{N}_{Ch_{0}}$, defined as a lower bound for $N$ and proposed in Chao (1984); the coverage-based estimator $\widehat{N}_{CL}$ proposed by Chao & Lee (1992); the estimator $\widehat{N}_{CB}$ of $N$ using the expected proportion of duplicate species in the sample, by Chao & Bunge (2002); the nonparametric MLE $\widehat{N}_{WL_{0}}$ of $N$ using a penalized likelihood, by Wang & Lindsay (2005); and $\widehat{N}_{LB}$, an extension of Chao's estimator proposed by Lanumteang & Böhning (2011). The criteria used for this comparison (Mean, rMAE: relative Mean Absolute Error, and rMSE: relative Mean Square Error) are computed and presented in Table 3. The six estimators display a good performance in all simulated configurations, and $\widehat{N}_{\widehat{\tau}}$ seems to estimate $N$ better than all other methods. This is quantified in Table 3 by the remarkably small value of rMSE compared to the others. This shows that, despite our results being about the asymptotic efficiency of $\widehat{\theta}_{\tau}$ and $\widehat{q}_{\tau}$, we can expect finite-sample improvements for the estimator $\widehat{N}_{\widehat{\tau}}$ when $N$ is moderately large. Also note that all six estimators become less reliable for very small values of $\theta$ or large values of $q$. This reflects the common difficulty these approaches have in approximating $N$ when there is a large number of rare species, and touches upon the inherent problems of unidentifiability (Mao & Lindsay, 2007).

4.2 Observational richness in text data

Rather than estimating the absolute number of species, an important extension of the species richness problem is concerned with estimating the number of distinct species to be observed in a sample larger than the current sample of individuals. Indeed, the abundance data $X_{1}^{+},\cdots,X_{D}^{+}$ is ostensibly obtained by performing a sampling of individuals. If the said sample is enlarged, then how do the new abundances relate to the original ones? In the words of Fisher et al. (1943), in a pure Poisson abundance model: “Obviously, [the parameter $\lambda$ ] will be proportional to the size of the sample taken […]”. This is most easily seen in the individual sampling model of Equation (3): when the binomial size parameter is changed from $n$ to $n^{\prime}=\gamma n$, the parameters of the corresponding Poisson mixture are changed from $\lambda=np_{j}$ to $\lambda^{\prime}=n^{\prime}p_{j}=\gamma\lambda$.

Generally in a Poisson mixture model, therefore, a $\gamma$ factor increase in the sample size is equivalent to a $\gamma$ dilation of the mixture distribution. Let $\mathbf{E}^{\gamma}[D]$ denote the expected number of distinct symbols in the enlarged sample, and thus $\mathbf{E}^{1}[D]=\mathbf{E}[D]=N(1-qR_{\theta}(0))$ . The observational richness estimation problem can thus be concretely stated as the problem of estimating $\mathbf{E}^{\gamma}[D]$ , based on $X_{1}^{+},\cdots,X_{D}^{+}$ .

One application of the observational richness problem is to forecast the vocabulary of an author from a portion of their text. This was popularized in the work of Efron & Thisted (1976), who applied this methodology to the complete works of William Shakespeare. The problem goes back to the work of Good & Toulmin (1956), who approached it from an empirical Bayesian perspective, without any specific parametrization. The earlier work of Fisher et al. (1943) also implicitly addressed the same problem.

Here, we restrict ourselves to the context of a parametric Poisson mixture abundance model for $R_{\theta}$, that is, as in Equation (1), with $\nu=\nu_{\theta}$ appropriately parametrized by $\theta$. We require the family of such densities to be closed under dilation; that is, for all $\theta$ and $\gamma>0$, there exists $\theta^{\gamma}$ such that for all measurable subsets $A$, $\nu_{\theta^{\gamma}}(A)=\nu_{\theta}(A/\gamma)$. Furthermore, we assume that for fixed $\gamma$, the transformation $\theta\mapsto\theta^{\gamma}$ is continuous in the sense that if a sequence $\theta_{i}\to\theta$ then the sequence $\theta^{\gamma}_{i}\to\theta^{\gamma}$. Note that for discrete mixtures, the scaling simply moves each support point by a factor $\gamma$, and for continuous mixtures it expands and scales the density by $\gamma$; the requirement in either case is for the resulting distribution to remain an element of the parametric family.

As we focus primarily on text data, the Gamma-Poisson mixture family is very well-suited. Recall that in this case $\theta$ is two-dimensional, consisting of real parameters $r>0$ and $s>0$, with $\nu_{\theta}=\Gamma(r,s)$, and that $R_{\theta}$ is the corresponding negative binomial distribution with parameters $r$ and $p=1/(1+s)$. To dilate the Gamma distribution, it is easy to see that one simply scales $s^{\prime}=\gamma s$. This corresponds to a transformation of the negative binomial parameter $p^{\prime}=1/(1+\gamma(1-p)/p)$.
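The dilation of the negative binomial parameters can be written as a short Python sketch. It follows the relations $p=1/(1+s)$ and $s^{\prime}=\gamma s$ stated above, and uses the fact that under this parametrization the zero probability of the Gamma-Poisson mixture is $R_{\theta}(0)=(1+s)^{-r}=p^{r}$, which assumes that $s$ is the scale of the Gamma distribution. The function names are hypothetical.

```python
def dilate_negative_binomial(r, p, gamma):
    """Map (r, p) to the parameters of the gamma-dilated Gamma-Poisson mixture."""
    s = (1.0 - p) / p            # recover the Gamma scale from p = 1 / (1 + s)
    s_dilated = gamma * s        # dilation scales s and leaves r unchanged
    return r, 1.0 / (1.0 + s_dilated)

def negative_binomial_zero_prob(r, p):
    """R_theta(0) for the Gamma-Poisson mixture with p = 1 / (1 + s)."""
    return p ** r

r, p = 2.0, 0.8
r_new, p_new = dilate_negative_binomial(r, p, gamma=3.0)
print(p_new, negative_binomial_zero_prob(r, p), negative_binomial_zero_prob(r_new, p_new))
```

Enlarging the sample ($\gamma>1$) lowers $p$ and hence the probability $R_{\theta}(0)$ that a rare species remains unseen.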

This paper’s framework applies to this problem as follows. If the rare abundances are well modeled by a Gamma-Poisson mixture while the abundant ones are not, then our framework allows us to efficiently learn the parameters $q$ and $\theta$ . By continuity, for fixed $\gamma$ we also have an efficient estimator of $\theta^{\gamma}$ . Since $N$ is assumed to stay constant, we then have

[TABLE]

We could therefore use our estimates $\widehat{q}_{\tau}$ and $\widehat{\theta}_{\tau}$ to evaluate $\widehat{\theta}_{\tau}^{\gamma}$ and thus to estimate $\mathbf{E}^{\gamma}[D]$ as follows:

[TABLE]
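A plug-in version of this estimate can be sketched in Python for the Gamma-Poisson case. Since the two displayed equations are not reproduced here, the sketch assumes that $\mathbf{E}^{\gamma}[D]=N(1-qR_{\theta^{\gamma}}(0))$, in analogy with $\mathbf{E}[D]=N(1-qR_{\theta}(0))$, and that the estimate rescales the observed $D$ by the ratio of the two factors. The function name and the parameter values are hypothetical.

```python
def expected_distinct_enlarged(D, q_hat, r_hat, p_hat, gamma):
    """Assumed plug-in estimate of E^gamma[D] for a negative binomial R_theta.

    D      : number of distinct species (words) currently observed
    q_hat  : estimated weight of the rare component
    r_hat, p_hat : estimated negative binomial parameters, with R_theta(0) = p^r
    gamma  : enlargement factor of the sample
    """
    p_dilated = 1.0 / (1.0 + gamma * (1.0 - p_hat) / p_hat)
    unseen_now = q_hat * p_hat ** r_hat        # q * R_theta(0)
    unseen_later = q_hat * p_dilated ** r_hat  # q * R_{theta^gamma}(0)
    return D * (1.0 - unseen_later) / (1.0 - unseen_now)

# Made-up estimates, for illustration only.
D, q_est, r_est, p_est = 1500, 0.7, 0.5, 0.6
n_est = D / (1.0 - q_est * p_est ** r_est)     # implied estimate of N
for gamma in (1.0, 2.0, 5.0):
    print(gamma, expected_distinct_enlarged(D, q_est, r_est, p_est, gamma))
```

With $\gamma=1$ the function returns $D$ itself, and as $\gamma$ grows the forecast increases towards the implied estimate of $N$.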

The data we consider is the play Tartuffe by the French playwright Molière, of which we gradually observe a portion and try to estimate the number of distinct vocabulary words. Thus, the scale $\gamma$ is the ratio of the total text size to the size of the observed text, varying from [math] to $100\%$. For this problem, we illustrate $\widehat{\mathbf{E}^{\gamma}[D]}_{\tau}$ for various choices of $\tau$ and also the Goldenshluger-Lepski selected $\widehat{\tau}$ in Figure 2. Note how quickly the result becomes an accurate estimate of the vocabulary. Most importantly, note how sub-optimal choices of the truncation can adversely affect the performance of the estimator.

5 Conclusion

In this paper, we revisited the species richness estimation problem and studied a commonly followed practice of truncating the data into rare and abundant species. We proposed a semiparametric framework to model such a truncation as a parametric component well-suited to model rare species and a nonparametric nuisance component to cover the abundant species in an agnostic manner. We showed that asymptotic efficiency in this framework requires handling the truncation more delicately. This is in particular true if the rare species model has a significant overlap with the abundant species. Finally, we proposed a heuristic method to learn a good truncation threshold from data.

Several possible avenues of investigation may be proposed. We already mentioned the importance of going beyond point estimates. One would also like to relax the assumption that the abundant species are truly located entirely away from zero. In particular, it is important to handle the situation when such a dichotomy arises from an underlying binomial mixture model. Some recent approaches to species richness have successfully used Chebyshev polynomials as a fitting model, see for example Orlitsky et al. (2016), and one would like to understand the relationship between such fits and mixtures of Poissons. Finally, one would hope that a truncation threshold that automatically conforms to the underlying model could make the most of the available data and thus give a fundamental theoretical edge, perhaps in the form of adaptive rates.

Appendix A Proofs

A.1 Proof of Proposition 1

For any fixed $\tau,$ the classical conditional MLE satisfies

[TABLE]

Let us consider now the new conditional MLE proposed in this work.

[TABLE]

But using equation (11), we have

[TABLE]

from which we get

[TABLE]

Now, if $R_{\theta}$ is supported on $\{0,\ldots,\tau\},$ then $\sum_{k=1}^{\tau}R_{\widehat{\theta}_{\tau}}(k)$ equals $1-R_{\widehat{\theta}_{\tau}}(0)$ in the last expression which finally gives $\widehat{N}_{\tau}=\widehat{N}_{\mathrm{classical}}$ . $\Box$

A.2 Proof of Proposition 2

For $\tau=2$ and $R_{\theta}$ being the Poisson distribution with parameter $\theta$, it is not difficult to see that $\widehat{\theta}_{\tau}=\widehat{\theta}_{\mathrm{Zelterman}}=2n_{2}/n_{1}$. Then,

[TABLE]

The last equality holds by replacing $\widehat{\theta}_{\tau}$ and $D_{\tau}$ by $2n_{2}/n_{1}$ and $n_{1}+n_{2}$ respectively. $\Box$

A.3 Proof of Theorem 1

To prove $(i),$ we use the fact that $\widehat{\theta}$ is the maximum likelihood estimator in the model with density $S_{\theta}^{\tau}$ . Note that maximizing the likelihood of equation (13) amounts to maximizing in $\theta$ the criterion $\mathcal{L}_{D}(\theta)=\sum_{x=1}^{\tau}\frac{n_{x}}{D_{\tau}}\log S_{\theta}^{\tau}(x)$ which as $N$ tends to infinity converges almost surely to $\mathcal{L}(\theta)=\sum_{x=1}^{\tau}S_{\theta_{0}}^{\tau}(x)\log S_{\theta}^{\tau}(x)$ when $\tau\leq\tau_{0}$ . Moreover, we have

[TABLE]

On the right hand side of this inequality, $\frac{n_{x}}{D_{\tau}}-S_{\theta_{0}}^{\tau}(x)$ converges almost surely, thus in probability, to zero. Also, $|\log S_{\theta}^{\tau}(x)|$ is bounded since $R_{\theta}(x)\geq\delta>0$ , for all $\theta\in\varTheta$ and all $x\leq\tau.$ We conclude that

[TABLE]

It is easy to see that

[TABLE]

attains its unique maximum (equal to zero) at $\theta_{0}$ since the true model $S_{\theta_{0}}^{\tau}$ is identifiable, as assumed. We then obtain

[TABLE]

where $d(\theta,\theta_{0})$ is the Euclidean distance between $\theta$ and $\theta_{0}$ . As $\widehat{\theta}$ maximizes $\mathcal{L}_{D},$ we have $\mathcal{L}_{D}(\widehat{\theta})\geq\mathcal{L}_{D}(\theta_{0})-o_{\mathbb{P}}(1).$ This, together with the condition in equation (18) and the above convergence in probability, entails that $\widehat{\theta}$ converges in probability to $\theta_{0}$ as $N$ tends to infinity. This follows from Theorem $5.7$ in van der Vaart, (1998).

To end part $(i)$ of the theorem, recall Equation (11):

[TABLE]

We then observe from the law of large numbers that as $N$ tends to infinity, $\frac{D}{D_{\tau}}$ converges almost surely to $\frac{1-q_{0}R_{\theta_{0}}(0)}{q_{0}\sum_{k=1}^{\tau}R_{\theta_{0}}(k)}$ when $\tau\leq\tau_{0}.$ Recall that we assume $R_{\theta}(x)$ to be continuous in $\theta$ for each $x$ . Thus, using the continuous mapping theorem and the convergence in probability of $\widehat{\theta}$ to $\theta_{0}$ , we find that $R_{\widehat{\theta}}(0)$ and $\sum_{k=1}^{\tau}R_{\widehat{\theta}}(k)$ converge in probability to $R_{\theta_{0}}(0)$ and $\sum_{k=1}^{\tau}R_{\theta_{0}}(k)$ respectively when $\tau\leq\tau_{0}$ . We finally obtain the convergence in probability of $\widehat{q}(\widehat{\theta})$ to

[TABLE]

using once again the continuous mapping theorem. This ends the proof of part $(i)$ of Theorem 1.
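The limit of $\widehat{q}(\widehat{\theta})$ can be made explicit under one assumption: that Equation (11) has the form $\widehat{q}(\theta)=D_{\tau}/\bigl(D\sum_{k=1}^{\tau}R_{\theta}(k)+D_{\tau}R_{\theta}(0)\bigr)$. This form is inferred here from the limit of $D/D_{\tau}$ stated above rather than quoted from the paper. Under that assumption, dividing numerator and denominator by $D_{\tau}$ and passing to the limit gives $q_{0}$:

```latex
\widehat{q}(\widehat{\theta})
  = \frac{1}{\frac{D}{D_{\tau}}\sum_{k=1}^{\tau}R_{\widehat{\theta}}(k)+R_{\widehat{\theta}}(0)}
  \;\xrightarrow{\ \mathbb{P}\ }\;
  \frac{1}{\frac{1-q_{0}R_{\theta_{0}}(0)}{q_{0}}+R_{\theta_{0}}(0)}
  = \frac{q_{0}}{1-q_{0}R_{\theta_{0}}(0)+q_{0}R_{\theta_{0}}(0)}
  = q_{0}.
```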

Similar arguments to what we have given here can be used to prove part $(ii)$ of the Theorem, namely that if $\tau>\tau_{0}$ , then $\widehat{\theta}$ converges in probability to the set of maximizers of $M^{\tau}(\theta)=\sum_{x=1}^{\tau}f^{+}(x)\log S_{\theta}^{\tau}(x)$ , in the sense that the probability of falling in an $\epsilon$ -dilation of this set tends to $1$ as $N\to\infty$ . $\Box$

A.4 Efficient score functions and efficient Fisher information

We now build up some notation. Let $\mathcal{G}$ denote the set of measurable functions defined on the support of $F$ by

[TABLE]

For a given $G$ in $\mathcal{G},$ a real number $a$ and a vector $b$ of dimension $k$ let us define $q_{t}=q+at,$ $\theta_{t}=\theta+bt$ and $F_{t}=F(1+tG).$ This parametrization of $F,$ $q$ and $\theta$ defines a path (a one-dimensional sub-model) $f_{t}^{+}=f_{(q_{t},\theta_{t},F_{t})}^{+}$ in the model $\mathcal{P}^{+}.$ To simplify the notation, we let $f^{+}$ stand for $f_{(q,\theta,F)}^{+}$ and $f$ for $f_{(q,\theta,F)}$ . Recall the definition of score functions.

Definition 2

A differentiable path is a map $t\mapsto f_{t}^{+}$ from a neighborhood $[0,\varepsilon)$ of $0$ to $\mathcal{P}^{+}$ with $f_{0}^{+}=f^{+}$ such that, for some measurable real valued function $g,$ one has

[TABLE]

The one-dimensional sub-model $\left\{f_{t}^{+},t\in[0,\varepsilon)\right\}$ is then said to be differentiable in quadratic mean at $f^{+}$ with score function $g$ .

A more useful way to determine the score function of a model such as $\left\{f_{t}^{+},t\in[0,\varepsilon)\right\}$ is to take the derivative with respect to $t$ of the log-likelihood at $t=0,$ that is

[TABLE]

We will use a dot-notation to indicate differentiation with respect to a parameter. Recall first the parametric score function $\dot{\ell}_{q}$ and the parametric vector score function $\dot{\ell}_{\theta}$ which are the partial derivative and gradient with respect to $q$ and $\theta$ respectively of the log-likelihood in the full model. We have respectively

[TABLE]

with $\dot{R}_{\theta}$ the gradient function of the density $R_{\theta}.$

In the model defined in Equation (5), a straightforward calculation shows that the score function $g$ of the one-dimensional sub-model is such that

[TABLE]

where $a$ and $b$ are the scaling scalar and $k$ -dimensional vector of the parametrizations $q_{t}$ and $\theta_{t}$ respectively, and where $\langle\cdot,\cdot\rangle$ denotes the usual inner product.

Now, we recall briefly the notions of tangent set and efficient score function for the model considered here. The maximal tangent set to the model $\mathcal{P}^{+}$ at $f^{+}$ is the set of all score functions of a one-dimensional sub-model. We denote it $\mathcal{\dot{P}}^{+}$ , and in our case it is given by

[TABLE]

Consider again the path $t\mapsto f_{(q,\theta,F_{t})}^{+}$ related to the model $\mathcal{P}^{+}$ , but now with the parameters $q$ and $\theta$ fixed, then the tangent set at $f^{+}$ for the nonparametric part of the model in (5) is denoted and given by

[TABLE]

The efficient score function related to a given component $\alpha$ of the parameter vector $(q,\theta_{1},\cdots,\theta_{k})$ is then defined component-wise as $\widetilde{\ell}_{\alpha}=\dot{\ell}_{\alpha}-\varPi_{F}\dot{\ell}_{\alpha},$ where $\varPi_{F}$ is the orthogonal projection onto the closure of the linear space spanned by $\mathcal{\dot{P}}_{F}^{+}.$

The expressions of the efficient score functions are given in the following proposition; the coefficients of the efficient Fisher information matrix are displayed in its proof.

Proposition 3

The efficient score functions for estimating the parameters $q$ and $\theta$ are given for $x\geq 1$ by

[TABLE]

and

[TABLE]

respectively. The efficient Fisher information $\widetilde{I}$ is a matrix of order $(k+1)$ with coefficients given by equations (31)-(33).

Proof

Recall from (24) the definition of the tangent set of the nonparametric part of the one-dimensional sub-model:

[TABLE]

and let $\overline{lin}(\mathcal{\dot{P}}_{F}^{+})$ denote the closure of the linear space spanned by $\mathcal{\dot{P}}_{F}^{+}$ in $\mathbb{L}^{2}(f^{+}).$ With this notation, recall that the efficient score function related to a given component $\alpha$ of the parameter vector $(q,\theta_{1},\cdots,\theta_{k})$ is defined component-wise as $\widetilde{\ell}_{\alpha}=\dot{\ell}_{\alpha}-\varPi_{F}\dot{\ell}_{\alpha},$ where $\varPi_{F}$ is the orthogonal projection onto $\overline{lin}(\mathcal{\dot{P}}_{F}^{+})$ in $\mathbb{L}^{2}(f^{+})$ .

To reduce clutter, let $\dot{\ell}$ refer to a particular component $\dot{\ell}_{\alpha}$ . We first give a closed form expression of the orthogonal projection. In particular, we have that:

[TABLE]

where $c(\dot{\ell})$ is a constant depending on $\dot{\ell}$ as follows:

[TABLE]

To see this, first observe that for every score function $\dot{\ell}$ in the model $\mathcal{P}^{+},$ the projection $\varPi_{F}\dot{\ell}$ is an element of the subspace $\overline{lin}(\mathcal{\dot{P}}_{F}^{+})$ so that it must be a linear combination (or a limit thereof) of elements of the form $\frac{(1-q)FG_{0}}{f}$ for some $G_{0}\in\mathcal{G}$ . Since the latter all vanish on the set $\{1,\ldots,\tau\}$ , so does $\varPi_{F}\dot{\ell}$ .

Next, let $\widetilde{h}$ be any $\mathbb{L}^{2}(f^{+})$ -integrable function that is orthogonal to the space $\overline{lin}(\mathcal{\dot{P}}_{F}^{+})$ , that is:

[TABLE]

In particular, note that such an $\widetilde{h}$ is orthogonal to elements of $\mathcal{\dot{P}}_{F}^{+}$ itself. These, once again, have the form $\frac{(1-q)FG_{0}}{f}$ for some $G_{0}\in\mathcal{G}$ . By design, let us choose $G_{0}$ such that $G_{0}(x_{1})=F(x_{2})$ and $G_{0}(x_{2})=-F(x_{1})$ for $x_{1},x_{2}$ in the support of $F$ and $G_{0}(x)=0$ elsewhere. It is easy to verify that such a choice does indeed lie within $\mathcal{G}$ . On the other hand, the orthogonality of $\widetilde{h}$ and $\frac{(1-q)FG_{0}}{f}$ in $\mathbb{L}^{2}(f^{+})$ implies that:

[TABLE]

or equivalently

[TABLE]

As $F(x)$ is strictly positive over its support, this implies that $\widetilde{h}(x_{1})-\widetilde{h}(x_{2})=0.$ Thus all such $\widetilde{h}$ must be constant on the support of $F$ .
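The two claims about this choice of $G_{0}$ can be checked in one line each. The sketch below assumes that membership in $\mathcal{G}$ amounts to the centering constraint $\sum_{x}F(x)G(x)=0$ (the property used later in this proof) and that $f^{+}(x)$ is proportional to $f(x)$ for $x\geq 1$; both are read off from the surrounding text rather than from the displayed definitions.

```latex
\begin{align*}
\sum_{x}F(x)G_{0}(x)
  &= F(x_{1})F(x_{2})-F(x_{2})F(x_{1}) = 0,\\
\sum_{x\geq 1}\widetilde{h}(x)\,\frac{(1-q)F(x)G_{0}(x)}{f(x)}\,f^{+}(x)
  &\;\propto\; \sum_{x}\widetilde{h}(x)F(x)G_{0}(x)
  = F(x_{1})F(x_{2})\bigl(\widetilde{h}(x_{1})-\widetilde{h}(x_{2})\bigr).
\end{align*}
```

Setting the second expression to zero and dividing by $F(x_{1})F(x_{2})>0$ gives $\widetilde{h}(x_{1})=\widetilde{h}(x_{2})$ for every pair of points in the support of $F$.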

Now let us specialize $\widetilde{h}$ to the components of the efficient score function, by writing them as $\widetilde{\ell}=\dot{\ell}-\varPi_{F}\dot{\ell}$ . Since we have determined that $\varPi_{F}\dot{\ell}$ vanishes on $x\leq\tau$ and that $\widetilde{\ell}$ is constant over $x>\tau$ , we have established the expression of the projection in Equation (27), as claimed. To obtain the expression of the constant in Equation (28), we can once again use the fact that $\varPi_{F}\dot{\ell}$ is a linear combination of $(1-q)FG_{0}/f$ for $G_{0}\in\mathcal{G}$ , in addition to the fact that $\sum_{x}FG_{0}=0$ for all such $G_{0}$ , to write:

[TABLE]

Now, we can easily compute the efficient score functions using:

[TABLE]

Using the expressions of $\dot{\ell}_{q}$ and $\dot{\ell}_{\theta}$ in Equation (22), we explicitly get $\widetilde{\ell}_{q}$ and $\widetilde{\ell}_{\theta}$ . We start with $\widetilde{\ell}_{q}$ . We have:

[TABLE]

We then determine $c(\dot{\ell}_{q})$ from Equation (28),

[TABLE]

and we finally obtain $\widetilde{\ell}_{q}(x)$ as

[TABLE]

Moving on to $\dot{\ell}_{\theta}$ , from Equation (22) and using the fact that $\sum_{x}\dot{R}_{\theta}(x)=0$ , we have:

[TABLE]

Then

[TABLE]

and

[TABLE]

The efficient Fisher information matrix has coefficients defined as

[TABLE]

for all $i,j=1,\dots,k.$ Recall that when we write $\widetilde{\ell}_{\theta}$ , we are referring to a vector of score functions, whereas $\widetilde{\ell}_{\theta_{j}}$ stands for the $j^{th}$ coordinate of $\widetilde{\ell}_{\theta}$ . The computation of these coefficients leads to

[TABLE]

with $\dot{R}_{\theta}^{j}$ the partial derivative of $R_{\theta}$ with respect to the $j^{th}$ coordinate of $\theta.$ $\Box$

A.5 Proof of Theorem 2

We first state and prove two lemmas that will be used for the proof of Theorem 2.

As usual, let $\alpha$ be a component of the parameter vector $(q,\theta)$ , denote by $\alpha_{0}$ the true value of $\alpha$ (if it exists), and let $v(\alpha_{0})$ be a closed neighborhood of $\alpha_{0}$ . We denote by $\mathcal{H}_{\alpha}$ the subset of $\mathbb{L}^{2}(f^{+})$ defined by

[TABLE]

Lemma 1

Let $\widehat{\alpha}$ be a consistent estimator of $\alpha_{0}.$ If $\theta\mapsto R_{\theta}(x)$ is twice continuously differentiable for every $x\leq\tau$ and Assumptions $1$ - $3$ hold, then $\mathcal{H}_{\alpha}$ is a Donsker class with square integrable envelope that contains $\widetilde{\ell}_{\widehat{\alpha}}$ with probability that tends to one.

Proof

We adapt the method used in Example $19.7$ from van der Vaart, (1998). Recall that a $\delta$ -bracket is a subset $[u_{1},u_{2}]$ of $\mathbb{L}^{2}(f^{+})$ such that $\|u_{2}-u_{1}\|_{\mathbb{L}^{2}(f^{+})}<\delta.$ The bracketing number $N\!\left(\delta,\mathcal{H}_{\alpha},\mathbb{L}^{2}(f^{+})\right)$ is the minimum number of $\delta$ -brackets needed to cover $\mathcal{H}_{\alpha}$ and the bracketing entropy is the logarithm of this quantity. To show that $\mathcal{H}_{\alpha}$ is Donsker, we establish the sufficient condition that the square root entropy integral

[TABLE]

is finite. (See Theorem $19.5$ in van der Vaart, (1998).)

We begin by establishing continuity properties of the parametric efficient score functions. From the differentiability of $R_{\theta}$ in $\theta$ and the expressions given in Proposition 3, it is evident that $\widetilde{\ell}_{\alpha}$ is always a differentiable function of $\alpha$ . Let us denote these derivatives by $\dot{\widetilde{\ell}}_{\alpha}$ . For $\alpha=q$ and $\alpha=\theta_{j}$ we can respectively compute these as

[TABLE]

and

[TABLE]

By inspection, we find that $\dot{\widetilde{\ell}}_{q}$ is always continuous itself, and that $\dot{\widetilde{\ell}}_{\theta_{j}}$ is also continuous provided that $\theta\mapsto R_{\theta}$ is in $\mathcal{C}^{2}$ and $R_{\theta}(x)\geq\eta>0$ for all $x\leq\tau$ , as assumed. These conditions also imply that $\dot{\widetilde{\ell}}_{\alpha}$ has a finite $\mathbb{L}^{2}(f^{+})$ -norm and that the functions $\widetilde{\ell}_{\alpha}$ are Lipschitz-continuous in $\alpha$ on $v(\alpha_{0})$ . We thus have a non-negative bounded $V$ such that

[TABLE]

Now, from this Lipschitz condition, it follows that if $|\alpha-\alpha_{1}|<\epsilon$ then $\widetilde{\ell}_{\alpha_{1}}-\epsilon V\leq\widetilde{\ell}_{\alpha}\leq\widetilde{\ell}_{\alpha_{1}}+\epsilon V.$ This means that we need as many $\epsilon$ -balls (a ball with radius $\epsilon/2$ ) to cover $v(\alpha_{0})$ as we need $\delta$ -brackets $(\delta=2\epsilon V)$ to cover $\mathcal{H}_{\alpha}.$ Since the number $n_{0}$ of $\epsilon$ -balls needed to cover $v(\alpha_{0})$ is such that

[TABLE]

with $C$ a constant depending only on $v(\alpha_{0})$ , it follows that the bracketing number is

[TABLE]

Thus the bracketing entropy is of order smaller than $\log(1/\delta)$ , whose square root is integrable near $0$. This establishes the sufficient condition of Equation (35), and thus $\mathcal{H}_{\alpha}$ is indeed Donsker.
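The integrability claim can be checked directly: the substitution $u=\log(1/\delta)$ turns $\int_{0}^{1}\sqrt{\log(1/\delta)}\,d\delta$ into $\int_{0}^{\infty}\sqrt{u}\,e^{-u}\,du=\Gamma(3/2)=\sqrt{\pi}/2$. The following minimal sketch compares this value with a crude numerical approximation; the helper name and the midpoint rule are illustrative choices, not part of the paper.

```python
import math


def sqrt_entropy_integral(n_cells=200_000):
    """Midpoint-rule approximation of the integral of sqrt(log(1/delta)) over (0, 1)."""
    h = 1.0 / n_cells
    return h * sum(
        math.sqrt(math.log(1.0 / ((i + 0.5) * h))) for i in range(n_cells)
    )


approx = sqrt_entropy_integral()
exact = math.gamma(1.5)  # Gamma(3/2), obtained via the substitution u = log(1/delta)
print(approx, exact)
```

The midpoint rule never evaluates the integrand at $\delta=0$, where it diverges (slowly enough to remain integrable), so no special handling of the endpoint is needed.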

To establish the remaining claims of the lemma, note that for all $\alpha$ in $v(\alpha_{0}),$ $\widetilde{\ell}_{\alpha}$ has a finite $\mathbb{L}^{2}(f^{+})$ -norm and that $|\widetilde{\ell}_{\alpha}(x)|\leq U$ for some $U<\infty$ , for all $x\geq 1$ . The boundedness of $\widetilde{\ell}_{\alpha}$ is obtained from the expression of $\widetilde{\ell}_{q}$ and $\widetilde{\ell}_{\theta_{j}}.$ The constant function $U$ is a square integrable envelope for $\mathcal{H}_{\alpha}.$ We use the continuity of the map $\alpha\mapsto\widetilde{\ell}_{\alpha}(x)$ and consistency of $\widehat{\alpha}$ to show that $\lim_{N\rightarrow\infty}\mathbb{P}[|\widetilde{\ell}_{\widehat{\alpha}}(x)-\widetilde{\ell}_{\alpha_{0}}(x)|>\epsilon]=0$ for all $x\geq 1$ . This proves that $\mathcal{H}_{\alpha}$ contains $\widetilde{\ell}_{\widehat{\alpha}}$ with probability that tends to one and the lemma holds. $\Box$

The result in Lemma 1 holds for $\mathcal{H}_{q}$ and $\mathcal{H}_{\theta_{j}}$ for all $j=1,\dots,k$ and thus also for their union $\mathcal{H}.$ We conclude that $\mathcal{H}$ is a Donsker class with square integrable envelope that contains $(\widetilde{\ell}_{\widehat{q}},\widetilde{\ell}_{\widehat{\theta}})$ with probability that tends to one.

Lemma 2

$\widehat{\theta}_{\tau}$ and $\widehat{q}_{\tau}$ solve the efficient score equations:

[TABLE]

Proof

Note that the efficient score equation $\sum_{i=1}^{D}\widetilde{\ell}_{q}(x_{i})=0$ leads to the equality

[TABLE]

whose solution is $\widehat{q}(\theta)$ given by

[TABLE]

Likewise, if one sets to zero all the partial derivatives of the logarithm of the likelihood in (13), one has

[TABLE]

This equality is equivalent to $\sum_{i=1}^{D}\widetilde{\ell}_{\theta}(x_{i})=0$ with $q$ replaced by $\widehat{q}(\theta)$ in $\widetilde{\ell}_{\theta}.$ The zero notation here refers to the null vector of $\mathbb{R}^{k}$ . $\Box$

We now prove the asymptotic efficiency of the estimators. Note that all results in this proof are stated under the restriction $\tau\leq\tau_{0}$ when necessary.

By Lemma 2, $\widehat{\theta}$ and $\widehat{q}(\widehat{\theta})$ are such that

[TABLE]

Since $\widetilde{\ell}_{\theta}$ and $\widetilde{\ell}_{q}$ do not depend on $F$ , and the same holds for the plug-in estimators $\widetilde{\ell}_{\widehat{\theta}}$ and $\widetilde{\ell}_{\widehat{q}}$ , it is not difficult to verify that

[TABLE]

The asymptotic efficiency of $(\widehat{q},\widehat{\theta})$ follows from Theorem $25.54$ in van der Vaart, (1998). As assumptions, this theorem needs the assertions of Theorem 1 (consistency) and Lemma 1 (Donsker property), in addition to the following two convergence properties pertaining to the “plug-in” score functions. In particular, we need to show that our estimators $(\widehat{q},\widehat{\theta})$ satisfy

[TABLE]

and

[TABLE]

where $f^{+}$ stands for $f_{(q_{0},\theta_{0},F)}^{+}$ , $\widehat{f}^{+}$ stands for the parametric plug-in $f_{(\widehat{q},\widehat{\theta},F)}^{+}$ , and $\widetilde{\ell}_{(q_{0},\theta_{0})}$ and $\widetilde{\ell}_{(\widehat{q},\widehat{\theta})}$ are the stacked vectors of $(k+1)$ components, $(\widetilde{\ell}_{q_{0}},\widetilde{\ell}_{\theta_{0}})$ and $(\widetilde{\ell}_{\widehat{q}},\widetilde{\ell}_{\widehat{\theta}})$ respectively.

To establish Equations (44) and (45), we can use for each parameter $\alpha$ the continuity properties of $\widetilde{\ell}_{\alpha}$ per component, as in the proof of Lemma 1. In particular, note first that for all $x>\tau_{0}$ , we have that $\widetilde{\ell}_{\alpha}(x)$ is constant. Therefore, for each parameter $\alpha$ we need only account for the convergence $\widetilde{\ell}_{\widehat{\alpha}}(x)\to\widetilde{\ell}_{\alpha}(x)$ for $x=1,\cdots,\tau_{0}+1$ , all of which hold in probability, by continuity.

It follows that for each $x$ , $\|\widetilde{\ell}_{(\widehat{q},\widehat{\theta})}(x)-\widetilde{\ell}_{(q_{0},\theta_{0})}(x)\|^{2}$ converges to $0$ in probability, and since we have only finitely many distinct values, the convergence is uniform for all $x$ . Equation (44) is thus immediate.

On the other hand, $\widehat{f}^{+}(x)\to f^{+}(x)$ in probability for each $x$ . By finiteness, it follows that $\sum_{x\leq\tau_{0}}\widehat{f}^{+}(x)\to\sum_{x\leq\tau_{0}}f^{+}(x)$ , and consequently $\sum_{x>\tau_{0}}\widehat{f}^{+}(x)\to\sum_{x>\tau_{0}}f^{+}(x)$ . By using once again the fact that $\widetilde{\ell}$ is constant beyond $\tau$ , the convergence reduces again to finitely many convergences, and thus $\sum_{x}\|\widetilde{\ell}_{(q_{0},\theta_{0})}(x)\|_{2}^{2}\widehat{f}^{+}(x)\to\sum_{x}\|\widetilde{\ell}_{(q_{0},\theta_{0})}(x)\|_{2}^{2}f^{+}(x)$ . We can therefore write:

[TABLE]

which completes the proof of Equation (45) and the theorem. $\Box$

Bibliography

Barger, Kathryn, & Bunge, John. 2008. Bayesian Estimation of the Number of Species using Noninformative Priors. Biometrical Journal, 50(6), 1064–1076.
Ben Hamou, Anna, Boucheron, Stephane, & Ohannessian, Mesrob I. 2017. Concentration Inequalities in the Infinite Urn Scheme for Occupancy Counts and the Missing Mass, with Applications. Bernoulli.
Böhning, Dankmar, & Schün, Deter. 2005. Nonparametric maximum likelihood estimation of population size based on the counting distribution. Journal of Royal Statistical Society, 54(4), 721–737.
Böhning, Dankmar, & van der Heijden, Peter G. M. 2009. A covariate adjustment for zero-truncated approaches to estimating the size of hidden and elusive populations. The Annals of Applied Statistics, 3(2), 595–610.
Böhning, Dankmar, Vidal-Diez, Alberto, Lerdsuwansri, Rattana, Viwatwongkasem, Chukiat, & Arnol, Mark. 2013. A generalization of Chao’s estimator for covariate information. Biometrics, 69(4), 1033–1042.
Bunge, John, & Barger, Kathryn. 2008. Parametric models for estimating the number of classes. Biometrical Journal, 50(6), 971–982.
Chao, A., & Bunge, J. 2002. Estimating the number of species in a stochastic abundance model. Biometrics, 58(September), 531–539.
Chao, Anne. 1984. Nonparametric estimation of the number of classes in a population. Scand J Statist, 11, 265–270.
