On asymptotically efficient maximum likelihood estimation of linear functionals in Laplace measurement error models
Catia Scricciolo

TL;DR
This paper investigates the conditions under which maximum likelihood estimation can efficiently estimate linear functionals in Laplace measurement error models, revealing fundamental limitations and the absence of regular estimators at parametric rates.
Contribution
It characterizes when the MLE achieves asymptotic efficiency for linear functionals in Laplace deconvolution and demonstrates the fundamental limitations in achieving parametric rates.
Findings
MLE can be asymptotically efficient for some functionals
No regular estimator can achieve parametric rate in general
Estimation often requires slower rates with non-Gaussian limits
Catia Scricciolo
Dipartimento di Scienze Economiche, Università degli Studi di Verona, Polo Universitario Santa Marta, Via Cantarane 24, I-37129 Verona (VR), ITALY, [email protected]
Abstract
Maximum likelihood estimation of linear functionals in the inverse problem of deconvolution is considered. Given observations of a random sample from a distribution indexed by a (potentially infinite-dimensional) parameter , which is the distribution of the latent variable in a standard additive Laplace measurement error model, one wants to estimate a linear functional of . Asymptotically efficient maximum likelihood estimation (MLE) of integral linear functionals of the mixing distribution in a convolution model with the Laplace kernel density is investigated. Situations are distinguished in which the functional of interest can be consistently estimated at -rate by the plug-in MLE, which is asymptotically normal and efficient, in the sense of achieving the variance lower bound, from those in which no integral linear functional can be estimated at parametric rate, which precludes any possibility for asymptotic efficiency. The -convergence of the MLE, valid in the case of a degenerate mixing distribution at a single location point, fails in general, as does asymptotic normality. It is shown that there exists no regular estimator sequence for integral linear functionals of the mixing distribution that, when recentered about the estimand and -rescaled, is asymptotically efficient, viz., has Gaussian limit distribution with minimum variance. One can thus only expect estimation with some slower rate and, often, with a non-Gaussian limit distribution.
Keywords:
Asymptotic efficiency Asymptotic normality Laplace convolution model Linear functionals Non-parametric maximum likelihood estimation
MSC:
62G05 62G20 62G30
1 Introduction
The problem of asymptotically efficient estimation of integral linear functionals of the distribution of the latent variable in a standard additive Laplace measurement error model is considered. The focus is on establishing whether asymptotic normality and efficiency hold for the estimator obtained by plugging into the functional of interest the NPMLE of the mixing distribution in a convolution model with the Laplace kernel density. We study the behaviour of the plug-in NPMLE to answer the question of whether there exist integral linear functionals of the mixing distribution that can be consistently estimated by the maximum likelihood method at -rate, the recentered and -rescaled version of the plug-in NPMLE being asymptotically normal with zero mean and minimum variance. Situations are distinguished in which the plug-in NPMLE is consistent at parametric rate and asymptotically efficient, albeit the mixing distribution itself can typically be estimated only at slower rates, from those in which there exists no regular sequence of estimators that can be asymptotically efficient. The model is described hereafter and the problem formally stated.
Model description
Let be a real-valued random variable (r.v.) with distribution defined, for every Borel set on the real line, by the mapping . Suppose that is dominated by Lebesgue measure on , with probability density function (p.d.f.) . Let satisfy the relationship
\[ X = Y + Z, \qquad (1) \]
where and are (stochastically) independent, unobservable random variables such that has unknown cumulative distribution function (c.d.f.) and has the standard classical Laplace [Footnote 1: It is also known as the first law of Laplace to distinguish it from the second law of Laplace, as the normal distribution is sometimes called. It was named after Pierre-Simon Laplace (1749–1827) who, in 1774 (Laplace 1774), obtained , for , as the density of the distribution whose likelihood is maximized when the location parameter is equal to the sample median.] or double exponential [Footnote 2: It is so called because it is formed by reflecting the exponential distribution around its mean.] distribution with scale parameter , in symbols, , whose density has expression
\[ f(z) = \frac{1}{2}\, e^{-|z|}, \quad z \in \mathbb{R}. \qquad (2) \]
The density is therefore the convolution of and or a location mixture of Laplace densities with mixing distribution supported on a subset , where stands for a support of , see, e.g., Billingsley (1995), p. 23,
[TABLE]
For ease of exposition, the density of the standard Laplace distribution is considered as a kernel, but the density of any Laplace distribution centered at zero, with known scale parameter , in symbols, , whose variance is , could be employed. [Footnote 3: To see that , one can take into account that, if and are independent r.v.’s, identically distributed as an exponential with parameter , in symbols, , , then has a distribution. Consequently, .] Assume that constitute a random sample from . Every then satisfies
\[ X_i = Y_i + Z_i, \quad i = 1, \ldots, n, \qquad (3) \]
where and are independent samples from the distributions with c.d.f. and p.d.f. , respectively. The r.v.’s are independent and identically distributed (i.i.d.) as and are independent copies of . Realizations of the noisy sample data are observed instead of outcomes of the uncorrupted r.v.’s . The r.v.’s represent additive errors and their distribution is called the error distribution. In this model, the variable of interest cannot be directly observed and empirical access is limited to the sum of and the “noise” . Therefore, estimating the distribution function of , or related quantities like the p.d.f. (if it exists), based on a sample from , amounts to solving a particular inverse problem, called deconvolution, which consists in reconstructing (estimating) from indirect noisy observations drawn from , the latter being the image of under a known transformation that has to be “inverted”. As remarked in Groeneboom and Wellner (1992), p. 4, as well as in Bolthausen et al. (2002), p. 363, the problem can be viewed as a missing data problem: the complete observations would consist of the independent pairs , with [Footnote 4: The symbol denotes the c.d.f. of , that is, , with density as in (2).], but part of the data is missing and only outcomes or realizations of the sums , viewed as transformations of the ’s through the function , are observed.
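The sampling scheme described above can be illustrated with a short simulation. The following is only a minimal sketch: the function name `simulate_laplace_error_model`, the `latent_sampler` argument and the choice of an exponential latent distribution in the usage lines are hypothetical and serve illustration only; the names X, Y, Z for the observed, latent and error variables are likewise labels chosen for this sketch.

```python
import numpy as np

def simulate_laplace_error_model(n, latent_sampler, scale=1.0, seed=0):
    """Draw n noisy observations X_i = Y_i + Z_i with Laplace(0, scale) errors.

    latent_sampler(rng, n) is a hypothetical callback returning n draws of the
    unobserved latent variable; the errors are drawn independently of it.
    """
    rng = np.random.default_rng(seed)
    y = latent_sampler(rng, n)                      # latent values (not observed in practice)
    z = rng.laplace(loc=0.0, scale=scale, size=n)   # additive Laplace measurement errors
    x = y + z                                       # what the statistician actually sees
    return x, y, z

# Usage: latent values from an exponential law (an assumption made for this demo).
x, y, z = simulate_laplace_error_model(
    1000, lambda rng, n: rng.exponential(scale=1.0, size=n), seed=1
)
print("first observations:", np.round(x[:5], 3))
```

In a real deconvolution problem only `x` would be available; `y` and `z` are returned here solely so that the missing-data structure of the problem can be inspected.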
The statistical model described by relationship (1), with a zero-mean Laplace measurement error r.v. independent of , is a special case of the classical error model , in which is a measurement of in the usual sense, has zero mean and is independent of , see, e.g., Buzas et al. (2005), p. 733, and the references therein. Measurement errors with possibly different structures occur in nearly every discipline from medical statistics to astronomy and econometrics, cf. the monographs of Fuller (1987) and Buonaccorsi (2010). Furthermore, the Laplace distribution finds applications in a variety of disciplines, from image and speech recognition to ocean engineering, see Kotz et al. (2001), chpts. 7–10, pp. 343–397. An application to quality control of the classical Laplace measurement error model is outlined hereafter.
Application to steam generator inspection
An application of the Laplace measurement error model to steam generator inspection and testing is described herein, see Easterling (1980) and Sollier (2017) for more details. Steam generators of pressurized water reactors contain many tubes through which heated water flows. For a variety of reasons, such as corrosion-induced wastage, the steam generator tube integrity can be degraded, the walls becoming thinned or cracked. Leaks may occur during normal operating conditions, thus requiring the plant to be shut down. In order to develop an inspection plan, a statistical model for tube degradation is considered. Experimental data indicate that measurements are affected by heavy-tailed and biased errors that can be represented by a r.v. following a Laplace distribution with mean and scale parameter , in symbols, , with density
[TABLE]
Denoted by the actual degradation (extent of thinning) of a tube, expressed as a percentage of the initial tube wall thickness, the measured degradation is modeled as
[TABLE]
where is supposed to be independent of . Assuming that the scale parameter is known and the distribution of possesses a probability density function, say , the interest is, in the first place, in estimating the p.d.f. of , based on i.i.d. observations drawn from the distribution of . An exponential distribution for , with scale parameter , in symbols, , whose density has expression
[TABLE]
provides an exponential-double exponential model for the actual degradation , which has proved to have an adequate fit on experimental data. Statistical procedures for fitness-for-service assessment are described in Carroll (2017).
Asymptotic efficiency of the NPMLE for linear functionals of the mixing distribution
For many purposes, interest can lie in only a few aspects of the distribution of , key features of which can be represented as linear functionals of . In what follows, symbols and will be used to indicate probability measures (p.m.’s) on , where denotes the Borel -field on , as well as the corresponding cumulative distribution functions, the correct meaning being clear from the context. Letting
[TABLE]
be the collection of all probability measures on , a functional is a mapping that maps every to a real number . The focus is on estimating integral linear functionals
[TABLE]
at the “point” , where the function is given. The following examples illustrate choices of for some common statistical functionals.
• Distribution function at a point. If, for some fixed , the function , then is the c.d.f. of at the point .
• Probability of an interval. If, for fixed points , the function , then is the probability of the interval .
• Mean. If is the identity function on , then is the expected value of which, for any kernel density (not necessarily the Laplace) with zero mean, , is equal to : in fact, from the relationship in (1), it follows that by linearity of the expected value.
• th moment. If, for any positive integer , the function , then is the th moment of .
• Moment generating function. If, for some fixed point such that , the mapping , then coincides with the moment generating function (m.g.f.) of at , denoted by or , that is, . Some features of the mixing distribution , like the mean or the variance, can be expressed in terms of the derivatives of the corresponding m.g.f. evaluated at zero. Therefore, in principle, results for estimating aspects of can be obtained as by-products of the inference on .
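For a discrete distribution, each of the functionals listed above reduces to a weighted sum, which may help fix ideas. The sketch below is illustrative only: the helper `linear_functional`, the three-point distribution and the evaluation points are hypothetical, and the half-open interval convention used for the interval probability is an assumption of this example.

```python
import numpy as np

def linear_functional(a, support, weights):
    """Integrate the function a against a discrete distribution (a weighted sum)."""
    support = np.asarray(support, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(a(support) * weights))

# Hypothetical three-point distribution used only for illustration.
support = np.array([0.0, 1.0, 3.0])
weights = np.array([0.5, 0.3, 0.2])

# Each functional corresponds to a different choice of the integrand a.
cdf_at_1 = linear_functional(lambda y: (y <= 1.0).astype(float), support, weights)
prob_interval = linear_functional(
    lambda y: ((y > 0.0) & (y <= 3.0)).astype(float), support, weights
)  # interval (0, 3], half-open by assumption
mean = linear_functional(lambda y: y, support, weights)
second_moment = linear_functional(lambda y: y**2, support, weights)
mgf_at_half = linear_functional(lambda y: np.exp(0.5 * y), support, weights)

print(cdf_at_1, prob_interval, mean, second_moment, mgf_at_half)
```

The same helper applies unchanged to any estimate of the mixing distribution that is itself discrete.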
A standard and principled method for pointwise estimation of linear functionals consists in plugging the [Footnote 5: Uniqueness of is not guaranteed; it is therefore with an abuse of language that we refer to the NPMLE throughout the article.] NPMLE of into to obtain the plug-in estimator
[TABLE]
A NPMLE of is a measurable function of the observations taking values in , which is not necessarily uniquely defined by the relationship
[TABLE]
equivalently written as
[TABLE]
where
[TABLE]
is the generic location mixture of Laplace densities with mixing distribution and
[TABLE]
is the empirical probability measure associated with the random sample , namely, the discrete uniform distribution on the sample values that puts mass on each one of the observations. In the sequel, for a measurable function , where is specified at the different occurrences, the notation is used to abbreviate the empirical average . Analogously, is used in lieu of . Hereafter, unless it is necessary within the context to specify the integral domain, integration is understood to be performed over the entire natural domain of the integrand. Throughout the article, the probability measure stands for , the joint law of the first coordinate projections of the infinite product probability measure . Sequences of random variables are meant to converge (in law or in probability) as the sample size grows indefinitely large (as ).
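To make the maximization defining a NPMLE more concrete, the following minimal sketch approximates it numerically under stated assumptions. It is not the procedure used in the article: the candidate support is restricted to a finite set of points (here the observed values, an assumption of this sketch) and the mixing weights are updated by a generic EM fixed-point iteration. The names `laplace_kernel`, `em_mixing_weights` and `log_likelihood` are hypothetical.

```python
import numpy as np

def laplace_kernel(x, support):
    """Matrix of standard Laplace densities 0.5 * exp(-|x_i - y_j|)."""
    return 0.5 * np.exp(-np.abs(x[:, None] - support[None, :]))

def log_likelihood(x, support, weights):
    """Log-likelihood of the location mixture with the given discrete mixing law."""
    return float(np.sum(np.log(laplace_kernel(x, support) @ weights)))

def em_mixing_weights(x, support, n_iter=500):
    """EM updates for the weights of a mixing distribution on a fixed support."""
    x = np.asarray(x, dtype=float)
    support = np.asarray(support, dtype=float)
    K = laplace_kernel(x, support)
    w = np.full(support.size, 1.0 / support.size)   # start from uniform weights
    for _ in range(n_iter):
        mix = K @ w                                  # mixture density at each observation
        w = w * (K / mix[:, None]).mean(axis=0)      # EM step; weights keep summing to one
    return w

# Usage on simulated data (two-point latent distribution, chosen for the demo).
rng = np.random.default_rng(0)
x = rng.choice([-2.0, 2.0], size=200) + rng.laplace(size=200)
support = np.unique(x)
w_hat = em_mixing_weights(x, support)
plug_in_mean = float(np.sum(support * w_hat))        # plug-in estimate of the mean functional
print("plug-in mean:", round(plug_in_mean, 3))
```

Each EM step cannot decrease the log-likelihood, so the iteration gives a monotone approximation to a maximizer over distributions supported on the chosen points; how close this is to a NPMLE over all distributions depends on the support restriction.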
Historical and conceptual background, overview of the results
The deconvolution problem has been intensively studied over the last thirty years. There exists a vast literature on density deconvolution, which accounts for reconstructing/estimating the density of (if it exists) that satisfies the equation
[TABLE]
wherein the kernel density (not necessarily a Laplace) is assumed to be known, based on outcomes of i.i.d. r.v.’s as in (3). We cite key articles of the late 1980s and early 1990s, such as Carroll and Hall (1988), Stefanski and Carroll (1990) and Fan (1991) [Footnote 6: For a recent reference list, see also Davidian et al. (2014).], which were ground-breaking and have had a great impact on the area of measurement error. They set the general framework for attacking measurement error/deconvolution problems and developed an approach based on Fourier inversion techniques to construct a deconvolution kernel density estimator for recovering the density of the latent distribution. They also showed how difficult it is to account for measurement errors: the smoother the error distribution, the stronger its confounding effect on the latent distribution and, hence, the slower the optimal attainable rate of convergence for its estimators.
Far less seems to be known about distribution function deconvolution; key contributions, also based on Fourier inversion techniques, are those of Hall and Lahiri (2008) and Dattner et al. (2011), the former article containing an illuminating critical analysis of the background to the problem of distribution estimation in deconvolution problems. Since the focus of this article is on the behaviour of the NPMLE of the mixing distribution , attention is hereafter restricted to reviewing the theory of non-parametric maximum likelihood estimation in deconvolution problems. In general mixture models, the NPMLE of is discrete, with at most support points, being the number of distinct values of the data points, cf. Lindsay (1983). In deconvolution problems with continuous and symmetric (about the origin) kernels decreasing on , a NPMLE always exists (uniqueness is not guaranteed), see Groeneboom and Wellner (1992), Lemma 2.1, pp. 57–58; for kernels that are also strictly convex on , like the (standard) Laplace, the NPMLE is supported on the set of observation points , so that the corresponding probability measure, still denoted by consistently with the notational convention adopted throughout, is concentrated on the range of the data points [Footnote 7: Following a common notational convention, we denote by the th order statistic.],
[TABLE]
see ibid., Corollary 2.2 and Corollary 2.3, p. 59 and p. 60, respectively. Consistency of at a continuous distribution function is proved in ibid., § 4.2, pp. 79–81. More is known about the one-parameter (location only) Laplace model, which can be viewed as a degenerate mixture with point mass mixing distribution at some fixed , the Dirac measure at . [Footnote 8: The Dirac measure at , denoted by , is defined on the Borel sets by if or , respectively.] A simple maximization argument to find a MLE of the location parameter , denoted by , is given in Norton (1984): the sample median is a MLE which is an M-estimator, see Huber (1967), solving the equation . [Footnote 9: The sign-function is defined as if , or , respectively.] Therefore, a MLE exists, but may not be unique: if is odd, that is, for some , then the sample median is uniquely defined as the middle observation , while, if is even, then there are two middle observations and so that, in principle, any value in the interval could be chosen, even if the canonical median , the average of the middle observations, is typically used in practice. Therefore,
[TABLE]
The sample median is a MLE and is asymptotically efficient, that is, consistent and, when recentered at and -rescaled, asymptotically normal with zero mean and variance equal to one, which is the information lower bound corresponding to the amount of information in a single observation,
[TABLE]
where “” denotes convergence in law. This has been established by Daniels (1961) who, motivated by the non-differentiability at zero of the (standard) Laplace density, proved a general theorem on the asymptotic efficiency of the MLE under conditions not involving the second and higher-order derivatives of the likelihood function. Even though, as later on noted by Huber (1967), a crucial step has been overlooked in Daniels’ proof, the assertion remains valid.
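The efficiency statement for the sample median in the location-only Laplace model can be checked by simulation. The sketch below is a rough Monte Carlo illustration, not a proof: the function name `scaled_variances`, the sample size, the number of replications and the true location value are all hypothetical choices. The sample mean is included only as a baseline, using the fact that the standard Laplace distribution has variance 2.

```python
import numpy as np

def scaled_variances(n=400, reps=4000, theta=1.5, seed=0):
    """Monte Carlo estimates of n * Var(median) and n * Var(mean).

    Data are i.i.d. standard Laplace errors shifted by the location theta.
    """
    rng = np.random.default_rng(seed)
    x = theta + rng.laplace(loc=0.0, scale=1.0, size=(reps, n))
    medians = np.median(x, axis=1)     # one maximum likelihood estimate per replication
    means = x.mean(axis=1)             # baseline estimator of the same location
    return n * medians.var(), n * means.var()

v_median, v_mean = scaled_variances()
print("n * Var(median):", round(v_median, 3))
print("n * Var(mean):  ", round(v_mean, 3))
```

For moderately large samples the first value should be close to the information bound of one, while the second should be close to two; finite-sample values of the median's scaled variance typically sit slightly above the limit.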
In this article, we study the behaviour of the plug-in NPMLE to answer the question of whether there exist integral linear functionals that can be consistently estimated by the maximum likelihood method at -rate and for which
[TABLE]
is asymptotically efficient, in the sense of Definition 2.8 in Bolthausen et al. (2002), p. 349, that is, asymptotically normal with zero mean and minimum variance. In fact, also in non-parametric problems estimation can be performed at -rate and, in general mixture models , where is any (not necessarily the Laplace) kernel density, there may exist linear functionals of that are estimable at parametric rate, even if itself can be pointwise estimated only at slower rates. Fundamental contributions developing the theory of information bounds are van der Vaart (1991), van der Vaart (1998), chpt. 25, pp. 358–432, Bolthausen et al. (2002), Part III, pp. 331–457, Groeneboom and Wellner (1992), with emphasis on non-parametric maximum likelihood estimation, and van de Geer (2000), chpt. 11, pp. 211–246, with a focus on asymptotic efficiency of the NPMLE in mixture models. To exemplify the issue, consider estimating the mean functional which, in a mixture model such that , is equal to . Then, the sample mean is a -consistent and, after -rescaling, asymptotically normal estimator of , but may not be a MLE; furthermore, it does not take into account the information that the sampling density is a mixture. On the other hand, the MLE may be -consistent and converge to a normal distribution with smaller variance than that of the sample mean. This is the case for the sample median in the single-parameter (location only) Laplace model, see van der Vaart (1998), Example 7.8 on location models, p. 96. Surprisingly, little is known in general about the asymptotic behaviour of the plug-in NPMLE for linear functionals in Laplace convolution mixtures, even if only for estimating the mean functional. Although the topic is useful, the existence of this gap can be partially explained by the fact that the Laplace or double-exponential distribution is not an exponential family model, so that standard results may not be valid or immediately available from the theory of exponential families.
In order to investigate whether integral linear functionals of the mixing distribution in a convolution model with the Laplace kernel density are estimable at -rate, we appeal to van der Vaart’s differentiability theorem, see Theorem 3.1 in van der Vaart (1991), p. 183, a general result that allows for a unified treatment of the information lower bound theory based on the concept of a differentiable functional, see, for a definition of the latter, display (2.2) in van der Vaart (1991), p. 180, or Definition 1.10 in Bolthausen et al. (2002), p. 343. The differentiability theorem characterizes differentiable functionals and, by combining the description of the set of differentiable functionals with a result stating that the existence of regular estimator sequences for a functional implies its differentiability, provides a way to distinguish situations in which the functional of interest is estimable at -rate from situations in which this is not the case, see van der Vaart (1998), p. 365, for a definition of a regular estimator sequence. A necessary and sufficient condition for differentiability of a (not necessarily linear) functional is that its gradients are contained in the range of the adjoint of the score operator, where the score operator can be viewed as a derivative (in quadratic mean) of the map , see (3.6) in van der Vaart (1991), p. 183, or (25.29) in § 25.5 of van der Vaart (1998), p. 372. As previously mentioned, differentiability is necessary for regular estimability of a functional or, equivalently, for the existence of regular estimator sequences, see Theorem 2.1 of van der Vaart (1991), p. 181, so that if the functional is not differentiable, then there exist no regular estimators and estimation at -rate is impossible. Interestingly, for real-valued functionals, the differentiability condition is equivalent to having positive efficient information, see Theorem 4.1 of van der Vaart (1991), pp. 186–187.
We find that the differentiability condition fails for integral linear functionals of the mixing distribution in a convolution model with the Laplace density, this implying that there exists no estimator sequence for that is regular at and estimation at -rate is impossible.
Organization
The rest of the article is organized as follows. The main results are presented in Sect. 2, which is split into two parts. In the first one, asymptotic efficiency of the plug-in NPMLE for integral linear functionals of the mixing distribution in a convolution model with the (one-sided) exponential kernel density is analysed and settled in the affirmative. Construction of interval estimators and tests based on a Studentized version of the plug-in NPMLE, when the asymptotic variance is consistently estimated, is revisited. Conditions for extending results to non-linear functionals are discussed as a side-issue. In the second part, the focus is on asymptotically efficient estimation by the plug-in NPMLE for integral linear functionals of the mixing distribution in a convolution model with the double-exponential (Laplace) kernel density. It is shown that, except for the case of a degenerate mixing distribution at a single location point, maximum likelihood estimation completely fails, in the sense that no integral linear functional can be estimated at -rate, which precludes any possibility for the NPMLE of being asymptotically efficient. Indeed, there exists no regular sequence of estimators for integral linear functionals of the mixing distribution that can be asymptotically efficient; therefore, estimation of linear functionals is impossible at -rate. Final remarks and comments are presented in Sect. 3. Proofs of the main results are deferred to the appendices: Appendix A reports the proof of the result for convolution with the exponential density, Appendix B reports the proof of the result for convolution with the Laplace density.
2 Main results
In this section, the main results of the article are presented. First, the case of a convolution model with an exponential kernel density is considered. Since, as previously noted, the Laplace density in (2) can be thought of as two exponential densities spliced together back-to-back, the positive half being a standard exponential density scaled by , it is reasonable to begin the analysis from the problem of asymptotically efficient non-parametric maximum likelihood estimation of linear functionals in a convolution model with the exponential kernel density. A preliminary study of the one-sided problem, beyond being of interest in itself, is useful to attack the two-sided one by partially reducing it to the previously solved case; it may furthermore provide insight for a better understanding of the reasons why symmetrization leads to a failure of asymptotically efficient estimation of linear functionals in the double-exponential (Laplace) case.
Convolution with the exponential density
In this paragraph, a standard exponential kernel density on is considered. This gives rise to a one-sided mixture density generating the data,
[TABLE]
where is assumed, without loss of generality, to be a proper, left-closed subset of the real line, and is the support of , with . Proposition 1 below establishes that, under sufficient conditions, an integral linear functional can be consistently estimated at -rate by the plug-in NPMLE , which, when recentered about the estimand and -rescaled, is asymptotically normal and efficient. In stating hereafter the assumptions, Newton’s notation (or the dot notation) for differentiation is adopted, that is, .
Assumptions
(A0) ,
(A1) ,
(A2) ,
(A3) (i) is continuous on ,
(ii) either is compact or is bounded on ,
(iii) ,
(A4) there exists on and ,
(A5) there exists a constant such that
[TABLE]
Some remarks and comments on the above listed assumptions are in order. Except for Assumption , which concerns the plug-in NPMLE , all other assumptions involve the function and/or the mixing distribution that jointly define the functional . Specifically, Assumption guarantees that is well defined. Assumption ensures the existence of the moment generating function of at the point . Assumption requires consistency of at . If converges weakly to in -probability, then parts (i) and (ii) of Assumption together imply Assumption because is continuous and bounded on . Sufficient conditions for to converge weakly to in the convolution model with the standard exponential kernel density on are stated in Groeneboom and Wellner (1992), p. 86; see also Theorem 2.3 of Chen (2017), p. 54, for sufficient conditions for strong consistency of in general mixture models. Part (ii) of Assumption postulates that either is a closed and bounded interval or is a right-unbounded interval and is bounded. Assumption requires to be differentiable and bounded on , which, in particular, entails that is right-differentiable at , that is, , [Footnote 10: The right-derivative of at , denoted by , is defined as the one-sided limit if it exists as a real number.] and, in the case where , also left-differentiable at , that is, . [Footnote 11: The left-derivative of at , denoted by , is defined as the one-sided limit if it exists as a real number.] Assumption plays its role in the proof of Proposition 1 when bounding the worst possible sub-directions, see the proof of Proposition 1 in Appendix A.
Proposition 1
Under Assumptions (A0)–(A5), we have
[TABLE]
where the mapping , defined as
[TABLE]
is the efficient influence function, whose squared -norm
[TABLE]
is the efficient asymptotic variance.
Proposition 1 establishes that, under sufficient conditions listed as Assumptions (A0)–(A5), cf. Lemma 4.6 of van de Geer (2003), p. 461, and van de Geer (2000), p. 231, the plug-in NPMLE consistently estimates at -rate; furthermore, when recentered at and -rescaled, it is asymptotically distributed as a zero-mean Gaussian, with variance attaining the lower bound given by the squared -norm of the efficient influence function, which plays here the same role as the normalized score function for the case of independent sampling from a parametric model , , ,
[TABLE]
where is the score function of the model and the Fisher information matrix for . In a parametric set-up, the minimum variance lower bound reduces to the Cramér-Rao bound, which states that the inverse of the Fisher information matrix is a lower bound on the variance of any -rescaled unbiased estimator of , in symbols, . Therefore, the counterpart of is, by symmetry of ( hence of) ,
[TABLE]
In general, considered a function that maps into , , and denoted by the derivative of , the matrix is a lower bound on the variance of any -rescaled unbiased estimator of .
Even if a statement of the result in Proposition 1 appears in Lemma 4.6 of van de Geer (2003), p. 461, as far as we are aware, a complete derivation of the assertion is not available in the literature, cf. also van de Geer (2000), p. 231, so the proof reported in Appendix A might prove helpful. The underlying idea is outlined hereafter. A NPMLE solves the likelihood equation for every path , with , starting at a fixed point corresponding to and direction such that , that is, for every parametric sub-model which passes (at ) through it. In symbols,
[TABLE]
For ease of notation, let
[TABLE]
Equation (8) reduces to , where is the score function (at ) in an “information loss model”, see, e.g., § 25.5.2 in van der Vaart (1998), pp. 374–375. If dominates , which, however, is seldom true, then so that . Asymptotic equicontinuity arguments then yield that because , namely, the score has zero mean. So, . Asymptotic normality follows. The reader is referred to Sect. 11.2 of van de Geer (2000), pp. 211–246, for a more comprehensive treatment of the topic, including the technical difficulties that cannot be given the necessary space here.
Remark 1
For simplicity, a convolution model with an exponential kernel density on having intensity has been considered (we warn the reader of the clash of notation with the symbol previously used to denote Lebesgue measure on the real line), but, as revealed by an inspection of the proof of Proposition 1, the assertion holds true for every .
Remark 2
Part (i) of Assumption (A3) requires to be continuous on , which is not true for indicator functions, therefore Proposition 1 does not apply to pointwise estimation of the c.d.f. nor to the estimation of the probability of an interval, so that it cannot be concluded that these functionals are estimable at -rate by the corresponding plug-in NPMLE’s. Indeed, can be pointwise estimated only at -rate, see Groeneboom and Wellner (1992), p. 121. Part (ii) of Assumption (A3) and Assumption (A4) require that both and are bounded on , which, for example, may not be true for the functions and that define the th moment and the moment generating function of at the point , respectively: in fact, both and , as well as their first derivatives, are continuous on the half-line , but not bounded therein. Nonetheless, boundedness can be retrieved by restriction to a compact domain. Therefore, if, besides Assumptions (A0)–(A2) and (A5), it also holds that has compact support, then, by Proposition 1, it can be concluded that and are consistently and efficiently estimated at -rate by their respective plug-in NPMLE’s.
Although Proposition 1 asserts that certain integral linear functionals can be consistently estimated at $\sqrt{n}$-rate by the plug-in NPMLE , which, when recentered at and $\sqrt{n}$-rescaled, is asymptotically normal and efficient, two kinds of problems may arise that can make it difficult to employ the result for statistical inference:
a) computation of the NPMLE ,
b) dependence of the variance on the unknown sampling distribution .
As for the former difficulty, although the NPMLE can be found by a one-step procedure computing the slope of the convex minorant of a certain function, cf. Groeneboom and Wellner (1992), pp. 62–63 (see also Vardi (1989) for a different approach), as a by-product of Theorem 11.8 of van de Geer (2000), p. 217, on which the assertion of Proposition 1 relies, the recentered and $\sqrt{n}$-rescaled plug-in NPMLE is asymptotically equivalent, up to an -error term, to the empirical average of the efficient influence function. This is an instance of the general fact that sequences of efficient estimators for functionals are asymptotically approximable by an empirical average of the efficient influence function, see, e.g., Lemma 2.9 in Bolthausen et al. (2002), p. 349. In fact, setting
[TABLE]
and noting that, from the definition of in (6), the term appearing in (5) can be written as , we have
[TABLE]
Thus, both and are asymptotically normal and efficient. Moreover, estimators arising from may coincide with simple naïve estimators. For example,
- if , from (9), for , we get the estimator , which is the one we would suggest considering that ;
- if for any fixed such that , then the estimator derived from (9), for , is , which is the one we would suggest taking into account that , where , , is the m.g.f. of a standard exponential r.v. . So, letting , , be the empirical m.g.f. for the random sample , it turns out that .
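The two naïve estimators described in the list above can be illustrated numerically. The following is only a minimal sketch under stated assumptions: the observations are taken to be of the form X = Y + Z with Z a standard exponential error independent of Y, so that E[X] = E[Y] + 1 and the m.g.f. of X at t < 1 equals the m.g.f. of Y divided by (1 - t). The function names are hypothetical and do not come from the article.

```python
import math
from statistics import fmean


def naive_mean_estimate(xs):
    """Estimate E[Y] by the sample mean of X minus E[Z] = 1.

    Assumption: X = Y + Z with Z standard exponential, independent of Y.
    """
    return fmean(xs) - 1.0


def naive_mgf_estimate(xs, t):
    """Estimate the m.g.f. of Y at t by (1 - t) times the empirical m.g.f. of X.

    Assumption: the m.g.f. of the standard exponential error is 1 / (1 - t),
    which is finite only for t < 1.
    """
    if t >= 1.0:
        raise ValueError("the standard exponential m.g.f. is finite only for t < 1")
    empirical_mgf = fmean(math.exp(t * x) for x in xs)
    return (1.0 - t) * empirical_mgf
```

Both functions are plain empirical averages of a transformation of the observations, which is what makes them attractive compared with computing the NPMLE itself.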
As for the difficulty listed in b), by the plug-in approach, replacing the asymptotic variance with a consistent estimator leads to the following assertion.
Corollary 1
Under the conditions of Proposition 1, if, in addition, , where “” denotes convergence in -probability, then
[TABLE]
Once the efficient asymptotic variance in (7) has been replaced with a consistent sequence of estimators, asymptotic normality of the Studentized version of allows one to carry out pointwise inference on linear functionals by interval estimation or hypothesis testing, that is, by constructing confidence intervals or tests, respectively. For every , let be the -quantile of a standard normal distribution, i.e., , where stands for the c.d.f. of a standard normal. Then,
[TABLE]
is an approximate -level confidence interval for .
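As an illustration of how such an interval might be computed in practice, the sketch below builds a two-sided Wald-type interval from a point estimate and a consistent estimate of the asymptotic variance. The function name, the argument names and the two-sided quantile convention (the normal quantile of order 1 - alpha/2) are assumptions made here for concreteness; they are not taken from the article.

```python
import math
from statistics import NormalDist


def wald_interval(estimate, variance_hat, n, alpha=0.05):
    """Approximate (1 - alpha)-level interval: estimate +/- z * sqrt(variance_hat / n).

    `variance_hat` is assumed to be a consistent estimate of the asymptotic
    variance of the sqrt(n)-rescaled estimator, and `n` is the sample size.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # standard normal quantile
    half_width = z * math.sqrt(variance_hat / n)
    return estimate - half_width, estimate + half_width
```

For instance, `wald_interval(2.0, 2.5, 5)` returns an interval centered at 2.0 whose half-width is the 0.975 normal quantile times the square root of 0.5.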
Remark 3
Asymptotic normality of the plug-in NPMLE for linear functionals of the mixing distribution can be employed to establish asymptotic normality for non-linear functionals. Suppose, for instance, that is defined as
[TABLE]
where the function has non-zero derivative at denoted by . Asymptotic normality of then follows from asymptotic normality of by the delta method, see, e.g., Chap. 3 in van der Vaart (1998), pp. 25–34. So, if the convergence in (5) takes place, then
[TABLE]
where efficiency of carries over into efficiency of , see ibid., p. 386, for details.
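The delta-method step can be sketched as follows. This is a generic illustration, not the article's procedure: the transformation `g`, its derivative `g_prime`, and the function name are supplied by the user and are hypothetical. The asymptotic variance of the transformed estimator is the squared derivative at the estimate times the variance of the linear-functional estimator.

```python
def delta_method(g, g_prime, estimate, variance_hat):
    """Return (g(estimate), estimated asymptotic variance of g(estimator)).

    `variance_hat` is an estimate of the asymptotic variance of the
    sqrt(n)-rescaled estimator of the linear functional. A zero derivative
    is rejected because the first-order delta method is then degenerate.
    """
    slope = g_prime(estimate)
    if slope == 0.0:
        raise ValueError("derivative is zero: first-order delta method does not apply")
    return g(estimate), slope ** 2 * variance_hat
```

The returned variance can be passed to any interval construction based on asymptotic normality.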
Alternatively, define
[TABLE]
under the condition
[TABLE]
which requires that, in probability, behaves asymptotically as , so that, after $\sqrt{n}$-rescaling, the two differences have the same limiting distribution. In fact, if the convergence in (5) takes place, then Slutsky’s lemma implies that
[TABLE]
see also the Remark of van de Geer (2000) on p. 223.
Convolution with the double-exponential (Laplace) density
This section considers the case of main interest of the article: asymptotically efficient maximum likelihood estimation of linear functionals of the mixing distribution in a convolution model with the (standard) Laplace kernel density. As recalled in Sect. 1, for a one-parameter (location only) Laplace model, the sample median is an MLE, consistent and asymptotically efficient, even if, for small sample sizes, it may not be the best estimator to use because there exist other unbiased estimators with smaller variances, which are therefore more efficient, see, e.g., Remark 2.6.2 in Kotz et al. (2001), p. 82. More precisely, for a sample of odd size from a general distribution, the variance of is equal to
[TABLE]
where is the density of a standard Laplace distribution as defined in (2), while the asymptotic variance is equal to
[TABLE]
It is worth observing that the sample mean is also asymptotically normal with mean , but the asymptotic relative efficiency (ARE) of the median to the mean, namely, the ratio of the variance of the sample mean to the asymptotic variance of the sample median, equals $2$:
[TABLE]
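The comparison between the sample median and the sample mean can be checked with a short computation. The sketch below assumes a standard Laplace distribution with density exp(-|x|)/2, whose variance is 2 and whose density at the median is 1/2, so that the asymptotic variance of the sqrt(n)-rescaled median is 1/(4 f(0)^2) = 1. The Monte Carlo part is only indicative: for moderate sample sizes the finite-sample ratio typically falls somewhat below the asymptotic value. All function names are hypothetical.

```python
import random
from statistics import fmean, median, pvariance


def theoretical_are():
    """ARE of the median to the mean for the standard Laplace distribution."""
    variance = 2.0                 # variance of the standard Laplace law
    density_at_median = 0.5        # f(0) = exp(-0) / 2
    median_asymptotic_variance = 1.0 / (4.0 * density_at_median ** 2)
    return variance / median_asymptotic_variance


def laplace_sample(n, rng):
    """Draw n standard Laplace variates as differences of two Exp(1) variates."""
    return [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)]


def simulated_variance_ratio(n=101, reps=2000, seed=0):
    """Monte Carlo ratio Var(sample mean) / Var(sample median) for odd n."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        xs = laplace_sample(n, rng)
        means.append(fmean(xs))
        medians.append(median(xs))
    return pvariance(means) / pvariance(medians)
```

Calling `theoretical_are()` reproduces the asymptotic value, while `simulated_variance_ratio()` gives a finite-sample counterpart that depends on the chosen sample size, number of replications and seed.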
On a side note, we recall that, for any function differentiable at , with derivative , the plug-in MLE is also asymptotically efficient, with
[TABLE]
see, e.g., Lehmann and Casella (1998), p. 440.
In what follows, we aim at giving results on asymptotically efficient maximum likelihood estimation of linear functionals of the mixing distribution, beyond the case of a degenerate mixing distribution localized at a point on the real line. As recalled in Sect. 1, in the deconvolution problem with the Laplace kernel density, a NPMLE always exists and consistency at a continuous distribution function holds, but little is known about the asymptotic behaviour of the plug-in NPMLE for linear functionals. The following proposition states that, except for the above recalled degenerate case, estimation of integral linear functionals at $\sqrt{n}$-rate is impossible.
Proposition 2
Let be a non-degenerate probability measure supported on . Let be any integral linear functional evaluated at . Then, there exists no estimator sequence for that is regular at .
Some comments on Proposition 2, whose proof is deferred to Appendix B, are in order. It states that no integral linear functional is estimable at parametric rate, in particular, by the plug-in NPMLE . One can thus expect estimation, performed by any method, only at slower rates and, possibly, with a non-Gaussian limiting distribution, even if the theorem we invoke to establish Proposition 2 does not give any indication about which rates to expect when estimation at $\sqrt{n}$-rate fails, an issue that requires further investigation. A related open question concerns the possible extension of the negative result of Proposition 2 to convolution models with general kernel densities that are symmetric about zero, but not differentiable at it, a feature that seems to play a crucial role in causing failure of estimation at parametric rate. To sum up, only in the case of a degenerate mixing distribution at a point , the MLE is asymptotically efficient for the location parameter and the plug-in MLE is asymptotically efficient for any , with differentiable at .
3 Final remarks
In this article, we have studied asymptotically efficient maximum likelihood estimation of linear functionals of the mixing distribution in a standard additive measurement error model, when the error has either the exponential or Laplace distribution. In the former case, the plug-in NPMLE of certain linear functionals is $\sqrt{n}$-consistent, asymptotically normal, efficient and equivalent to naïve estimators that are empirical averages of a given transformation of the observations. In the latter case, instead, even though the kernel is generated by symmetrization about the origin of the exponential density, and leaving aside the degenerate case of a single Laplace model in which the MLE, the sample median, is asymptotically efficient for the location parameter, asymptotically efficient estimation of linear functionals completely fails, in the sense that estimation at $\sqrt{n}$-rate is impossible for linear functionals of non-degenerate mixing distributions. An open question then is whether this negative result extends to general kernel densities symmetric about zero, but not differentiable at zero, a feature that seems to play a crucial role in causing the failure.
Appendix A
In this section, we present the proof of Proposition 1 on the asymptotic efficiency of the plug-in NPMLE for integral linear functionals of the mixing distribution in a convolution model with the exponential kernel density on .
Proof of Proposition 1
We appeal to Theorem 2.1 of van de Geer (1997), p. 21 (see also Theorem 11.8 of van de Geer (2000), pp. 217–220, for a slightly more general version) and, in showing that Conditions 1–4 are satisfied, we follow the guidelines given in Sect. 3, ibid., pp. 24–27.
Verification of Condition 1. (Consistency and rates).
Under Assumption that , the MLE converges in the Hellinger distance , defined as the $L^2$-distance between the square-root densities, at the rate . In symbols, for ,
[TABLE]
The result can be obtained by applying Theorem 7.4 in van de Geer (2000), pp. 99–100, see also ibid., p. 124. As a consequence, see, e.g., Corollary 7.5, ibid., p. 100,
[TABLE]
where . Consistency of is guaranteed by Assumption .
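To make the Hellinger distance used in Condition 1 concrete, the sketch below evaluates it numerically for two mixtures of shifted standard exponential densities. Everything in it is an illustrative assumption rather than the article's construction: the mixing distributions are taken to be discrete (atoms with weights), the kernel is exp(-(x - y)) for x >= y, the distance is computed without a normalizing factor 1/2 (in line with the description as the distance between square-root densities), and the integral is approximated by a midpoint rule on a truncated grid.

```python
import math


def mixture_density(x, atoms, weights):
    """Density at x of a discrete mixture of exponential densities shifted to the atoms."""
    return sum(w * math.exp(-(x - y)) for y, w in zip(atoms, weights) if x >= y)


def hellinger_distance(atoms_f, weights_f, atoms_g, weights_g,
                       upper=40.0, steps=100_000):
    """Midpoint-rule approximation of the L2-distance between square-root densities.

    The integration range [0, upper] assumes non-negative atoms and a
    negligible tail mass beyond `upper`.
    """
    width = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * width
        p = mixture_density(x, atoms_f, weights_f)
        q = mixture_density(x, atoms_g, weights_g)
        total += (math.sqrt(p) - math.sqrt(q)) ** 2 * width
    return math.sqrt(total)
```

For two point masses at 0 and 1 the squared distance has the closed form 2 - 2 exp(-1/2), which gives a quick sanity check on the numerical routine.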
Verification of Condition 2. (Existence of the worst possible sub-directions and efficient influence functions. Differentiability of in a neighborhood of ).
For real numbers and , let
[TABLE]
be a Hellinger-type ball centered at with radius . For every and , let . We prove
- (a) existence of the worst possible sub-directions such that and ;
- (b) existence of the efficient influence functions , where ;
- (c) differentiability of at :
[TABLE]
where .
For every , we prove the existence of such that the corresponding satisfies for -almost all ’s. We proceed by first deriving the expression of as a solution of (11) and then proving the existence of the corresponding worst possible sub-direction as required in (a) and (b). The function has to satisfy
[TABLE]
where , for -almost all ’s. Differentiating both sides of (12) with respect to , we get
[TABLE]
Using constraint (12) in (13), we obtain that , whence . The solution is unique up to sets of -measure zero. By an extension to , is then defined as in (6).
(a) Existence of such that .
Recall that
[TABLE]
Defining the function as
[TABLE]
integration by parts yields that
[TABLE]
because and by part (iii) of Assumption combined with the fact that . Analogously, since ,
[TABLE]
Then, defining the mapping as
[TABLE]
by previous computations, we have that
[TABLE]
To check that has expected value , it suffices to note that, by applying the tower rule twice and using equalities (15) and (12),
[TABLE]
Next, we show that, for every and ,
[TABLE]
which implies that . Noting that
[TABLE]
we can rewrite in (14) as
[TABLE]
To conclude that is bounded on , we observe two facts. First,
[TABLE]
where by Assumption and by parts (i) and (ii) of Assumption . Second, for every and ,
[TABLE]
where exists because dominates . The bound in (17) holds uniformly over . Therefore,
[TABLE]
by Assumptions , , – and the fact that is bounded by a constant on .
(b)–(c) Definition of and differentiability of at .
The function defined in (6), which solves equation (12), is such that for every , by virtue of (15).
Verification of Condition 3. (Control on the worst possible sub-directions ).
Recall that in (Proof of Proposition 1) is bounded by on . Besides, the factor , which diverges to as , is counterbalanced by . There thus exists a positive constant such that
[TABLE]
Verification of Condition 4. (Control on the efficient influence functions ).
The information for estimating is positive and finite, . Also, the influence functions are uniformly bounded. In fact, for every , we have so that
[TABLE]
by Assumptions , (parts (i) and (ii)) and .
Next, to show that relationships (2.10) and (2.11) in van de Geer (1997), p. 21, are satisfied, we follow the reasoning illustrated in Sect. 3.4, ibid., pp. 26–27, and check that, for some positive sequence ,
[TABLE]
where is the set obtained from in (10) by replacing with . Note that . Using integration by parts, together with conditions (i) and (ii) of Assumption , which jointly guarantee that is bounded on , as well as the fact that every has the same support as , we find that . The latter integral can be bounded above by applying inequality (30) in Scricciolo (2018), p. 358, which relates the $L^1$-Wasserstein or Kantorovich distance between distribution functions and to the Hellinger distance between the corresponding mixtures (of exponential densities) ,
[TABLE]
where “” indicates inequality valid up to a constant multiple that is universal or fixed within the context, but anyway inessential for our purposes because the bound is uniform over . The inequality is obtained by setting and , the latter value being determined by condition (29), ibid., p. 358, on the Fourier transform of a standard exponential density. By Assumption , which guarantees that is bounded on , and inequality (20), we have
[TABLE]
where because on . The limit in (19) follows.
It remains to check that, for the collection of functions , the bracketing integral
[TABLE]
where is the -bracketing number of for the -metric, namely, the smallest number of -brackets needed to cover , see, e.g., 2.1.6 Definition (Bracketing numbers) in van der Vaart and Wellner (1996), p. 83, or Definition 2.2 in van de Geer (2000), p. 16. Under Assumption (parts (i) and (ii)) and Assumption , by the same arguments as before, the -distance between the lower and upper functions and of every bracket can be bounded above as follows:
[TABLE]
By 2.7.5 Theorem in van der Vaart and Wellner (1996), pp. 159–162, the bracketing entropy of the class of all uniformly bounded, monotone functions on the real line is of the order . Therefore, and the integral in (21) is finite. The proof of Condition 4 is thus complete.
The conclusion of Theorem 2.1 follows:
[TABLE]
where has expected value , as it can be deduced from (Proof of Proposition 1) when . Hence,
[TABLE]
and the proof is complete. ∎
Appendix B
In this section, we present the proof of Proposition 2 which states that no integral linear functional of a non-degenerate mixing distribution in a convolution model with the Laplace kernel density is estimable at parametric rate, in particular, by the maximum likelihood method.
Proof of Proposition 2
We let, at the outset, be any integral linear functional, as defined in (4), evaluated at the “point” . Arguments are laid down to identify functions (if any) whose corresponding functionals are estimable at $\sqrt{n}$-rate. To this end, we appeal to van der Vaart’s differentiability theorem, which provides a necessary and sufficient condition for pathwise differentiability of a (not necessarily linear) functional, see Theorem 3.1, Corollaries 3.2, 3.3 and Lemma 3.4 of van der Vaart (1991), pp. 183–185, or Theorem 3.1, Corollaries 3.1, 3.2 and Proposition 3.1 in Groeneboom and Wellner (1992), pp. 24–28. If differentiability of a functional fails, then, by Theorem 2.1 of van der Vaart (1991), p. 181, the functional is not estimable at $\sqrt{n}$-rate, see also Chap. 25 in van der Vaart (1998), pp. 358–432. A necessary and sufficient condition for differentiability of an integral linear functional is that, for , there exists a function , with , satisfying
[TABLE]
explicitly,
[TABLE]
where the conditional density of , given , is , see § 7 in van der Vaart (1991), pp. 189–191, or Example 3.2 in Groeneboom and Wellner (1992), pp. 30–31. If an integral linear functional is regularly estimable, then the condition in (22) must necessarily be satisfied and a regular estimator for is given by . The following arguments are aimed at deriving the expression of . Let be fixed. For a function such that
[TABLE]
where, in the case when is bounded, (hence its derivative ) is taken to be identically equal to zero on so that the limit is automatically verified, integration by parts yields that
[TABLE]
whence
[TABLE]
The integral analogous to the one on the left-hand side of (Proof of Proposition 2), but with the right branch of the Laplace density, can be dealt with similarly. For some satisfying the limit in (23) and also
[TABLE]
where the same proviso on and applies for the case when is bounded, we get
[TABLE]
whence
[TABLE]
Summing side by side (25) and (26) and subtracting on both sides of the resulting equation, we obtain
[TABLE]
In order to get rid of the dependence of the function on , the derivative must be equal to zero, which means that is identically equal to a constant on and the functional is trivially equal to the constant itself. We conclude that there exists no integral linear functional of a non-degenerate mixing distribution that can be estimated at $\sqrt{n}$-rate. This completes the proof. ∎
References
Billingsley P (1995) Probability and measure, 3rd ed. Wiley, New York
Bolthausen E, Perkins E, van der Vaart A (2002) Lectures on probability theory and statistics. Ecole d’Eté de Probabilités de Saint-Flour XXIX – 1999. Bernard P (ed) Lecture Notes in Mathematics, Vol 1781. Springer-Verlag, Berlin, pp 331–457
Buonaccorsi JP (2010) Measurement error: models, methods, and applications. Chapman & Hall/CRC Press, Boca Raton, FL
Buzas JS, Stefanski LA, Tosteson TD (2005) Measurement error. In: Ahrens W, Pigeot I (eds) Handbook of epidemiology. Springer-Verlag, Berlin, Heidelberg, pp 729–765
Carroll LB (2017) Nuclear steam generator fitness-for-service assessment. In: Riznic J (ed) Steam generators for nuclear power plants. Woodhead Publishing, pp 511–523
Carroll RJ, Hall P (1988) Optimal rates of convergence for deconvolving a density. J Amer Statist Assoc 83:1184–1186
Chen J (2017) Consistency of the MLE under mixture models. Stat Sci 32:47–63
Daniels HE (1961) The asymptotic efficiency of a maximum likelihood estimator. In: Proc Fourth Berkeley Symp on Math Statist and Prob, Vol 1. Univ of Calif Press, pp 151–163
Dattner I, Goldenshluger A, Juditsky A (2011) On deconvolution of distribution functions. Ann Stat 39:2477–2501
Davidian M, Lin X, Morris JS, Stefanski LA (2014) The work of Raymond J. Carroll: The impact and influence of a statistician. Springer International Publishing, Switzerland
Easterling RG (1980) Statistical analysis of steam generator inspection plans and eddy current testing. Washington, DC: Division of Operating Reactors, Office of Nuclear Reactor Regulation, US Nuclear Regulatory Commission
Fan J (1991) On the optimal rates of convergence for nonparametric deconvolution problems. Ann Stat 19:1257–1272
Fuller WA (1987) Measurement error models. John Wiley, New York
Groeneboom P, Wellner JA (1992) Information bounds and nonparametric maximum likelihood estimation. Birkhäuser, Basel
Hall P, Lahiri SN (2008) Estimation of distributions, moments and quantiles in deconvolution problems. Ann Stat 36:2110–2134
Huber PJ (1967) The behavior of maximum likelihood estimates under nonstandard conditions. In: Proc Fifth Berkeley Symp on Math Statist and Prob, Vol 1. Univ of Calif Press, pp 221–233
Kotz S, Kozubowski TJ, Podgórski K (2001) The Laplace distribution and generalizations: a revisit with applications to communications, economics, engineering, and finance. Birkhäuser, Boston
Laplace P-S (1774) Mémoire sur la probabilité des causes par les événements. Mém Acad Roy Sci Paris (Savants étrangers) Tome VI:621–656
Lehmann EL, Casella G (1998) Theory of point estimation, 2nd ed. Springer-Verlag, New York
Lindsay BG (1983) The geometry of mixture likelihoods: a general theory. Ann Stat 11:86–94
Norton RM (1984) The double exponential distribution: using calculus to find a maximum likelihood estimator. Am Stat 38:135–136
Scricciolo C (2018) Bayes and maximum likelihood for $L^1$-Wasserstein deconvolution of Laplace mixtures. Stat Methods Appl 27:333–362
Sollier T (2017) Nuclear steam generator inspection and testing. In: Riznic J (ed) Steam generators for nuclear power plants. Woodhead Publishing, pp 471–493
Stefanski L, Carroll RJ (1990) Deconvoluting kernel density estimators. Statistics 21:169–184
van de Geer S (1997) Asymptotic normality in mixture models. ESAIM Probab Stat 1:17–33
van de Geer SA (2000) Empirical processes in M-estimation. Cambridge University Press, Cambridge
van de Geer S (2003) Asymptotic theory for maximum likelihood in nonparametric mixture models. Comput Stat Data An 41:453–464
van der Vaart A (1991) On differentiable functionals. Ann Stat 19:178–204
van der Vaart AW (1998) Asymptotic statistics. Cambridge University Press, Cambridge
van der Vaart AW, Wellner JA (1996) Weak convergence and empirical processes. Springer-Verlag, New York
Vardi Y (1989) Multiplicative censoring, renewal processes, deconvolution and decreasing density: nonparametric estimation. Biometrika 76:751–761
