Density deconvolution under general assumptions on the distribution of measurement errors
Denis Belomestny, Alexander Goldenshluger

TL;DR
This paper develops a flexible method for density deconvolution that works under broad conditions on measurement error distributions, including cases with zeros in their characteristic functions, improving estimation robustness.
Contribution
It introduces a novel approach for density deconvolution that handles general error distributions, relaxing the common zero-free characteristic function assumption.
Findings
Derived upper bounds on estimator risk.
Provided conditions where zeros in characteristic functions do not affect accuracy.
Showed conditions are necessary in certain cases.
Duisburg-Essen University, Higher School of Economics and University of Haifa
Faculty of Mathematics
Duisburg-Essen University
Thea-Leymann-Str. 9
D-45127 Essen
Germany
Department of Statistics
University of Haifa
Haifa 3498838
Israel
National University
Higher School of Economics
11 Pokrovsky Bulvar, Pokrovka Complex
Moscow, Russia
Abstract
In this paper we study the problem of density deconvolution under general assumptions on the measurement error distribution. Typically deconvolution estimators are constructed using Fourier transform techniques, and it is assumed that the characteristic function of the measurement errors does not have zeros on the real line. This assumption is rather strong and is not fulfilled in many cases of interest. In this paper we develop a methodology for constructing optimal density deconvolution estimators in the general setting that covers vanishing and non–vanishing characteristic functions of the measurement errors. We derive upper bounds on the risk of the proposed estimators and provide sufficient conditions under which zeros of the corresponding characteristic function have no effect on estimation accuracy. Moreover, we show that the derived conditions are also necessary in some specific problem instances.
MSC classification: 62G07, 62G20
Keywords: density deconvolution, minimax risk, characteristic function, Laplace transform, lower bounds, density estimation
1 Introduction
1.1 Problem formulation and background
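The problem of density deconvolution is formulated as follows. Suppose that we observe a random sample $Y_1,\ldots,Y_n$ generated from the model
$$Y_j = X_j + \epsilon_j, \qquad j=1,\ldots,n,$$
where $X_1,\ldots,X_n$ are i.i.d. random variables with density $f$, and the measurement errors $\epsilon_1,\ldots,\epsilon_n$ are i.i.d. random variables with known distribution $G$. Furthermore, assume that $\epsilon_1,\ldots,\epsilon_n$ are independent of $X_1,\ldots,X_n$. Then the probability density $f_Y$ of $Y_1$ is given by convolution
$$f_Y(y) = \int_{-\infty}^{\infty} f(y-x)\,\mathrm{d}G(x).$$
The goal is to estimate $f$ from the observations $Y_1,\ldots,Y_n$.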
An estimator $\hat f$ of $f$ is a measurable function of the observations $Y_1,\ldots,Y_n$, and the accuracy of $\hat f$ is measured by the maximal risk
[TABLE]
where $\Delta$ is a loss function, $\mathcal{F}$ is a class of density functions, and $\mathbb{E}_f$ is the expectation with respect to the probability measure of the observations $Y_1,\ldots,Y_n$ when the density of $X_1$ is $f$. In this paper we will be interested in estimating $f$ at a single point $x_0\in\mathbb{R}$ and in the $\mathbb{L}_2$–norm; this corresponds to the loss functions $\Delta_{x_0}(f_1,f_2):=|f_1(x_0)-f_2(x_0)|$ and $\Delta_{2}(f_{1},f_{2}):=\|f_{1}-f_{2}\|_{2}=\big\{\int_{-\infty}^{\infty}|f_{1}(x)-f_{2}(x)|^{2}\,\mathrm{d}x\big\}^{1/2}$, respectively. The minimax risk is then defined by
[TABLE]
where the infimum is taken over all possible estimators. An estimator is called rate–optimal if its maximal risk is of the same order as the minimax risk as $n\to\infty$, and our goal is to construct rate–optimal estimators for natural functional classes of densities.
The problem of density deconvolution has been extensively studied in the literature; see, e.g., Carroll & Hall (1988), Zhang (1990), Fan (1991), Butucea & Tsybakov (2008a, 2008b), Lounici & Nickl (2011), Comte & Lacour (2013) and Lepski & Willer (2019). We also refer to the book of Meister (2009), where many additional references can be found.
Deconvolution estimators are usually constructed using Fourier transform techniques, and the majority of results in the existing literature assume that the characteristic function of the measurement errors has no zeros on the real line. Specifically, let $\hat g$ denote the bilateral Laplace transform of $G$,
$$\hat g(z) := \int_{-\infty}^{\infty} e^{-zx}\,\mathrm{d}G(x),$$
with $\hat g(i\omega)$, $\omega\in\mathbb{R}$, being the characteristic function of the measurement errors. The standard assumptions on $\hat g$ in the density deconvolution problem are the following:
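- (A)
$\hat g(i\omega)$ does not vanish, that is, $|\hat g(i\omega)|\neq 0$ for all $\omega\in\mathbb{R}$;
- (B)
$|\hat g(i\omega)|$ decreases in an appropriate way as $|\omega|\to\infty$: for some $\gamma>0$
- (B1)
$|\hat g(i\omega)|\asymp|\omega|^{-\gamma}$ as $|\omega|\to\infty$, or
- (B2)
$|\hat g(i\omega)|\asymp\exp\{-c|\omega|^{\gamma}\}$ as $|\omega|\to\infty$ with $c>0$.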
The setting under conditions (A)–(B1) is usually referred to as the case of smooth measurement error densities, while conditions (A)–(B2) correspond to the so–called super–smooth case. Under assumption (A) the achievable estimation accuracy is determined by the rate at which $|\hat g(i\omega)|$ decreases as $|\omega|\to\infty$, and by the smoothness of the density $f$. In particular, it is well known that in the smooth case, for the Hölder class and for the Sobolev class of regularity $\alpha$, one has
[TABLE]
as $n\to\infty$; see, e.g., Zhang (1990) and Fan (1991). The definitions of the Hölder and Sobolev classes are presented in Section 5. In what follows we will refer to the rate $n^{-\alpha/(2\alpha+2\gamma+1)}$ as the standard rate of convergence.
It is worth noting that condition (A) is rather restrictive and excludes many settings of interest. This condition does not hold if the distribution of the measurement errors is compactly supported. For instance, if $g$ is a uniform density on $[-1,1]$ then $\hat g(i\omega)=\sin\omega/\omega$, and $\hat g(i\omega)$ vanishes at $\omega=\pi k$, $k=\pm 1,\pm 2,\ldots$. Another typical situation in which condition (A) is violated is the case of measurement errors having discrete distributions. In general, if $\hat g(i\omega)$ has zeros, the standard Fourier–transform–based estimation methods are not directly applicable. This fact raises the following natural questions.
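The vanishing of the uniform characteristic function can be checked numerically. The following is a minimal sketch, assuming errors uniform on $[-\theta,\theta]$ with an illustrative half-width `theta` (a hypothetical value, not taken from the paper); the helper name `uniform_cf` is likewise introduced only for this illustration.

```python
import math


def uniform_cf(omega: float, theta: float) -> float:
    """Characteristic function sin(theta*omega)/(theta*omega) of U[-theta, theta]."""
    if omega == 0.0:
        return 1.0
    return math.sin(theta * omega) / (theta * omega)


theta = 2.0  # hypothetical half-width of the error support

# Candidate zeros omega = pi*k/theta for k = 1, 2, 3.
zeros = [math.pi * k / theta for k in (1, 2, 3)]
values_at_zeros = [uniform_cf(w, theta) for w in zeros]

# A frequency halfway to the first zero, where the function is far from zero.
value_between = uniform_cf(math.pi / (2.0 * theta), theta)
```

Any Fourier-based estimator that divides by this function is undefined at the frequencies in `zeros`, which is the difficulty described above.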
- (i)
How to construct rate-optimal estimators when assumption (A) does not hold, that is, when $\hat g(i\omega)$ has zeros, and what is the best achievable rate of convergence under these circumstances?
- (ii)
Under which conditions can one achieve the standard rates of convergence (1.2) without assuming (A)?
The existing literature contains only partial and fragmentary answers to questions (i) and (ii). Devroye (1989) constructed a consistent estimator of $f$ under the assumption that $\hat g(i\omega)\neq 0$ for almost all $\omega$. The proposed estimator is a certain modification of the standard Fourier–transform–based kernel density estimator. Hall et al. (2001) consider the setting with the uniform measurement error density and develop an estimator under the assumption that the density $f$ is compactly supported. Other works dealing with uniform density deconvolution are Groeneboom & Jongbloed (2003) and Feuerverger et al. (2008). The first cited paper assumes that $X_1$ is non–negative, and shows that for a class of twice continuously differentiable densities, the pointwise risk of the proposed estimators converges to zero at the standard rate corresponding to $\alpha=2$ and $\gamma=1$. Feuerverger et al. (2008) studied the problem of estimating densities from Sobolev functional classes with the $\mathbb{L}_2$–risk; they show that the standard rate of convergence with $\gamma=1$ can be achieved in this setting provided that $f$ has two bounded moments. These results demonstrate that, in the problem with uniformly distributed measurement errors and under the aforementioned assumptions on $f$, zeros of the characteristic function of the measurement errors have no effect on the minimax rate of convergence.
Hall & Meister (2007) and Meister (2008) considered the density deconvolution problem with an oscillating Fourier transform that vanishes periodically. They proposed several modifications of the standard Fourier–transform–based estimators, considered the $\mathbb{L}_2$–risk and showed that for certain nonparametric classes of probability densities, zeros of the characteristic function do affect the rate of convergence. Delaigle & Meister (2011) demonstrated that if the density to be estimated has a finite left endpoint, then it can be estimated with the standard rate, as in the case where $\hat g(i\omega)$ does not have zeros. Meister & Neumann (2010) considered a setting where $\hat g(i\omega)$ may have zeros, but there are two observations of the same variable with independent measurement errors. In this setting zeros of $\hat g(i\omega)$ have no influence on the rate of convergence.
The existing results in the literature leave open a fundamental question about the construction of rate–optimal density deconvolution estimators under general assumptions on $\hat g$. Specifically, it is not clear whether and under which conditions zeros of $\hat g$ have no influence on the minimax rates of convergence.
The current paper addresses the aforementioned issues. First, we develop a methodology for constructing optimal density deconvolution estimators under general conditions on the measurement error distribution. These conditions cover settings with vanishing and non–vanishing characteristic functions of the measurement errors, and the proposed methodology treats all these settings in a unified way. The estimation methods we propose are based on the Laplace transform. In this sense they generalize standard Fourier–transform–based estimation techniques commonly used in the literature on density deconvolution. Second, we derive upper bounds on the risk of the proposed estimators and provide sufficient conditions on $f$ under which the standard rate of convergence can be achieved under general assumptions on $\hat g$. In particular, we prove that if, in addition to the Hölder or Sobolev smoothness restriction, the density $f$ has bounded moments of a sufficiently large order, then the standard rate of convergence can be achieved even without Assumption (A). The required number of bounded moments is characterized in terms of a sequence of coefficients (the zero set sequence) which, in turn, is determined by the geometry of the zeros of $\hat g$. Third, we specialize our general methodology to specific problem instances in which the zero set sequences can be explicitly calculated. Finally, we show that for some specific problem instances the derived sufficient moment conditions are also necessary to guarantee the standard rate of convergence in the absence of (A).
The rest of the paper is organized as follows. In Section 2 we present the general idea behind the construction of the proposed estimators. Section 3 introduces assumptions on the distribution of the measurement errors and presents examples of distributions satisfying these assumptions. Section 4 discusses the construction of the estimator kernel and develops its infinite series representation. In Section 5 we define the estimator and present upper bounds on its risk. Settings corresponding to specific problem instances are discussed in Section 6, and lower bounds showing the necessity of moment conditions are presented in Section 7. Some concluding remarks are given in Section 8. Proofs of all results are given in the Appendix.
1.2 Notation
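For a generic locally integrable function $\phi$ the bilateral Laplace transform is defined by
$$\hat\phi(z) := \int_{-\infty}^{\infty}\phi(t)e^{-zt}\,\mathrm{d}t.$$
The Laplace transform $\hat\phi(z)$ is an analytic function in the convergence region $\Sigma_\phi$ of the above integral which, in general, is a vertical strip:
$$\Sigma_\phi := \big\{z\in\mathbb{C}: \sigma_\phi^{-}<\mathrm{Re}(z)<\sigma_\phi^{+}\big\}.$$
The convergence region can degenerate to a vertical line $\Sigma_\phi=\{z\in\mathbb{C}:\mathrm{Re}(z)=\sigma_\phi\}$, $\sigma_\phi\in\mathbb{R}$, in the complex plane. If $\phi$ is a probability density, then the imaginary axis always belongs to $\Sigma_\phi$, that is, $\{z:\mathrm{Re}(z)=0\}\subseteq\Sigma_\phi$, and
$$\hat\phi(i\omega) = \int_{-\infty}^{\infty}\phi(t)e^{-i\omega t}\,\mathrm{d}t, \qquad \omega\in\mathbb{R},$$
is the characteristic function (the Fourier transform of $\phi$). The degenerate case corresponds to distributions whose characteristic function cannot be analytically continued to a strip around the imaginary axis in the complex plane. The inverse Laplace transform is given by the formula
$$\phi(t) = \frac{1}{2\pi i}\int_{s-i\infty}^{s+i\infty}\hat\phi(z)e^{zt}\,\mathrm{d}z, \qquad s\in\{\mathrm{Re}(z): z\in\Sigma_\phi\}.$$
The uniqueness property of the bilateral Laplace transform states that if $\hat\phi_1(z)=\hat\phi_2(z)$ in a common strip of convergence then $\phi_1(t)$ is equal to $\phi_2(t)$ for almost all $t$ (Widder, 1946, Theorem 6b).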
2 General idea for estimator construction
Let $G$ be the measurement error distribution function with the corresponding Laplace transform
$$\hat g(z) = \int_{-\infty}^{\infty}e^{-zx}\,\mathrm{d}G(x),$$
whose convergence region is denoted $\Sigma_g$. Throughout the paper we suppose that $\Sigma_g$ is a vertical strip in the complex plane, $\Sigma_g:=\{z\in\mathbb{C}:\sigma_g^{-}<\mathrm{Re}(z)<\sigma_g^{+}\}$, with $\sigma_g^{-}$ and $\sigma_g^{+}$ satisfying $\sigma_g^{-}<0<\sigma_g^{+}$ (see Assumption 1 in Section 3). As discussed above, if $\hat g$ has zeros on the imaginary axis in the complex plane, then the usual Fourier–transform–based methods are not directly applicable. We will be mainly interested in this case.
2.1 Linear functional strategy
The construction of our estimators follows the so-called linear functional strategy that is frequently used for solving ill–posed inverse problems [see, e.g., Goldberg (1979) and Anderssen (1980)]. In the context of the density deconvolution problem the main idea of the strategy is as follows. We want to find two kernels, say, and with the following properties:
- (i)
integral approximates “well” the value ;
- (ii)
the kernel is related to the kernel via the equation:
[TABLE]
Under conditions (i) and (ii), the obvious estimator of from the observations is the empirical estimator of the integral on the right hand side of (2.1):
[TABLE]
Let be a kernel with standard properties that will be specified later. For denote . Assume that has bounded support so that is an entire function, that is, . Furthermore, assume that there exist real numbers and satisfying , such that
[TABLE]
In words, is the union of two open strips (with the imaginary axis as the boundary), where the function does not have zeros. Therefore we can define
[TABLE]
and this function is analytic in . Let
[TABLE]
with Observe that the kernel is defined by the inverse Laplace transform of the function , and the denominator of the integrand in (2.4) does not vanish as . If the integral on the right hand side of (2.4) is absolutely convergent then (2.4) defines the same function for any value of or . In other words, depending on the sign of , the equation (2.4) defines two different functions which will be denoted by and , respectively. The estimator of is then defined by
[TABLE]
The parameters and will be specified in the sequel.
2.2 Relationship between kernels and
The following lemma demonstrates that (2.1) holds for the kernels and given by (2.4).
Lemma 1.
Suppose that for any the integral on the right hand side of (2.4) is absolutely convergent, and
[TABLE]
then for any
[TABLE]
Note that relation (2.6) holds for both kernels and corresponding to and respectively. Thus, both or can be used in the estimator construction.
Remark 1.
A naive approach towards estimator construction could be based on a direct application of the Laplace transform inversion formula. In particular, (1.1) implies that $\hat f_Y(z)=\hat f(z)\hat g(z)$. The empirical estimator of $\hat f_Y(z)$ can be constructed in the standard way using the available data $Y_1,\ldots,Y_n$; then a division by $\hat g(z)$ with proper regularization and application of the inverse Laplace transform formula yields an estimator of $f$. Note, however, that this approach requires analyticity of $\hat f$ in a strip containing the imaginary axis, i.e., $f$ must have very light tails. In contrast, our construction does not require existence of $\hat f(z)$ for $z$ outside the imaginary axis; only the analyticity of $\hat g$ is required.
3 Distribution of measurement errors
3.1 Assumptions
Accuracy of the estimator defined in (2.5) will be studied under the following general assumptions on the distribution of the measurement errors.
Assumption 1.
The Laplace transform of the measurement error distribution exists in a vertical strip , , and admits the following representation:
[TABLE]
where are positive real numbers, , are non-negative integer numbers, and the pairs , are distinct. The function is represented as
[TABLE]
where , is analytic, does not vanish in a vertical strip with , and .
Several remarks on Assumption 1 are in order.
Remark 2.
(i). Assumption 1 states that factorizes into a product of two functions: the first function, , has zeros only on the imaginary axis, while the second one, , does not have zeros in . The latter fact follows from analyticity of in .
(ii). Zeros of on the imaginary axis are , , where , , and the multiplicity of each zero is equal to . Assumption 1 implies that has no zeros in , and (2.2) holds with , that is, and .
(iii). The form of in (3.2) follows from (3.1) and the fact that .
In addition to Assumption 1 we require conditions on the growth of the function $\hat\psi$ in (3.1) on the imaginary axis. These conditions are similar to the standard ones in the smooth case [see condition (B1) in Section 1].
Assumption 2.
Assume there exist constants , and , such that
[TABLE]
In addition, suppose that
[TABLE]
for all natural and some positive constants
Remark 3.
Condition (3.3), when imposed on , is rather standard in the literature; it corresponds to the so-called smooth error densities. As for (3.4), similar restrictions on the derivatives of are usually imposed in the proofs of lower bounds; see, e.g., (Fan, 1991, Theorem 5). We also note that all derivatives of the function exist due to analyticity of . Furthermore, if the inequality
[TABLE]
holds for some , and constant is independent of then the well-known Cauchy derivative estimates imply (3.4).
3.2 Examples of distributions
Assumptions 1 and 2 define a broad class of distributions containing densities with characteristic functions that vanish on the real line. Moreover, discrete distributions are covered by Assumptions 1 and 2. All this is illustrated in the following examples.
Example 1 (Uniform distribution).
Let then
[TABLE]
In this case representation (3.1) holds with , , , and , . Clearly, satisfies Assumption 2 with . Note that has simple zeros on the imaginary axis at , , and .
Example 2 (Convolution of uniform distributions).
Consider a convolution of the uniform distributions , with distinct parameters , each of multiplicity In this case
[TABLE]
Therefore Assumption 1 holds with , , , , and
[TABLE]
Thus, satisfies Assumption 2 with . Of special interest is the case of identical uniform distributions . Here , , , , and . Note also that in this case .
Example 3 (Discrete distributions).
Let be a discrete random variable taking values in the set , with corresponding probabilities , , where . Then
[TABLE]
where . Let be the roots of the polynomial ; note that , Then we have
[TABLE]
and is an entire function, i.e., . Representations (3.1) and (3.2) hold with
[TABLE]
and $\Sigma_{\psi}=\big\{z: b^{-1}\ln(\lambda_{-})<\mathrm{Re}(z)<b^{-1}\ln(\lambda_{+})\big\}$, where , and . In this example if all with are distinct, then , , and . It is obvious that Assumption 2 holds with .
In the special case of the Bernoulli distribution with the success probability parameter we have ; hence (3.1) holds with , , , , and . If is a binomial random variable with the number of trials and a success probability , then , and (3.1) holds with , , , , and .
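To illustrate when a discrete error distribution violates condition (A), the sketch below evaluates the characteristic function of a binomial error taking values $0,1,\ldots,m$. The function name `binomial_cf`, the unit lattice step and the parameter values are assumptions made for this illustration only. It relies on the elementary fact that $1-p+pe^{-i\omega}$ can vanish for real $\omega$ only when $p=1/2$, in which case the zeros are the odd multiples of $\pi$.

```python
import cmath
import math


def binomial_cf(omega: float, m: int, p: float) -> complex:
    """E[exp(-i*omega*eps)] for eps ~ Binomial(m, p) on the lattice {0, ..., m}."""
    return (1.0 - p + p * cmath.exp(-1j * omega)) ** m


# Symmetric case p = 1/2: zeros at omega = pi*(2k+1), each of multiplicity m.
symmetric_values = [abs(binomial_cf(math.pi * (2 * k + 1), 3, 0.5)) for k in range(3)]

# Asymmetric case p = 0.3: the modulus never drops below |1 - 2p|**m.
asymmetric_min = abs(binomial_cf(math.pi, 3, 0.3))
```

In the symmetric case the modulus is numerically zero at every odd multiple of $\pi$, whereas for $p=0.3$ its smallest value on the real line is $0.4^3$, attained at those same frequencies.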
Example 4 (Convolution of uniform and smooth densities).
Let be a probability density with Laplace transform defined in a strip such that , . Assume also that for some as , that is, satisfies the standard conditions of the smooth case. Let be a convolution of the uniform density on with ; then
[TABLE]
and (3.1) obviously holds with . For instance, let be a density of the Gamma distribution with parameters and , that is, , Then , , and .
4 Kernel representation
Under Assumption 1 kernel defined in (2.4) is rewritten as follows
[TABLE]
where is the set on which does not vanish. Thus, for any the denominator of the integrand in (4.1) is non–zero. Below we demonstrate that can be formally represented as an infinite series.
4.1 Infinite series representation
To develop the infinite series representation we need the following notation. According to Assumption 1, the set of zeros of on the imaginary axis is determined by three -tuples , and . For a given vector define
[TABLE]
The set can be represented as an ordered set of real numbers , where . Define also
[TABLE]
and let
[TABLE]
In fact, is the number of weak compositions of into parts [see, e.g., (Stanley, 1997, p. 25)]. Recall that a $k$–tuple $(i_1,\ldots,i_k)$ of non–negative integers with $i_1+\cdots+i_k=n$ is called a weak composition of $n$ into $k$ parts.
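The count of weak compositions has the closed form $\binom{n+k-1}{k-1}$, a standard combinatorial fact. The sketch below compares it with brute-force enumeration; the function names and the generic symbols `n` and `k` are chosen for this illustration and are not the paper's notation.

```python
from itertools import product
from math import comb


def count_weak_compositions(n: int, k: int) -> int:
    """Number of k-tuples of non-negative integers summing to n (k >= 1)."""
    return comb(n + k - 1, k - 1)


def enumerate_weak_compositions(n: int, k: int) -> list:
    """Brute-force list of all weak compositions of n into k parts."""
    return [c for c in product(range(n + 1), repeat=k) if sum(c) == n]


# Compare the closed form with enumeration on a small grid of (n, k).
agreement = all(
    count_weak_compositions(n, k) == len(enumerate_weak_compositions(n, k))
    for n in range(6)
    for k in range(1, 4)
)
```

For fixed $k$ the count is a polynomial in $n$ of degree $k-1$, which is consistent with the polynomial growth of the zero set sequences noted in Remark 4.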
Lemma 2.
Let Assumption 1 hold, and $\int_{-\infty}^{\infty}\big|\widehat{K}(i\omega h)\widehat{\psi}(-i\omega)\big|\,\mathrm{d}\omega<\infty$.
- (a)
If then
[TABLE]
provided that the summation on the right hand side of (4.3) defines a finite function for any .
- (b)
If then
[TABLE]
provided that the summation on the right hand side of (4.5) is finite for any .
Remark 4.
(i). Lemma 2 shows that under Assumption 1 kernel can be represented as an infinite linear combination of one–sided translations of , where the translation parameter takes values in the set .
(ii). The coefficients and of the linear combination are completely determined by the structure of the zero set of on the imaginary axis. The sequences , will play an important role in the sequel, and we call them the zero set sequences. The definitions in (4.4) and (4.6) imply that the coefficients , may grow at most polynomially in as . Note also that and .
4.2 Kernel representation in specific problem instances
In general, determination of coefficients and in (4.4) and (4.6) is difficult. It is instructive to apply the result of Lemma 2 to some particular cases of Examples 1–4 where the zero set sequences and the corresponding kernels can be explicitly calculated.
Uniform distribution
This is the setting of Example 1. Recall that here , , , ; hence , and , for all , . Since ,
[TABLE]
Thus, in view of (4.3) and (4.5)
[TABLE]
If is a bounded continuously differentiable kernel with finite support, and is small enough then formulas in (4.7) define functions and which are finite for any .
Convolution of uniform distributions
We consider two specific cases of Example 2.
(a). Consider convolution of identical uniform distributions . In this case , , and . Thus, , , , . Since ,
[TABLE]
Therefore
[TABLE]
Similarly to the previous example, if is times continuously differentiable with a finite support, and is small enough then the formulas define finite functions for any .
(b). Consider convolution of uniform distributions , with distinct , . In this case , , for . Thus, , and if for some then is the number of non–negative integer solutions to the equation
[TABLE]
and . It is clear that there is at least one solution ; the total number of solutions depends on . For instance, assume that , , where are coprime integer numbers. Then with is the number of representations of the integer number by non–negative integer linear combination of . By Schur’s theorem [see, e.g., (Wilf, 2006, Section 3.15)]
[TABLE]
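Schur's theorem states that, for coprime positive integers $a_1,\ldots,a_q$, the number of representations of $n$ as a non–negative integer combination of them is asymptotically $n^{q-1}/\{(q-1)!\,a_1\cdots a_q\}$. The following sketch checks this numerically with a standard dynamic-programming count; the function names and the example values are illustrative assumptions.

```python
from math import factorial, prod


def count_representations(n: int, parts: tuple) -> int:
    """Number of non-negative integer solutions x of sum_i parts[i]*x_i = n."""
    ways = [0] * (n + 1)
    ways[0] = 1
    # Process one part at a time so that each solution vector is counted once.
    for a in parts:
        for total in range(a, n + 1):
            ways[total] += ways[total - a]
    return ways[n]


def schur_asymptotic(n: int, parts: tuple) -> float:
    """Leading term n^(q-1) / ((q-1)! * a_1 * ... * a_q) from Schur's theorem."""
    q = len(parts)
    return n ** (q - 1) / (factorial(q - 1) * prod(parts))


ratio_two_parts = count_representations(6000, (2, 3)) / schur_asymptotic(6000, (2, 3))
ratio_three_parts = count_representations(3000, (2, 3, 5)) / schur_asymptotic(3000, (2, 3, 5))
```

Both ratios are close to one for large `n`, which illustrates the polynomial growth of order $n^{q-1}$ for the representation counts.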
It follows from (3.5) that
[TABLE]
and therefore
[TABLE]
Thus,
[TABLE]
where , and the sequence satisfies (4.8). If kernel is times continuously differentiable and has bounded support then the last formulas define functions which are finite for any fixed .
Binomial distribution
Assume that the measurement error distribution is binomial with parameters and ; this is a particular case of Example 3. Here , , , and . Hence , , ,
[TABLE]
and
[TABLE]
5 Estimator and upper bounds on the risk
Based on the general ideas presented in Section 2 and kernel representations developed in Section 4 we are now in a position to define the proposed estimator of and to study its accuracy.
5.1 Estimator
We assume that kernel is chosen to satisfy the following condition.
- (K)
Let be a function supported on such that for a fixed positive integer
[TABLE]
Condition (K) is standard in nonparametric kernel density estimation; clearly, one can always construct kernel satisfying (K) with prescribed parameter .
Let be a natural number, and denote
[TABLE]
The estimator of is defined as follows
[TABLE]
where we set
[TABLE]
and
[TABLE]
In what follows we will write and for the estimator (5.1) associated with and respectively. Let finally
[TABLE]
Recall that the function and the sequences , are defined in (4.2), (4.3) and (4.5), respectively. The estimator construction follows the linear functional strategy of Section 2 in conjunction with the kernel representation developed in Section 4. Note that we truncate the infinite series kernel representation by the cut–off parameter ; this introduces some bias but ensures that the integral on the left hand side of (2.6) is absolutely convergent. The estimator requires specification of and ; this will be done in the sequel.
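For intuition, the truncated one–sided construction can be sketched in the simplest case of errors uniform on $[-\theta,\theta]$. The sketch does not reproduce the paper's displayed formulas; it is derived here from the elementary identity $f_Y'(y)=\{f(y+\theta)-f(y-\theta)\}/(2\theta)$, which telescopes to $f(x_0)=2\theta\sum_{k=0}^{N}f_Y'(x_0-\theta(2k+1))+f(x_0-2\theta(N+1))$. Replacing $f_Y'$ by the derivative of a kernel density estimate and dropping the remainder gives the estimator below. The triweight kernel, the function names and all numerical values are assumptions made for illustration, and no claim is made about which of the paper's two one–sided kernels this corresponds to.

```python
import math
import random


def triweight_derivative(u: float) -> float:
    """Derivative of the triweight kernel K(u) = (35/32)(1 - u^2)^3 on [-1, 1]."""
    if abs(u) >= 1.0:
        return 0.0
    return -(105.0 / 16.0) * u * (1.0 - u * u) ** 2


def deconvolve_uniform(x0, sample, theta, h, n_terms):
    """Truncated one-sided estimate of f(x0) for errors uniform on [-theta, theta].

    Sums kernel estimates of f_Y' at the points x0 - theta*(2k+1), k = 0..n_terms,
    and multiplies by 2*theta (telescoping identity described in the lead-in).
    """
    total = 0.0
    for y in sample:
        for k in range(n_terms + 1):
            total += triweight_derivative((x0 - theta * (2 * k + 1) - y) / h)
    return 2.0 * theta * total / (len(sample) * h * h)


# Simulated data: X ~ N(0, 1) observed with U[-1, 1] measurement error.
random.seed(0)
theta = 1.0
sample = [random.gauss(0.0, 1.0) + random.uniform(-theta, theta) for _ in range(200000)]

estimate = deconvolve_uniform(0.0, sample, theta, h=0.4, n_terms=5)
truth = 1.0 / math.sqrt(2.0 * math.pi)  # standard normal density at 0
```

The dropped remainder $f(x_0-2\theta(N+1))$ is the truncation bias, so the cut–off must grow when the left tail of $f$ is heavy. This is where moment conditions on $f$ enter, while each additional term in the sum adds to the variance.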
5.2 Functional classes
Now we define functional classes over which accuracy of the proposed estimators will be assessed. The next two definitions introduce standard classes of smooth functions.
Definition 1.
Let , be real numbers. We say that a probability density belongs to the functional class if is times continuously differentiable and
[TABLE]
Definition 2.
For real numbers and we say that a probability density belongs to the functional class if is times differentiable and
[TABLE]
We will also consider classes of probability densities with bounded moments.
Definition 3.
Let and be real numbers. We say that a probability density belongs to the functional class if
[TABLE]
We also denote the class of all densities from satisfying the following additional condition:
[TABLE]
where is a constant appearing in Assumption 2.
Remark 5.
(i). Condition (5.5) is rather mild. Note that implies boundedness of for all , and in (5.5) we require integrability of with the weight . If and then (5.5) holds trivially: with where . Therefore in the definition of the restriction (5.5) is active only if .
(ii). If then is uniformly bounded above by a constant depending on only. However, for the sake of convenience, we explicitly require boundedness of in the definition of the class .
We also denote
[TABLE]
5.3 Upper bounds
In this section we derive an upper bound on the maximal risk of the estimator (5.1) under Assumptions 1, 2, and under the following additional condition on the growth of zero set sequences and .
Assumption 3.
Assume that
[TABLE]
for some and
Theorem 1.
Suppose that Assumptions 1, 2 and 3 hold. Let be associated with kernel satisfying the condition (K) with .
- (a)
Assume that with Let $h=h_{*}:=\big[B(A^{2}n)^{-1}\big]^{1/(2\alpha+2\gamma+1)}$ and $N\geq\big(A^{-2\gamma+1}B^{2\gamma+\alpha}n^{\alpha+1}\big)^{1/p(2\alpha+2\gamma+1)}$. Then for all large enough $n$ one has
[TABLE]
where may depend on and only.
- (b)
Let with Let $N\geq\big(A^{-2\gamma+1}B^{2\gamma+\alpha}n^{\alpha+1}\big)^{2/(2p-1)(2\alpha+2\gamma+1)}$ and . Then for all large enough $n$ one has
[TABLE]
where may depend on and only.
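The tuning prescribed in part (a) can be computed directly from the stated formulas. The sketch below is only an illustration: it assumes that the exponent written as $1/p(2\alpha+2\gamma+1)$ is to be read as $1/[p(2\alpha+2\gamma+1)]$, and the function names and numerical inputs are hypothetical.

```python
def bandwidth_pointwise(n, alpha, gamma, A, B):
    """h_* = [B / (A^2 n)]^(1 / (2*alpha + 2*gamma + 1)) from part (a)."""
    return (B / (A ** 2 * n)) ** (1.0 / (2 * alpha + 2 * gamma + 1))


def cutoff_lower_bound_pointwise(n, alpha, gamma, p, A, B):
    """Lower bound on the cut-off N in part (a).

    Assumes the exponent reads 1 / [p * (2*alpha + 2*gamma + 1)].
    """
    base = A ** (-2 * gamma + 1) * B ** (2 * gamma + alpha) * n ** (alpha + 1)
    return base ** (1.0 / (p * (2 * alpha + 2 * gamma + 1)))


# Hypothetical inputs: alpha = 2, gamma = 1 (e.g. uniform errors), A = B = 1, p = 3.
h_star = bandwidth_pointwise(128, alpha=2, gamma=1, A=1.0, B=1.0)
n_min = cutoff_lower_bound_pointwise(128, alpha=2, gamma=1, p=3, A=1.0, B=1.0)
```

Under this reading the bandwidth shrinks with $n$ at the usual deconvolution rate, while the required cut–off grows more slowly when more moments of $f$ are bounded (larger $p$).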
Remark 6.
(i). It is well known that under assumptions , and , , we have and as . Theorem 1 provides conditions on that guarantee the standard rate of convergence in the case when may have zeros. In particular, conditions with and are sufficient in order to ensure the standard rate for the pointwise and $\mathbb{L}_2$–risks, respectively. It is worth noting that the $\mathbb{L}_2$–risk bound requires a stronger condition; as we will show in Section 7, this is an intrinsic feature of the problem.
(ii). The result of Theorem 1 is rather general: it holds for any configuration of zeros of and function satisfying Assumption 2. One interesting implication is that for discrete error distributions such as in Example 3, the achievable rate of convergence is $n^{-\alpha/(2\alpha+1)}$ provided that $f$ has a finite absolute moment of sufficiently large order. Note that $n^{-\alpha/(2\alpha+1)}$ is the minimax rate of convergence in the problem of density estimation from direct i.i.d. observations.
(iii). If the measurement error distribution is uniform, then Assumption 3 holds with for arbitrarily small . Therefore Theorem 1 implies that the standard rate of convergence of the pointwise and $\mathbb{L}_2$–risks is achieved if with and , respectively.
In some specific cases when closed form expressions for the zero set sequences and and function are available, the conditions of Theorem 1 can be relaxed. We demonstrate this in the next section.
6 Specific problem instances
In this section we consider specific distributions of measurement errors for which conditions of Theorem 1 can be relaxed.
6.1 Convolution of uniform distributions
Consider a particular case of Example 2 where , . This setting also covers Example 1 that corresponds to . Recall that in this case , , and therefore Assumption 3 is valid for any . Then Theorem 1 implies that the pointwise and –risks converge to zero at the standard rate provided that with , and , respectively. In fact, as the following result demonstrates, these conditions are too strong: the standard rate is achievable if for the pointwise risk, and if for the –risk. Recall that and
[TABLE]
The corresponding kernels are
[TABLE]
Theorem 2.
Let , . Let be a kernel satisfying the condition (K) with , and let denote the estimator defined in (5.1)–(5.4) and associated with the kernels (6.2) and (6.2).
- (a)
Assume that with if , and if . Let $h=h_{*}:=\big[B(A^{2}n)^{-1}\big]^{1/(2\alpha+2m+1)}$, and $N\geq\big(A^{-2m+1}B^{2m+\alpha}n^{\alpha+1}\big)^{1/p(2\alpha+2m+1)}$. Then for large enough $n$
[TABLE]
where may depend on , and only.
- (b)
Let with , , and . Then for all large enough
[TABLE]
where may depend on , and only.
Remark 7.
(i). In contrast to the proof of Theorem 1, the proof of Theorem 2 relies on closed form expressions for the kernels [cf. (6.2), (6.2)]. In this case the support of function has a “small” length, and this fact is crucial for relaxing the assumptions of Theorem 1.
(ii). Theorem 2 shows that the standard rate of convergence is achieved
- •
by the maximal pointwise risk over if when , and for any when ;
- •
by the maximal –risk over if , .
In contrast, Theorem 1 requires with and , respectively. The difference in conditions is particularly noticeable in the case of the uniform density where . Indeed, while Theorem 1 requires finiteness of -th moment with for the pointwise risk and for the –risk, Theorem 2 shows that it is sufficient to require conditions and , . It also follows from the proof that assumption for the pointwise risk can be further relaxed: any uniform decrease of as will be sufficient.
6.2 Binomial distribution
In this section we consider a specific case of Example 3 of Section 3.2, where the measurement errors have binomial distribution with parameters and . Here . Recall that in this case , , ,
[TABLE]
and
[TABLE]
Theorem 3.
Let , . Fix some Let be a kernel satisfying condition (K) with , and let denote the estimator defined in (5.1)–(5.4) and associated with the kernels in (6.3).
- (a)
Assume that with if , and if . Let , and . Then for large enough
[TABLE]
where may depend on only.
- (b)
Assume that for some . Let , and . Then for all large enough
[TABLE]
where may depend on only.
The proof is omitted as it goes along the same lines as the proof of Theorem 2 with minor modifications.
7 Lower bounds: necessity of moment conditions
Theorem 2 shows that if the error distribution is the $m$–fold convolution of the uniform distributions on then the maximal pointwise and $\mathbb{L}_2$–risks on the classes and converge to zero at the standard rate $n^{-\alpha/(2\alpha+2m+1)}$, provided that moment conditions hold. The following theorem demonstrates that these moment conditions are also necessary.
Theorem 4.
Let , where is an integer.
- (a) If with then
[TABLE]
- (b) Moreover, if with then
[TABLE]
Remark 8.
The theorem states that under conditions and the standard rates of convergence cannot be attained in estimation with pointwise and –risks, respectively. On the other hand, we have constructed an estimator that achieves the standard rate of convergence, provided that , and , in the former case, and in the latter case; see Theorem 2. Thus, the indicated moment conditions are necessary for convergence of the risks at the standard rate.
8 Concluding remarks
We close this paper with some concluding remarks.
1. The proposed estimator in (5.4) is associated with the one–sided kernels and , which were used for positive and negative values of respectively. This definition was adopted for the sake of convenience and unification of proofs. In fact, a closer inspection of the proofs of Theorems 1 and 2 shows that for any one can construct an estimator relying on any one of these two kernels with the same risk guarantees. In this case the parameter should be chosen depending on .
2. In this paper we considered the functional class of densities satisfying certain moment conditions. It is worth noting that the proposed estimators can be analyzed under other assumptions as well. For instance, if the support of has a finite left endpoint, then there is no need to assume that . Indeed, the proof of Theorem 2 shows that the accuracy of () is determined by the right (left) tail of . Therefore if the support of has a finite left endpoint, then it is reasonable to use the estimator whose risk will converge to zero at the standard rate. This fact connects our result to those of Groeneboom & Jongbloed (2003) and Delaigle & Meister (2011).
3. The following lower bounds on the minimax risks and can be extracted from the proof of Theorem 4. If the measurement error distribution is a –fold convolution of the uniform distribution then for any
[TABLE]
Observe that if , and for all . These results should be compared with the upper bounds in Theorem 2. In particular, even in the case of the uniform error density there is a significant difference in the behavior of the minimax risks and : while the former is of the order , the latter converges to zero at a rate slower than for any small . It is worth noting that some lower bounds on the minimax –risk are reported in Hall & Meister (2007) and Meister (2008). However, these bounds are not directly comparable with ours since the functional classes and the assumptions on considered in those papers differ from the ones adopted in our paper. Moreover, we mainly focus here on the minimal conditions needed to preserve the standard convergence rates and do not consider the problem of constructing optimal (in the minimax sense) estimators in the case where these conditions are violated.
4. We focused on the setting where the characteristic function of the measurement errors has zeros on the imaginary axis and decreases at a polynomial rate. This corresponds to the case of smooth error densities. The super–smooth case, where the characteristic function decreases at an exponential rate, can also be considered within the proposed framework. This assumption leads to slow logarithmic rates, and it can be shown that zeros of the error characteristic function do not affect the minimax rates of convergence, i.e., the standard minimax rates are preserved with no additional tail conditions.
5. The proposed estimators are not adaptive in the sense that they require knowledge of the underlying functional classes. However, based on the results of this paper, adaptive estimators can be developed using standard methods [see, e.g., Lepski (1990) and Goldenshluger & Lepski (2011)]. We do not pursue this direction in the current paper.
6. Johnstone & Raimondo (2004) considered a closely related problem of signal deconvolution in the periodic Gaussian white noise model , , where is a boxcar kernel, that is, , and is the standard two–sided Wiener process. If is a rational number then the signal is non–identifiable. Assuming that is irrational, Johnstone & Raimondo (2004) studied the behavior of the minimax –risk over the classes of ellipsoids and hyperrectangles defined on the Fourier coefficients of . They show that the minimax rates of convergence for the –risk are affected by the oscillating behavior of the Fourier coefficients of the boxcar kernel. Our results suggest that if the assumption of periodicity of and is dropped then the minimax –risk over the class should be of the standard order . We plan to study these signal deconvolution models in our future research.
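The non-identifiability for rational boxcar widths mentioned in the last remark can be illustrated numerically. The sketch below assumes a particular normalization that is not taken from the paper: period 1, kernel $g(x)=(2a)^{-1}\mathbf{1}_{[-a,a]}(x)$ with $0<a<1/2$, and Fourier basis $e^{2\pi i k x}$, so that the k-th coefficient is $\sin(2\pi k a)/(2\pi k a)$. The function name and the two widths are hypothetical choices for illustration.

```python
import math


def boxcar_fourier_coefficient(k: int, a: float) -> float:
    """k-th Fourier coefficient of g(x) = 1/(2a) on [-a, a], period 1 (assumed convention)."""
    if k == 0:
        return 1.0
    x = 2.0 * math.pi * k * a
    return math.sin(x) / x


ks = range(1, 51)
a_rational = 0.25                    # rational width: some coefficients vanish
a_irrational = 1.0 / math.sqrt(8.0)  # float approximation of an irrational width

rational_coeffs = {k: boxcar_fourier_coefficient(k, a_rational) for k in ks}
irrational_coeffs = {k: boxcar_fourier_coefficient(k, a_irrational) for k in ks}

# Frequencies at which the rational-width kernel annihilates the signal.
vanishing = [k for k, c in rational_coeffs.items() if abs(c) < 1e-12]

# For the irrational width no coefficient is exactly zero, but some are small.
smallest_irrational = min(abs(c) for c in irrational_coeffs.values())
```

With the rational width every even frequency is lost, so the corresponding Fourier coefficients of the signal cannot be recovered; with the irrational width all coefficients are nonzero but oscillate, occasionally coming close to zero.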
Appendix
Proof of Lemma 1
Fix . By Fubini’s theorem
[TABLE]
Now we show that for almost all
[TABLE]
Applying the bilateral Laplace transform to the left hand side of the previous display formula we obtain
[TABLE]
In view of (2.3), the function on the right hand side is analytic and equal to on . On the other hand,
[TABLE]
Thus, the bilateral Laplace transforms of the functions on both sides of (.1) coincide on ; therefore (.1) holds by the uniqueness property of the bilateral Laplace transform. This implies the lemma statement.
Proof of Lemma 2
(a). For , and we have
[TABLE]
Therefore
[TABLE]
and it follows from (4.1) that
[TABLE]
where the third line follows from analyticity of the integrand, and is defined in (4.4). Note that the change of the order of integration and summation is permissible under the premise of the lemma.
(b). If then
[TABLE]
and, similarly to the above,
[TABLE]
which yields
[TABLE]
where is defined in (4.6). This completes the proof.
Proof of Theorem 1
Throughout the proof we keep track of dependence of all constants on parameters of the classes and . In what follows stand for constants that can depend on parameters appearing in Assumptions 1, 2 and 3, and on and only. For the sake of brevity, in the subsequent proof we do not indicate integration limits if the corresponding integrals are taken over the entire real line.
Proof of statement (a)
We assume that and consider the estimator only; the derivation for and is similar.
- First we verify that under Assumption 2 and condition (K) the estimator is well defined. Because has finite support and it is infinitely differentiable on the real line, is also infinitely differentiable and rapidly decreasing as in the sense that for all and . In particular, for by (3.3) of Assumption 2 we have
[TABLE]
Thus the function in (4.2) and the kernels and in (5.3) are well defined.
- First we derive an upper bound on the bias of . We have
[TABLE]
It follows from the definition of [cf. (4.4)] that
[TABLE]
Now noting that
[TABLE]
we obtain
[TABLE]
Substituting this expression in (.2), and taking into account (3.1) and we obtain
[TABLE]
where we denote
[TABLE]
If then
[TABLE]
where we denoted and took into account that for large enough . Then (.3), (.4), condition (K) and the fact that imply the following upper bound on the bias
[TABLE]
- Now we bound the variance of . We need the following notation. For non–negative integer number we let
[TABLE]
where , and and are given in (4.4) and (4.6).
We have
[TABLE]
where for we put
[TABLE]
where . Since is rapidly decreasing as and in view of (3.4) we have that as for all , where appears in Assumption 2. Recall that by premise of the theorem .
Now, we proceed with bounding the terms , on the right hand side of (.7). First, since , and therefore for all . By Parseval’s identity and Assumption 2
[TABLE]
Furthermore, implies that , , and since is analytic, the derivatives are finite for . Therefore for any integer by repeated integration by parts with respect to in (.8) we obtain for
[TABLE]
In the first line we have taken into account that as for all , and . Now, invoking (.6) we have
[TABLE]
The Cauchy–Schwarz inequality applied to the double integral on the right hand side yields
[TABLE]
We bound similarly. In particular, for any integer such that repeated integration by parts in (.8) first with respect to and then with respect to yields
[TABLE]
Hence
[TABLE]
and by the Cauchy–Schwarz inequality
[TABLE]
Combining the above bounds on , and we obtain that for any integer number such that one has
[TABLE]
Now we bound the integrals on the right hand side of the above display formula.
Note that for any
[TABLE]
In view of (3.1) for
[TABLE]
It is obvious that for all . Furthermore, by the Faà di Bruno formula
[TABLE]
where are the Bell polynomials. Recall that is a homogeneous polynomial in variables of degree . This fact and (3.4) imply that for
[TABLE]
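For reference, the general form of the Faà di Bruno formula invoked in this step, together with the homogeneity property of the partial Bell polynomials, can be written as follows. The symbols $f$, $g$, $n$ and $t$ are generic placeholders and are not the paper's notation.

```latex
\frac{d^n}{dx^n} f\big(g(x)\big)
  = \sum_{k=1}^{n} f^{(k)}\big(g(x)\big)\,
    B_{n,k}\big(g'(x), g''(x), \ldots, g^{(n-k+1)}(x)\big),
\qquad
B_{n,k}(t x_1, \ldots, t x_{n-k+1}) = t^{k}\, B_{n,k}(x_1, \ldots, x_{n-k+1}).
```

The second identity is the degree-$k$ homogeneity: a uniform bound on the derivatives of the inner function translates into a bound of the same power $k$ on each Bell polynomial term.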
Using (3.3) we obtain for ; hence , . This inequality in conjunction with boundedness of and , for all implies that for and
[TABLE]
Thus, if with then
[TABLE]
and therefore
[TABLE]
Taking into account that
[TABLE]
we have
[TABLE]
Combining the above bounds on the bias and variance we obtain that for any non–negative integer number satisfying one has
[TABLE]
where and may depend on and only.
To complete the proof of statement (a) it suffices to note that under Assumption 3, , provided that . Therefore, in view of Assumption 2 and condition (K) the last term on the right hand side of (.12) is bounded above by . Then the announced result follows by substitution of the values of and in inequality (.12).
Proof of statement (b)
The proof uses pointwise bounds derived in the proof of statement (a).
- To derive the upper bound on the integrated squared bias consider equality (.3). First we note the standard bound (Tsybakov, 2009, Section 1.2.3):
[TABLE]
Moreover, if then by (.4)
[TABLE]
The same upper bound holds for the integral of the squared bias of the estimator over . Thus,
[TABLE]
- Now consider the variance term. We use the variance decomposition given in (.7). It follows from (.9) that
[TABLE]
Furthermore, we note that for
[TABLE]
and the same inequality holds for the integral . By Assumption 3, if then the sum on the right hand side of (.14) is uniformly bounded in . Therefore using (.10) and applying the Cauchy–Schwarz inequality we obtain
[TABLE]
The term originating from is bounded similarly. The bound (.14) together with (.11) yields
[TABLE]
Combining the obtained inequalities with the bound on the bias and using the same reasoning as in the proof of Theorem 1 we conclude that for we have
[TABLE]
Substitution of the values and completes the proof of statement (b).
Proof of Theorem 2
In the subsequent proof we keep track of all constants depending on parameters of classes and . In what follows denote positive constants that depend on and only.
Proof of statement (a)
We provide the proof of statement (a) for the estimator corresponding to only. The proof for is identical in every detail. The estimator is given by the formula
[TABLE]
The variance of this estimator is bounded as follows:
[TABLE]
Assume that (this is always fulfilled for large ), and denote
[TABLE]
Since and , the intervals and are disjoint for . Therefore
[TABLE]
where we have used that is bounded above by a constant. Now we bound the sum on the right hand side of (.15).
First, note that is supported on , and . Therefore
[TABLE]
Furthermore, writing for brevity we have
[TABLE]
and since we obtain
[TABLE]
Note that $C_{j,m}\leq\big[(j+m-1)/(m-1)\big]^{m-1}e^{m-1}\leq c_{4}j^{m-1}$ for . Taking into account that we have
[TABLE]
where the first line follows from the elementary inequality for , and non–negative , while the second line follows from the fact that the integrals under the sum are taken over disjoint intervals because . A similar upper bound holds for the sum corresponding to the second integral on the right hand side of (.16). The expression corresponding to the third integral is bounded as follows
[TABLE]
where the last inequality holds because
[TABLE]
Therefore
[TABLE]
[TABLE]
where
[TABLE]
Letting and taking into account that , and we have for any
[TABLE]
This inequality yields , and since ,
[TABLE]
Then statement (a) follows immediately from the established bounds on the bias and variance by substitution of the chosen values of and .
Proof of statement (b)
We start with bounding the variance term. The basis for the derivation is formula (.15) that should be integrated over . In view of (.16) and the subsequent formulas
[TABLE]
Since we have
[TABLE]
Hence taking into account that and integrating with respect to we obtain
[TABLE]
Using the same reasoning it is immediate to show that the same bound holds for the integrals of and : , . Combining these bounds with (.15) we obtain
[TABLE]
The integral over the negative semi–axis is bounded similarly.
To bound the integrated squared bias we note that (.17)–(.18) and imply
[TABLE]
and the same estimate holds for the integral of over the negative semi–axis. In view of (.13) we obtain
[TABLE]
Then the statement follows from the established upper bounds on the integrated variance and the integrated squared bias.
Proof of Theorem 4
In the subsequent proof stand for positive constants that do not depend on . The proof of (7.1) is based on the standard reduction to a two–point testing problem, while in the proof of (7.2) we use reduction to the problem of testing multiple hypotheses [see (Tsybakov, 2009, Chapter 2)].
Proof of statement (a)
Let be a real number and consider the probability density
[TABLE]
where is a normalizing constant. Clearly, for and sufficiently large constant depending on . In addition, is infinitely differentiable and belongs to for any and large enough .
Pick a function with the following properties:
- (i) , ;
- (ii) for with some fixed , and , ;
- (iii) monotonically climbs from [math] to on . In addition, is an infinitely differentiable function on the real line.
Let be a small real number such that , and be an integer number. Define
[TABLE]
Note that is even and supported on the union of disjoint sets , where . Function is given by the inverse Fourier transform:
[TABLE]
For real numbers and define
[TABLE]
- We demonstrate that under an appropriate choice of the constants , and the function is a probability density, and it belongs to with .
First, we note that because . Second, since is infinitely differentiable, is rapidly decreasing, and for some constant depending on . Therefore for all , and if we set
[TABLE]
then by an appropriate choice of the constant we can ensure that for all . Thus, is a probability density provided that (.21) holds. This also shows that for .
For simplicity assume that is integer; then
[TABLE]
This implies that if .
- Without loss of generality we consider the problem of estimating the value . Note that we have
[TABLE]
- Now we bound from above the –divergence between densities of observations and corresponding to the hypotheses and . We have
[TABLE]
Since is supported on we have
[TABLE]
and
[TABLE]
Therefore
[TABLE]
Now we bound integrals and on the right hand side. First we assume that is an integer number. Then by Parseval’s identity
[TABLE]
Furthermore,
[TABLE]
We have
[TABLE]
By the Faà di Bruno formula for
[TABLE]
where are the Bell polynomials, and . First, we note that for any . Using this fact and taking into account that is a homogeneous polynomial in variables of degree we have
[TABLE]
Taking into account this inequality, (.23), (.22), the fact that is supported on , and recalling that are disjoint for all , we obtain
[TABLE]
Combining these bounds we obtain that for integer
[TABLE]
The same upper bound holds for non–integer . Indeed, it follows from the above bounds on and that for integer
[TABLE]
Then for real by the interpolation inequality for the Sobolev spaces [see, e.g., (Aubin, 2000, Proposition 6.3.3)] we have
[TABLE]
which yields (.25) for real . The choice
[TABLE]
ensures that and are not distinguishable from observations, which leads to the lower bound
[TABLE]
- The rate of convergence obtained in (.26) dominates the standard rate if
[TABLE]
Therefore if then the standard rate of convergence is not achievable. This completes the proof of (7.1).
Proof of statement (b)
Let , and be defined in (.19). Clearly, given by (.19) satisfies for and large enough and .
Let be the function satisfying conditions (i)–(iii) in the proof of statement (a). For positive integer define
[TABLE]
Note that is even, supported on , and and , have disjoint supports because . Moreover,
[TABLE]
Let be an integer number. Define the following family of functions
[TABLE]
where and are real numbers.
- First we demonstrate that , is a probability density from the class , provided that constants , and are chosen in an appropriate way. We have because for all . Moreover, similarly to the proof of statement (a), since is rapidly decreasing,
[TABLE]
Therefore, if we let
[TABLE]
then by an appropriate choice of constant we can ensure that for all . Thus, is indeed a probability density for any . This also shows that with . Furthermore,
[TABLE]
where the last expression follows from (.27). Therefore, if
[TABLE]
then for any .
By the Varshamov–Gilbert lemma [see, e.g., (Tsybakov, 2009, Lemma 2.9)] there exists a subset of such that , , and any two vectors differ in at least entries (the Hamming distance between and is at least ). In what follows we consider the family of functions . Clearly, for any we have
[TABLE]
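The kind of well-separated family of binary vectors provided by the Varshamov–Gilbert lemma can be sketched with a simple random search. The constants below follow the usual textbook statement of the lemma (for $M\geq 8$, a subset of $\{0,1\}^M$ containing the zero vector, with at least $2^{M/8}$ further elements and pairwise Hamming distance at least $M/8$); they are stated here as an assumption, and the function names and the value M = 24 are hypothetical. The search is only an illustration, not the argument used to prove the lemma.

```python
import random


def hamming(u: tuple, v: tuple) -> int:
    """Number of coordinates in which the binary vectors u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def separated_subset(M: int, seed: int = 0, max_tries: int = 10_000) -> list:
    """Random search for 2^(M/8) + 1 vectors in {0,1}^M, zero vector included,
    with pairwise Hamming distance at least M/8."""
    rng = random.Random(seed)
    target_size = int(2 ** (M / 8)) + 1
    min_dist = M / 8
    chosen = [tuple([0] * M)]
    for _ in range(max_tries):
        if len(chosen) >= target_size:
            break
        cand = tuple(rng.randint(0, 1) for _ in range(M))
        # Keep the candidate only if it is far from every vector kept so far.
        if all(hamming(cand, w) >= min_dist for w in chosen):
            chosen.append(cand)
    return chosen


M = 24  # illustrative (assumed) dimension
subset = separated_subset(M)
```

Each vector in such a subset indexes one hypothesis in the multiple testing reduction, and the Hamming separation is what keeps the corresponding densities apart.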
- Next, we bound the –divergence between distributions of observations corresponding to and , . As in the proof of statement (a) we have
[TABLE]
Furthermore,
[TABLE]
An upper bound on is derived as in the proof of statement (a). In particular, taking into account that is a sum of functions with disjoint supports, and repeating the steps from (.22) to (.24) we obtain
[TABLE]
Combining these bounds on and we obtain
[TABLE]
- To complete the proof we use Theorem 2.7 from Tsybakov (2009); see also Lemma 4 from Goldenshluger & Lepski (2014). In particular, this result implies that if and satisfy
[TABLE]
then for large one has , where , and is defined in (.29). Now we choose and so that (.30) and (.28) are satisfied. To this end define
[TABLE]
and set
[TABLE]
With this choice (.28) and (.30) hold, and
[TABLE]
where $\beta:=2\big[(2m+2)(\mu/\varkappa)-1\big]/(2\alpha-1)$. The rate of convergence obtained in (.29) dominates the standard rate of convergence if
[TABLE]
which is equivalent to . Therefore if then the standard rate of convergence is not attained. This completes the proof.
Acknowledgement
The authors are grateful to Taeho Kim for careful reading and useful remarks. This article was prepared within the framework of the HSE University Basic Research Program and funded by the Russian Academic Excellence Project ’5-100’. AG is supported by the ISF Research Grant no. 361/15.
References
- Anderssen, R. S. (1980). On the use of linear functionals for Abel–type integral equations. The Application and Numerical Solution of Integral Equations. Edited by Anderssen, R., De Hoog, F. and Lucas, M., 195–221, Sijthoff and Noordhoff International Publishers.
- Aubin, J.-P. (2000). Applied Functional Analysis. Second edition. John Wiley & Sons, New York.
- Butucea, C. and Tsybakov, A. B. (2008a). Sharp optimality in density deconvolution with dominating bias. I. Theory Probab. Appl. 52, 24–39.
- Butucea, C. and Tsybakov, A. B. (2008b). Sharp optimality in density deconvolution with dominating bias. II. Theory Probab. Appl. 52, 237–249.
- Carroll, R. J. and Hall, P. (1988). Optimal rates of convergence for deconvolving a density. J. Amer. Statist. Assoc. 83, 1184–1186.
- Comte, F. and Lacour, C. (2013). Anisotropic adaptive kernel deconvolution. Ann. Inst. Henri Poincaré Probab. Stat. 49, 569–609.
- Delaigle, A. and Meister, A. (2011). Nonparametric function estimation under Fourier–oscillating noise. Statist. Sinica 21, 1065–1092.
