Estimation in the convolution structure density model. Part I: oracle inequalities
Oleg Lepski, Thomas Willer

TL;DR
This paper develops a new pointwise selection rule for kernel estimators in the convolution structure density model, establishing oracle inequalities and demonstrating near-optimal adaptive minimax estimation under $L_p$-loss.
Contribution
It introduces a novel pointwise selection rule for kernel estimators in the convolution structure density model, with proven oracle inequalities and adaptive minimax optimality results.
Findings
Established $L_p$-norm oracle inequalities for the selected estimator.
Proved the proposed selection rule yields nearly optimal adaptive estimators.
Fully characterized the minimax risk behavior over anisotropic Nikol'skii classes.
O.V. Lepski, T. Willer
Aix Marseille Univ, CNRS, Centrale Marseille, I2M, Marseille, France
Institut de Mathématique de Marseille
Aix-Marseille Université
39, rue F. Joliot-Curie
13453 Marseille, France
Abstract
We study the problem of nonparametric estimation under $\mathbb{L}_p$-loss, $p\in[1,\infty)$, in the framework of the convolution structure density model on $\mathbb{R}^d$. This observation scheme is a generalization of two classical statistical models, namely density estimation under direct and indirect observations. In Part I an original pointwise selection rule from a family of "kernel-type" estimators is proposed. For the selected estimator, we prove an $\mathbb{L}_p$-norm oracle inequality and several of its consequences. In Part II the problem of adaptive minimax estimation under $\mathbb{L}_p$-loss over the scale of anisotropic Nikol'skii classes is addressed. We fully characterize the behavior of the minimax risk for different relationships between regularity parameters and norm indexes in the definitions of the functional class and of the risk. We prove that the selection rule proposed in Part I leads to the construction of an optimally or nearly optimally (up to a logarithmic factor) adaptive estimator.
AMS subject classifications: 62G05, 62G20.
Keywords: deconvolution model, density estimation, oracle inequality, adaptive estimation, kernel estimators, $\mathbb{L}_p$-risk, anisotropic Nikol'skii class.
This work has been carried out in the framework of the Labex Archimède (ANR-11-LABX-0033) and of the A*MIDEX project (ANR-11-IDEX-0001-02), funded by the "Investissements d'Avenir" French Government program managed by the French National Research Agency (ANR).
1 Introduction
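In the present paper we investigate the following observation scheme, introduced in Lepski and Willer (2017). Suppose that we observe i.i.d. vectors $Z_i\in\mathbb{R}^d$, $i=1,\ldots,n$, with a common probability density $\mathfrak{p}$ satisfying the following structural assumption
$$\mathfrak{p}=(1-\alpha)f+\alpha[f\star g],\quad f\in\mathbb{F}_g(R),\quad \alpha\in[0,1], \qquad (1.1)$$
where $\alpha\in[0,1]$ and $g:\mathbb{R}^d\to\mathbb{R}$ are supposed to be known and $f:\mathbb{R}^d\to\mathbb{R}$ is the function to be estimated. We will call the observation scheme (1.1) the convolution structure density model.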
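Here and later, for two functions $f,g\in\mathbb{L}_1\big(\mathbb{R}^d\big)$,
$$[f\star g](x)=\int_{\mathbb{R}^d}f(x-z)g(z)\,\nu_d(\mathrm{d}z),\quad x\in\mathbb{R}^d,$$
and for any $\alpha\in[0,1]$, $g\in\mathbb{L}_1\big(\mathbb{R}^d\big)$ and $R>1$,
$$\mathbb{F}_g(R)=\Big\{f\in\mathbb{B}_{1,d}(R):\ (1-\alpha)f+\alpha[f\star g]\in\mathfrak{P}\big(\mathbb{R}^d\big)\Big\}.$$
Here $\mathfrak{P}\big(\mathbb{R}^d\big)$ denotes the set of probability densities on $\mathbb{R}^d$, $\mathbb{B}_{s,d}(R)$ is the ball of radius $R>0$ in $\mathbb{L}_s\big(\mathbb{R}^d\big):=\mathbb{L}_s\big(\mathbb{R}^d,\nu_d\big)$, $1\leq s\leq\infty$, and $\nu_d$ is the Lebesgue measure on $\mathbb{R}^d$.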
We remark that if one assumes additionally that $f,g\in\mathfrak{P}\big(\mathbb{R}^d\big)$, this model can be interpreted as follows. The observations $Z_i$, $i=1,\ldots,n$, can be written as a sum of two independent random vectors, that is,
$$Z_i=X_i+\epsilon_iY_i,\quad i=1,\ldots,n, \qquad (1.2)$$
where $X_i$, $i=1,\ldots,n$, are i.i.d. $d$-dimensional random vectors with a common density $f$, to be estimated. The noise variables $Y_i$, $i=1,\ldots,n$, are i.i.d. $d$-dimensional random vectors with a known common density $g$. Finally, $\epsilon_i\in\{0,1\}$, $i=1,\ldots,n$, are i.i.d. Bernoulli random variables with $\mathbb{P}(\epsilon_1=1)=\alpha$, where $\alpha\in[0,1]$ is supposed to be known. The sequences $\{X_i\}$, $\{Y_i\}$ and $\{\epsilon_i\}$ are supposed to be mutually independent.
The observation scheme (1.2) can be viewed as a generalization of two classical statistical models. Indeed, the case $\alpha=1$ corresponds to the standard deconvolution model $Z=X+Y$. The other "extreme" case $\alpha=0$ corresponds to the direct observation scheme $Z=X$. The "intermediate" case $\alpha\in(0,1)$, considered for the first time in Hesse (1995), can be treated as the mathematical modeling of the following situation. One part of the data, namely $(1-\alpha)n$, is observed without noise, while the other part is contaminated by additional noise. If the indexes corresponding to that first part were known, the density $f$ could be estimated using only this part of the data, with the accuracy corresponding to the direct case. The question we address now is: can one obtain the same accuracy if the latter information is not available? We will see that the answer to this question is positive, but the construction of optimal estimation procedures is based upon ideas corresponding to the "pure" deconvolution model.
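The probabilistic reading of the model can be illustrated by a small simulation. The following is only a sketch in dimension $d=1$, assuming the scheme $Z_i=X_i+\epsilon_iY_i$ with Bernoulli($\alpha$) indicators as described above; the function names and the choice of a standard Gaussian signal and a centered Laplace noise are illustrative assumptions, not part of the paper.

```python
import random


def sample_convolution_structure(n, alpha, sample_x, sample_y, seed=0):
    """Draw Z_i = X_i + eps_i * Y_i for i = 1..n, with eps_i ~ Bernoulli(alpha).

    Returns the observations and the (normally unobserved) indicators eps_i.
    """
    rng = random.Random(seed)
    zs, eps = [], []
    for _ in range(n):
        x = sample_x(rng)                       # signal with density f
        y = sample_y(rng)                       # noise with known density g
        e = 1 if rng.random() < alpha else 0    # contamination indicator
        zs.append(x + e * y)
        eps.append(e)
    return zs, eps


def gaussian_x(rng):
    # Hypothetical choice of f: standard Gaussian density.
    return rng.gauss(0.0, 1.0)


def laplace_y(rng):
    # Hypothetical choice of g: centered Laplace law, generated as the
    # difference of two independent Exp(1) variables.
    return rng.expovariate(1.0) - rng.expovariate(1.0)


if __name__ == "__main__":
    zs, eps = sample_convolution_structure(1000, 0.3, gaussian_x, laplace_y)
    print(len(zs), sum(eps) / len(eps))
```

With `alpha=0` every indicator is zero (direct observations), with `alpha=1` every observation is contaminated (pure deconvolution), and for intermediate values roughly a fraction `alpha` of the sample is noisy. The statistician sees only `zs`, never `eps`.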
The convolution structure density model (1.1) will be studied for an arbitrary $g\in\mathbb{L}_1\big(\mathbb{R}^d\big)$ and $f\in\mathbb{F}_g(R)$. Then, except in the case $\alpha=0$, the function $f$ is not necessarily a probability density.
We want to estimate $f$ using the observations $Z^{(n)}=(Z_1,\ldots,Z_n)$. By estimator, we mean any $Z^{(n)}$-measurable map $\hat{f}:\mathbb{R}^n\to\mathbb{L}_p\big(\mathbb{R}^d\big)$. The accuracy of an estimator $\hat{f}$ is measured by the $\mathbb{L}_p$-risk
$$\mathcal{R}^{(p)}_n\big[\hat{f};f\big]:=\Big(\mathbb{E}_f\big\|\hat{f}-f\big\|_p^p\Big)^{1/p},$$
where $\mathbb{E}_f$ denotes the expectation with respect to the probability measure $\mathbb{P}_f$ of the observations $Z^{(n)}$. Also, $\|\cdot\|_p$, $p\in[1,\infty)$, is the $\mathbb{L}_p$-norm on $\mathbb{R}^d$, and without further mention we will assume that $f\in\mathbb{L}_p\big(\mathbb{R}^d\big)$. The objective is to construct an estimator of $f$ with a small $\mathbb{L}_p$-risk.
1.1 Oracle approach via local selection. Objectives of Part I
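Let $\mathcal{F}=\big\{\hat{f}_{\mathfrak{t}},\ \mathfrak{t}\in\mathfrak{T}\big\}$ be a family of estimators built from the observation $Z^{(n)}$. The goal is to propose a data-driven (based on $Z^{(n)}$) selection procedure from the collection $\mathcal{F}$ and to establish for it an $\mathbb{L}_p$-norm oracle inequality. More precisely, we want to construct a $Z^{(n)}$-measurable random map $\hat{\mathfrak{t}}:\mathbb{R}^d\to\mathfrak{T}$ and prove that for any $p\in[1,\infty)$ and $n\geq 1$
$$\mathcal{R}^{(p)}_n\big[\hat{f}_{\hat{\mathfrak{t}}(\cdot)};f\big]\leq C_1\Big\|\inf_{\mathfrak{t}\in\mathfrak{T}}A_n\big(f,\mathfrak{t},\cdot\big)\Big\|_p+C_2n^{-\frac{1}{2}}. \qquad (1.3)$$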
Here and are numerical constants which may depend on and only.
We call (1.3) an $\mathbb{L}_p$-norm oracle inequality obtained by local selection, and in Part I we provide an explicit expression for the functional on its right-hand side in the case where $\mathcal{F}=\mathcal{F}\big(\mathcal{H}^d\big)$ is the family of "kernel-type" estimators parameterized by a collection of multi-bandwidths $\mathcal{H}^d$. The selection from the latter family is done pointwise, i.e. separately for each $x\in\mathbb{R}^d$, which makes it possible to take into account the "local structure" of the function to be estimated. The $\mathbb{L}_p$-norm oracle inequality is then obtained by integrating the pointwise risk of the proposed estimator, which is a kernel estimator whose bandwidth is a multivariate random function. This, in turn, allows us to derive the different minimax adaptive results presented in Part II of the paper. They are all obtained from a single $\mathbb{L}_p$-norm oracle inequality.
Our selection rule presented in Section 2.1 can be viewed as a generalization and modification of some statistical procedures proposed in Kerkyacharian et al. (2001) and Goldenshluger and Lepski (2014). As we mentioned above, establishing (1.3) is the main objective of Part I. We will see however that, although the functional on the right-hand side of (1.3) will be presented explicitly, its computation in particular problems is not a simple task. The main difficulty is that (1.3) is proved without any assumption (except for the model requirements) imposed on the underlying function $f$. It turns out that under some nonrestrictive assumptions imposed on $f$, the obtained bounds can be considerably simplified, see Section 2.3. Moreover, these new inequalities make it easier to understand the methodology for obtaining minimax adaptive results by the use of the oracle approach.
1.2 Adaptive estimation. Objectives of Part II
Let $\mathbb{F}$ be a given subset of $\mathbb{L}_p\big(\mathbb{R}^d\big)$. For any estimator $\tilde{f}_n$, define its maximal risk by $\mathcal{R}^{(p)}_n\big[\tilde{f}_n;\mathbb{F}\big]=\sup_{f\in\mathbb{F}}\mathcal{R}^{(p)}_n\big[\tilde{f}_n;f\big]$; the minimax risk on $\mathbb{F}$ is given by
$$\phi_n(\mathbb{F}):=\inf_{\tilde{f}_n}\mathcal{R}^{(p)}_n\big[\tilde{f}_n;\mathbb{F}\big]. \qquad (1.4)$$
Here, the infimum is taken over all possible estimators. An estimator whose maximal risk is bounded, up to some constant factor, by $\phi_n(\mathbb{F})$ is called minimax on $\mathbb{F}$.
Let $\big\{\mathbb{F}_\vartheta,\ \vartheta\in\Theta\big\}$ be a collection of subsets of $\mathbb{L}_p\big(\mathbb{R}^d,\nu_d\big)$, where $\vartheta$ is a nuisance parameter which may have a very complicated structure.
The problem of adaptive estimation can be formulated as follows: is it possible to construct a single estimator $\hat{f}_n$ which would be simultaneously minimax on each class $\mathbb{F}_\vartheta$, $\vartheta\in\Theta$, i.e.
$$\mathcal{R}^{(p)}_n\big[\hat{f}_n;\mathbb{F}_\vartheta\big]\sim\phi_n(\mathbb{F}_\vartheta),\quad n\to\infty,\quad\forall\vartheta\in\Theta?$$
We refer to this question as the problem of minimax adaptive estimation over the scale $\big\{\mathbb{F}_\vartheta,\ \vartheta\in\Theta\big\}$. If such an estimator exists, we will call it optimally adaptive.
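From oracle approach to adaptation. Let the oracle inequality (1.3) be established. Define
$$R_n\big(\mathbb{F}_\vartheta\big)=\sup_{f\in\mathbb{F}_\vartheta}\Big\|\inf_{\mathfrak{t}\in\mathfrak{T}}A_n\big(f,\mathfrak{t},\cdot\big)\Big\|_p+n^{-\frac{1}{2}},\quad\vartheta\in\Theta.$$
We immediately deduce from (1.3) that for any $\vartheta\in\Theta$
$$\limsup_{n\to\infty}R_n^{-1}\big(\mathbb{F}_\vartheta\big)\,\mathcal{R}^{(p)}_n\big[\hat{f}_{\hat{\mathfrak{t}}(\cdot)};\mathbb{F}_\vartheta\big]<\infty.$$
Hence, the minimax adaptive optimality of the estimator $\hat{f}_{\hat{\mathfrak{t}}(\cdot)}$ is reduced to the comparison of the normalization $R_n\big(\mathbb{F}_\vartheta\big)$ with the minimax risk $\phi_n(\mathbb{F}_\vartheta)$. Indeed, if one proves that for any $\vartheta\in\Theta$
$$\limsup_{n\to\infty}R_n\big(\mathbb{F}_\vartheta\big)\,\phi_n^{-1}(\mathbb{F}_\vartheta)<\infty,$$
then the estimator $\hat{f}_{\hat{\mathfrak{t}}(\cdot)}$ is optimally adaptive over the scale $\big\{\mathbb{F}_\vartheta,\ \vartheta\in\Theta\big\}$.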
Objectives. In the framework of the convolution structure density model, we will be interested in adaptive estimation over the scale
$$\mathbb{F}_\vartheta=\mathbb{N}_{\vec{r},d}\big(\vec{\beta},\vec{L}\big)\cap\mathbb{F}_g(R),\quad\vartheta=\big(\vec{\beta},\vec{r},\vec{L},R\big),$$
where $\mathbb{N}_{\vec{r},d}\big(\vec{\beta},\vec{L}\big)$ is the anisotropic Nikol'skii class (its exact definition will be presented in Part II). Here we only mention that for any $f\in\mathbb{N}_{\vec{r},d}\big(\vec{\beta},\vec{L}\big)$ the coordinate $\beta_i$ of the vector $\vec{\beta}=(\beta_1,\ldots,\beta_d)\in(0,\infty)^d$ represents the smoothness of $f$ in the direction $i$ and the coordinate $r_i$ of the vector $\vec{r}=(r_1,\ldots,r_d)\in[1,\infty]^d$ represents the index of the norm in which $\beta_i$ is measured. Moreover, $\mathbb{N}_{\vec{r},d}\big(\vec{\beta},\vec{L}\big)$ is the intersection of balls in some semi-metric space and the vector $\vec{L}\in(0,\infty)^d$ represents the radii of these balls.
The aforementioned dependence on the direction is usually referred to as anisotropy of the underlying function and the corresponding functional class. The use of the integral norm in the definition of the smoothness is referred to as inhomogeneity of the underlying function. The latter means that the function can be sufficiently smooth on some part of the observation domain and rather irregular on another part. Thus, the adaptive estimation over the scale $\big\{\mathbb{N}_{\vec{r},d}\big(\vec{\beta},\vec{L}\big),\;\big(\vec{\beta},\vec{r},\vec{L}\big)\in(0,\infty)^d\times[1,\infty]^d\times(0,\infty)^d\big\}$ can be viewed as the adaptation to anisotropy and inhomogeneity of the function to be estimated.
Additionally, we will consider \mathbb{F}_{\vartheta}={\mathbb{N}}_{\vec{r},d}\big{(}\vec{\beta},\vec{L}\big{)}\cap\mathbb{F}_{g}(R)\cap\mathbb{B}_{\infty,d}(Q),\;\vartheta=\big{(}\vec{\beta},\vec{r},\vec{L},R,Q\big{)}. It will allow us to understand how the boundedness of the underlying function may affect the accuracy of estimation.
Minimax adaptive estimation is a very active area of mathematical statistics, and the theory of adaptation has developed considerably over the past three decades. Several estimation procedures have been proposed in various statistical models, such as the Efroimovich-Pinsker method, Efroimovich and Pinsker (1984); Efroimovich (1986), the Lepski method, Lepskii (1991), and its generalizations, Kerkyacharian et al. (2001), Goldenshluger and Lepski (2009), unbiased risk minimization, Golubev (1992), wavelet thresholding, Donoho et al. (1996), model selection, Barron et al. (1999); Birgé and Massart (2001), the blockwise Stein method, Cai (1999), aggregation of estimators, Nemirovski (2000), Wegkamp (2003), Tsybakov (2003), Goldenshluger (2009), exponential weights, Leung and Barron (2006), Dalalyan and Tsybakov (2008), and the risk hull method, Cavalier and Golubev (2006), among many others. The interested reader can find a very detailed overview as well as several open problems in adaptive estimation in the recent paper Lepski (2015).
As already mentioned, the convolution structure density model includes density estimation under direct and indirect observations as special cases. In Part II we compare in detail our minimax adaptive results to those already existing in both statistical models. Here we only mention that the most developed results can be found in Goldenshluger and Lepski (2011), Goldenshluger and Lepski (2014) (density model) and in Comte and Lacour (2013), Rebelles (2016) (density deconvolution).
1.3 Assumption on the function $g$
Later on, for any $U\in\mathbb{L}_1\big(\mathbb{R}^d\big)$, let $\check{U}$ denote its Fourier transform. The selection rule from the family of kernel estimators, the $\mathbb{L}_p$-norm oracle inequality as well as the adaptive results presented in Part II are established under the following condition.
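Assumption 1.
(1) if $\alpha\neq 1$ then there exists $\varepsilon>0$ such that
$$\big|1-\alpha+\alpha\check{g}(t)\big|\geq\varepsilon,\quad\forall t\in\mathbb{R}^d;$$
(2) if $\alpha=1$ then there exist $\vec{\mu}=(\mu_1,\ldots,\mu_d)\in(0,\infty)^d$ and $\Upsilon_0>0$ such that
$$\big|\check{g}(t)\big|\geq\Upsilon_0\prod_{j=1}^d\big(1+t_j^2\big)^{-\frac{\mu_j}{2}},\quad\forall t=(t_1,\ldots,t_d)\in\mathbb{R}^d.$$
Recall that the following assumption is well-known in the literature:
$$\Upsilon_0\prod_{j=1}^d\big(1+t_j^2\big)^{-\frac{\mu_j}{2}}\leq\big|\check{g}(t)\big|\leq\Upsilon\prod_{j=1}^d\big(1+t_j^2\big)^{-\frac{\mu_j}{2}},\quad\forall t\in\mathbb{R}^d.$$
It is referred to as a moderately ill-posed statistical problem. In particular, the assumption is satisfied for the centered multivariate Laplace law.
Note that Assumption 1 (1) is very weak and it is verified for many distributions, including centered multivariate Laplace and Gaussian ones. Note also that this assumption always holds with $\varepsilon=1-2\alpha$ if $\alpha<1/2$. Additionally, it holds with $\varepsilon=1-\alpha$ if $\check{g}$ is a real positive function. The latter is true, in particular, for any probability law obtained by an even number of convolutions of a symmetric distribution with itself.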
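The remarks on Assumption 1 (1) can be checked numerically in dimension one. The sketch below reads the condition as a lower bound on $|1-\alpha+\alpha\check{g}(t)|$ (an assumption of this illustration) and evaluates it on a finite grid for three noise laws whose Fourier transforms are standard: the centered Laplace law ($1/(1+t^2)$), the standard Gaussian law ($e^{-t^2/2}$) and the uniform law on $[-1,1]$ ($\sin t/t$, which changes sign). All function names are hypothetical, and a grid check is of course not a proof.

```python
import math


def laplace_ft(t):
    # Fourier transform of the centered Laplace density (1/2) * exp(-|x|).
    return 1.0 / (1.0 + t * t)


def gaussian_ft(t):
    # Fourier transform of the standard Gaussian density.
    return math.exp(-0.5 * t * t)


def uniform_ft(t):
    # Fourier transform of the uniform density on [-1, 1]; takes negative values.
    return 1.0 if t == 0.0 else math.sin(t) / t


def min_modulus(alpha, g_check, grid):
    """Smallest value of |1 - alpha + alpha * g_check(t)| over the grid."""
    return min(abs(1.0 - alpha + alpha * g_check(t)) for t in grid)


if __name__ == "__main__":
    grid = [0.05 * k for k in range(-2000, 2001)]
    for alpha in (0.2, 0.7):
        # Real positive transforms: the modulus stays above 1 - alpha.
        print(alpha, min_modulus(alpha, laplace_ft, grid),
              min_modulus(alpha, gaussian_ft, grid))
    # Sign-changing transform with alpha < 1/2: the modulus stays above 1 - 2*alpha.
    print(0.3, min_modulus(0.3, uniform_ft, grid))
```

The uniform case shows why the cruder bound $1-2\alpha$ is the one available in general: it only uses $|\check{g}(t)|\leq 1$, which holds for any probability density $g$.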
2 Pointwise selection rule and $\mathbb{L}_p$-norm oracle inequality
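To present our results in a unified way, let us define $\boldsymbol{\mu}(\alpha)=\vec{\mu}$, $\alpha=1$, $\boldsymbol{\mu}(\alpha)=(0,\ldots,0)$, $\alpha\in[0,1)$. Let $K:\mathbb{R}^d\to\mathbb{R}$ be a continuous function belonging to $\mathbb{L}_1\big(\mathbb{R}^d\big)$, $\int_{\mathbb{R}^d}K=1$, and such that its Fourier transform $\check{K}$ satisfies the following condition.
Assumption 2.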
There exist and such that
[TABLE]
Set {\cal H}=\big{\{}e^{k},\;k\in{\mathbb{Z}}\big{\}} and let {\cal H}^{d}=\big{\{}\vec{h}=(h_{1},\ldots,h_{d}):\;h_{j}\in{\cal H},j=1,\ldots,d\big{\}}. Define for any
[TABLE]
Later on for any the operations and relations , , ,, , are understood in coordinate-wise sense. In particular means that for any .
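For computations, the bandwidth set can only be used after truncation. The sketch below enumerates a finite piece of the grid $\mathcal{H}^d$ with $\mathcal{H}=\{e^k,\ k\in\mathbb{Z}\}$ and implements the coordinate-wise maximum of two multi-bandwidths. The truncation range `k_min..k_max` and the function names are assumptions of this illustration and do not appear in the paper.

```python
import itertools
import math


def bandwidth_grid(d, k_min, k_max):
    """Finite part of H^d, H = {e^k : k integer}, with k_min <= k <= k_max."""
    axis = [math.exp(k) for k in range(k_min, k_max + 1)]
    return list(itertools.product(axis, repeat=d))


def coord_max(h, eta):
    """Coordinate-wise maximum of two multi-bandwidths."""
    return tuple(max(a, b) for a, b in zip(h, eta))


def coord_leq(h, eta):
    """Coordinate-wise order: h <= eta iff h_j <= eta_j for every j."""
    return all(a <= b for a, b in zip(h, eta))


if __name__ == "__main__":
    grid = bandwidth_grid(d=2, k_min=-3, k_max=0)
    print(len(grid), grid[0], grid[-1])
```

Since the order is only partial when $d>1$, two multi-bandwidths need not be comparable, which is why the coordinate-wise maximum is the natural way to combine them.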
2.1 Pointwise selection rule from the family of kernel estimators
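For any $\vec{h}\in\mathcal{H}^d$ let $M\big(\cdot,\vec{h}\big)$ satisfy the operator equation
$$K_{\vec{h}}(y)=(1-\alpha)M\big(y,\vec{h}\big)+\alpha\int_{\mathbb{R}^d}g(t-y)M\big(t,\vec{h}\big)\,\nu_d(\mathrm{d}t),\quad y\in\mathbb{R}^d, \qquad (2.2)$$
where $K_{\vec{h}}$ denotes the kernel $K$ rescaled by the multi-bandwidth $\vec{h}$.
For any $\vec{h}\in\mathcal{H}^d$ and $x\in\mathbb{R}^d$ introduce the estimator $\widehat{f}_{\vec{h}}(x)=n^{-1}\sum_{i=1}^{n}M\big(Z_i-x,\vec{h}\big)$.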
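In the direct observation case $\alpha=0$ there is no noise to invert, and it seems natural to read the estimator as the usual kernel density estimator, with $M(\cdot,h)$ equal to the rescaled kernel $K_h(y)=h^{-1}K(y/h)$ in dimension one. The sketch below implements only this special case; the triangular kernel, the function names and the restriction to $d=1$ are assumptions made for illustration, and the general case $\alpha>0$ (which requires solving for $M$) is not covered.

```python
def triangular_kernel(u):
    # Continuous, bounded, compactly supported kernel that integrates to one.
    return max(0.0, 1.0 - abs(u))


def rescaled_kernel(y, h, kernel=triangular_kernel):
    """K_h(y) = K(y / h) / h in dimension one."""
    return kernel(y / h) / h


def kernel_estimate(x, zs, h, kernel=triangular_kernel):
    """Direct case (alpha = 0): n^{-1} * sum_i K_h(Z_i - x)."""
    return sum(rescaled_kernel(z - x, h, kernel) for z in zs) / len(zs)


if __name__ == "__main__":
    sample = [-0.4, 0.1, 0.3, 1.2]
    for h in (0.5, 1.0, 2.0):
        print(h, kernel_estimate(0.0, sample, h))
```

Evaluating the same point with several bandwidths, as in the usage lines, is the raw material on which a data-driven selection rule operates.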
Our first goal is to propose, for any given $x\in\mathbb{R}^d$, a data-driven selection rule from the family of kernel estimators $\mathcal{F}\big(\mathcal{H}^d\big)=\big\{\widehat{f}_{\vec{h}}(x),\;\vec{h}\in\mathcal{H}^d\big\}$. Define for any
[TABLE]
Pointwise selection rule
Let be an arbitrary subset of . For any and introduce
[TABLE]
and define
[TABLE]
Our final estimator is and we will call (2.3) the pointwise selection rule.
Note that the final estimator does not necessarily belong to the collection $\big\{\widehat{f}_{\vec{h}}(\cdot),\;\vec{h}\in\mathcal{H}^d\big\}$, since the selected multi-bandwidth is a $d$-variate function of $x$, which is not necessarily constant on $\mathbb{R}^d$. This makes it possible to take into account the "local structure" of the function to be estimated. Moreover, the selected multi-bandwidth is chosen with respect to the observations, and therefore it is a random vector-function.
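To convey the mechanics of a pointwise rule of this kind, here is a generic sketch in the spirit of the comparison schemes of Goldenshluger and Lepski, written for $d=1$ where the coordinate-wise maximum of two bandwidths is simply `max`. It is not the rule (2.3): the exact majorant and numerical constants of the paper are not reproduced, and `estimate` and `majorant` are hypothetical user-supplied callables standing in for the kernel estimator and for an upper bound on its stochastic error.

```python
def select_bandwidth(x, grid, estimate, majorant):
    """Generic pointwise bandwidth selection at the point x (dimension one).

    estimate(x, h): value at x of the estimator with bandwidth h.
    majorant(x, h): assumed bound on the stochastic error of estimate(x, h).
    """
    def excess(h):
        # Largest discrepancy between the estimators with bandwidths
        # max(h, eta) and eta that is not explained by the majorants.
        return max(
            max(0.0,
                abs(estimate(x, max(h, eta)) - estimate(x, eta))
                - majorant(x, max(h, eta)) - majorant(x, eta))
            for eta in grid
        )

    # Trade the bias proxy excess(h) against the stochastic term majorant(x, h).
    return min(grid, key=lambda h: excess(h) + majorant(x, h))


if __name__ == "__main__":
    grid = [0.1, 1.0]
    # Toy inputs: the estimate drifts with h, so the large bandwidth is penalized.
    print(select_bandwidth(0.0, grid, lambda x, h: h, lambda x, h: 0.01 / h))
```

Because the rule is applied separately at every `x`, the selected bandwidth is a function of the point, which is the feature that distinguishes pointwise selection from a single global choice.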
2.2 $\mathbb{L}_p$-norm oracle inequality
Introduce for any and
[TABLE]
where we have put
[TABLE]
For any , and introduce also
[TABLE]
Theorem 1.
Let Assumptions 1 and 2 be fulfilled. Then for any , and ,
[TABLE]
The explicit expression for the constant can be found in the proof of the theorem.
Later on we will pay attention to a special choice for the collection of multi-bandwidths, namely
[TABLE]
More precisely, in Part II, the selection from the corresponding family of kernel estimators will be used for the adaptive estimation over the collection of isotropic Nikol'skii classes. Note also that if then obviously for any
[TABLE]
and we come to the following corollary of Theorem 1.
Corollary 1.
Let Assumptions 1 and 2 be fulfilled. Then for any and
[TABLE]
The oracle inequality proved in Theorem 1 is particularly useful since it does not require any assumption on the underlying function (except for the restrictions ensuring the existence of the model and of the risk). However, the quantity appearing in the right hand side of this inequality, namely
[TABLE]
is not easy to analyze. In particular, in order to use the result of Theorem 1 for adaptive estimation, one has to be able to compute
[TABLE]
for a given class \mathbb{F}\subset{\mathbb{L}}_{p}\big{(}{\mathbb{R}}^{d}\big{)}\cap\mathbb{F}_{g}(R) with either or . It turns out that under some nonrestrictive assumptions imposed on , the obtained bounds can be considerably simplified. Moreover, the new inequality obtained below will allow us to better understand the way for proving adaptive results.
2.3 Some consequences of Theorem 1
Thus, furthermore we will assume that , where
[TABLE]
and denotes the ball of radius in the weak-type space {\mathbb{L}}_{\mathbf{u},\infty}\big{(}{\mathbb{R}}^{d}\big{)}, i.e.
[TABLE]
As usual and obviously . Note also that for any . It is worth noting that the assumption simply means that the common density of the observations belongs to .
Remark 1.
It is easily seen that \mathbb{F}_{g,\infty}\big{(}R,R\|g\|_{\infty}\big{)}=\mathbb{F}_{g}(R) if and . Note also that for any and .
2.3.1 Oracle inequality over
For any and any , let
[TABLE]
[TABLE]
Furthermore let be either or and for any define
[TABLE]
Here is a numerical constant whose explicit expression is given in the beginning of Section 3.2. Introduce for any and
[TABLE]
Remark 2.
Note that and whatever the values of and Indeed, for any and one can find such that
[TABLE]
The latter means that . Thus, we conclude that the quantities , and are well-defined for all .
Also, it is easily seen that for any and
[TABLE]
Put at last for any , , where if and if .
Theorem 2.
Let the assumptions of Theorem 1 be fulfilled and let be a compactly supported function. Then for any , and any
[TABLE]
Here is a universal constant independent of and . Its explicit expression can be found in the proof of the theorem. We remark also that only this constant depends on .
The result announced in Theorem 2 suggests a way for establishing minimax and minimax adaptive properties of the pointwise selection rule given in (2.3). For a given it mostly consists in finding a careful estimate for
[TABLE]
The choice of is a delicate problem and it depends on .
In the next section we present several results concerning some useful upper estimates for the quantities
[TABLE]
We would like to underline that these bounds will be established for an arbitrary and, therefore, they can be applied to the adaptation over different scales of functional classes. In particular, the results obtained below form the basis for our consideration in Part II.
2.3.2 Application to the minimax adaptive estimation
Our objective now is to bound from above for any . All the results in this section will be proved under an additional condition imposed on the kernel .
Assumption 3.
Let be a compactly supported, bounded function and . Then
[TABLE]
Without loss of generality we will assume that and with .
Introduce the following notations. Set for any , and
[TABLE]
where denotes the canonical basis of . For any introduce
[TABLE]
Set for any , and ,
[TABLE]
where \mathbf{c}=(20d)^{-1}\big{[}\max(2c_{\cal K}\|{\cal K}\|_{\infty},\|{\cal K}\|_{1})\big{]}^{-d}. As usual the complement of J\big{(}\vec{h},v\big{)} will be denoted by \bar{J}\big{(}\vec{h},v\big{)}. Furthermore, the summation over the empty set is supposed to be zero.
For any , and introduce
[TABLE]
Theorem 3.
Let assumptions of Theorem 2 be fulfilled and suppose additionally that satisfies Assumption 3. Then for any , , , and any
[TABLE]
If additionally one has also
[TABLE]
Moreover, if one has
[TABLE]
Finally, if all the assertions above remain true for any if one replaces in (2.7)–(2.8) by .
It is important to emphasize that depends only on , and . Note also that the assertions of the theorem remain true if we minimize right hand sides of obtained inequalities w.r.t since their left hand sides are independent of and . In this context it is important to realize that is bounded for any but if there exists such that . Contrary to that for any if and it explains in particular the fourth assertion of the theorem.
Note also that are not involved in the construction of our pointwise selection rule. That means that one and the same estimator can be actually applied on any
[TABLE]
Moreover, the assertion of the theorem is non-asymptotic in nature; we do not suppose that the number of observations is large.
Discussion
As we see, the application of our results to some functional class is mainly reduced to the computation of the functions for some properly chosen . Note however that this task is not necessary for many functional classes used in nonparametric statistics, at least for the classes defined by the help of kernel approximation. Indeed, a typical description of can be summarized as follows. Let , be such that for any . Then, the functional class, say \mathbb{F}_{K}\big{[}\vec{\lambda}(\cdot),\vec{r}\big{]} can be defined as a collection of functions satisfying
[TABLE]
for some . It yields obviously
[TABLE]
and the result of Theorem 3 remains valid if we replace formally by in all the expressions appearing in this theorem. In Part II we show that for some particular kernel , the anisotropic Nikol’skii class {\mathbb{N}}_{\vec{r},d}\big{(}\vec{\beta},\vec{L}\big{)} is included into the class defined by (2.9) with , whatever the values of and .
Denote and remark that in many cases for any for some class parameter and . Then, replacing by in (2.7) and (2.8) and choosing we come to the quantities \boldsymbol{\Lambda}\big{(}v,\mathbf{u},\vartheta\big{)} and \boldsymbol{\Lambda}_{\mathbf{q}}\big{(}v,\vartheta\big{)}, completely determined by the functions , the vector and the number . Therefore, putting
[TABLE]
we deduce from the first and the second assertions of Theorem 3 for any and and
[TABLE]
Since the estimator is completely data-driven and, therefore, is independent of and , the bound (2.10) holds for the scale of functional classes \big{\{}\mathbb{F}_{K}[\vartheta]\big{\}}_{\vartheta}.
If \phi_{n}\big{(}\mathbb{F}_{K}[\vartheta]\big{)} is the minimax risk defined in (1.4) and
[TABLE]
we can assert that our estimator is optimally adaptive over the considered scale \big{\{}\mathbb{F}_{K}[\vartheta],\;\vartheta\in\Theta\big{\}}.
To illustrate the power of our approach, let us consider a particular scale of functional classes defined by (2.9).
Classes of Hölderian type
Let and be given vectors.
Definition 1.
We say that a function $f$ belongs to the class $\mathbb{F}_K\big(\vec{\beta},\vec{L}\big)$, where $K$ satisfies Assumption 3, if $f\in\mathbb{B}_{\infty,d}\big(\max_{j=1,\ldots,d}L_j\big)$ and for any
[TABLE]
We remark that this class is a particular case of the one defined in (2.9), since it corresponds to and for any . Moreover let us introduce the following notations
[TABLE]
Then the following result is a direct consequence of Theorem 3. Its simple and short proof is postponed to Section 3.4.
Assertion 1.
Let the assumptions of Theorem 3 be fulfilled. Then for any , , , and there exists independent of such that
[TABLE]
where we have denoted
[TABLE]
It is interesting to note that the obtained bound, being a very particular case of our consideration in Part II, is completely new if . As we already mentioned, for some particular choice of the kernel , the anisotropic Nikol’skii class {\mathbb{N}}_{\vec{r},d}\big{(}\vec{\beta},\vec{L}\big{)} is included in the class \mathbb{F}_{K^{*}}\big{[}\vec{\lambda}(\cdot),\vec{r}\big{]} with , whatever the values of and . Therefore, the aforementioned result holds on an arbitrary Hölder class {\mathbb{N}}_{\vec{\infty},d}\big{(}\vec{\beta},\vec{L}\big{)}. Comparing the result of Assertion 1 with the lower bound for the minimax risk obtained in Lepski and Willer (2017), we can state that it differs only by some logarithmic factor. Using the modern statistical language, we say that the estimator is nearly optimally-adaptive over the scale of Hölder classes.
3 Proofs
3.1 Proof of Theorem 1
The main ingredients of the proof of the theorem are given in Proposition 1, whose proof is postponed to Section 3.1.2. Introduce for any
[TABLE]
Proposition 1.
Let Assumptions 1 and 2 be fulfilled. Then for any and any
[TABLE]
[TABLE]
The explicit expression of constant and can be found in the proof.
3.1.1 Proof of the theorem
We start by proving the so-called pointwise oracle inequality.
Pointwise oracle inequality. Let and be fixed. We have in view of the triangle inequality
[TABLE]
First, note that obviously and, therefore,
[TABLE]
Moreover by definition, \widehat{U}_{n}\big{(}x,\vec{\eta}\big{)}\leq\widehat{U}^{*}_{n}\big{(}x,\vec{\eta}\big{)} for any .
Next, for any we have obviously \widehat{U}_{n}\big{(}x,\vec{h}\vee\vec{\eta}\big{)}\leq\widehat{U}^{*}_{n}\big{(}x,\vec{h}\big{)}\wedge\widehat{U}^{*}_{n}\big{(}x,\vec{\eta}\big{)}. Thus, we obtain
[TABLE]
Similarly we have
[TABLE]
The definition of implies that for any
[TABLE]
and we get from (3.1), (3.2) and (3.3) for any
[TABLE]
We obviously have for any
[TABLE]
Note that for any
[TABLE]
in view of the structural assumption (1.1) imposed on the density . Note that
[TABLE]
and, therefore, in view of the definition of M\big{(}\cdot,\vec{h}\big{)}, c.f. (2.2), we obtain for any
[TABLE]
We deduce from (3.5) that
[TABLE]
and, therefore, for any
[TABLE]
Set for any and any
[TABLE]
We obtain in view of (3.6) that for any (since obviously for any )
[TABLE]
Note also that in view of the obvious inequality
[TABLE]
We get from (3.4), (3.7) and (3.8)
[TABLE]
It remains to note that
[TABLE]
and we obtain for any and
[TABLE]
Noting that the left hand side of the latter inequality is independent of we obtain for any
[TABLE]
This is the pointwise oracle inequality.
Application of Proposition 1. Set for any
[TABLE]
Applying Proposition 1 we obtain in view of (3.9) and the triangle inequality
[TABLE]
where . The theorem is proved.
3.1.2 Proof of Proposition 1
Since the proof of the proposition is quite long and technical, we divide it into several steps.
Preliminaries
We start the proof with the following simple remark. Let \check{M}\big{(}t,\vec{h}\big{)},t\in{\mathbb{R}}^{d}, denote the Fourier transform of M\big{(}\cdot,\vec{h}\big{)}. Then, we obtain in view of the definition of M\big{(}\cdot,\vec{h}\big{)}
[TABLE]
Note that Assumptions 1 and 2 guarantee that \check{M}\big{(}\cdot,\vec{h}\big{)}\in{\mathbb{L}}_{1}\big{(}{\mathbb{R}}^{d}\big{)}\cap{\mathbb{L}}_{2}\big{(}{\mathbb{R}}^{d}\big{)} for any and, therefore,
[TABLE]
Thus, putting
[TABLE]
we obtain in view of Assumptions 1 and 2 for any
[TABLE]
where M_{2}=\big{[}(2\pi)^{-d}\big{\{}\varepsilon^{-1}\big{\|}\check{K}\big{\|}_{2}\mathrm{1}_{\alpha\neq 1}+\Upsilon_{0}^{-1}\mathbf{k}_{2}\mathrm{1}_{\alpha=1}\big{\}}\big{]}\vee 1. Additionally we deduce from (3.11)
[TABLE]
Let {\cal L}\big{(}\cdot,\vec{h}\big{)} be either M\big{(}\cdot,\vec{h}\big{)} or M^{2}\big{(}\cdot,\vec{h}\big{)} and let {\cal L}_{\infty}\big{(}\vec{h}\big{)} denote either {\cal M}_{\infty}\big{(}\vec{h}\big{)} or {\cal M}^{2}_{\infty}\big{(}\vec{h}\big{)}.
We have in view of (3.11)
[TABLE]
Additionally, we get from (3.11) and (3.12)
[TABLE]
Set \sigma^{{\cal L}}\big{(}x,\vec{h}\big{)}=\sqrt{\int_{{\mathbb{R}}^{d}}{\cal L}^{2}\big{(}t-x,\vec{h}\big{)}\mathfrak{p}(t)\nu_{d}({\rm d}t)} and note that in view of (3.14) for any
[TABLE]
Next, we have in view of (3.13)
[TABLE]
Define for any and
[TABLE]
where, recall, \lambda_{n}\big{(}\vec{h}\big{)}=4\ln(M_{\infty})+6\ln{(n)}+(8p+26)\sum_{j=1}^{d}\big{[}1+\boldsymbol{\mu}_{j}(\alpha)\big{]}\big{|}\ln(h_{j})\big{|}.
Noting that for any we deduce from (3.16) z_{n}\big{(}x,\vec{h}\big{)}\leq\lambda_{n}\big{(}\vec{h}\big{)} for any and, therefore, for any
[TABLE]
First step
Let and be fixed and put .
We obtain for any and by the integration of the Bernstein inequality
[TABLE]
where is the Gamma-function.
Choose z=z_{n}\big{(}x,\vec{h}\big{)}. Noting that for any and
[TABLE]
and taking into account that for any , we get
[TABLE]
Here to get the second inequality we have used (3.13) and put .
Set {\cal X}\big{(}\vec{h}\big{)}=\big{\{}x\in{\mathbb{R}}^{d}:\;\sigma^{{\cal L}}\big{(}x,\vec{h}\big{)}\geq n^{-3/2}{\cal L}_{\infty}\big{(}\vec{h}\big{)}\big{\}}, \bar{{\cal X}}\big{(}\vec{h}\big{)}={\mathbb{R}}^{d}\setminus{\cal X}\big{(}\vec{h}\big{)} and later on the integration over the empty set is supposed to be zero.
We have in view of (3.17), (3.15) and (3.1.2) applied with that for any
[TABLE]
where .
Introduce the following notations. For any set
[TABLE]
and introduce the random event D\big{(}x,\vec{h}\big{)}=\Big{\{}\sum_{i=1}^{n}\Psi_{i}\big{(}x,\vec{h}\big{)}\geq 2\Big{\}}. As usual, the complementary event will be denoted by \bar{D}\big{(}x,\vec{h}\big{)}. Set finally \pi\big{(}x,\vec{h}\big{)}={\mathbb{P}}_{f}\big{\{}\Psi_{1}\big{(}x,\vec{h}\big{)}=1\big{\}}.
We obviously have
[TABLE]
and, therefore,
[TABLE]
Applying the Cauchy-Schwarz inequality, we deduce from (3.20) that
[TABLE]
Using (3.1.2) with and (3.13) we obtain for any x\in\bar{{\cal X}}\big{(}\vec{h}\big{)}
[TABLE]
where we have put C^{(3)}_{p}=\big{[}C^{(1)}_{2p}\big{]}^{\frac{1}{2}}M_{\infty}^{2}.
For any we have in view of the exponential Markov inequality
[TABLE]
We get applying the Tchebychev inequality \pi\big{(}x,\vec{h}\big{)}\leq n^{2}{\cal L}^{-2}_{\infty}\big{(}\vec{h}\big{)}\big{[}\sigma^{{\cal L}}\big{(}x,\vec{h}\big{)}\big{]}^{2}. It yields
[TABLE]
Note that the definition of \bar{{\cal X}}\big{(}\vec{h}\big{)} implies n^{3}{\cal L}^{-2}_{\infty}\big{(}\vec{h}\big{)}\big{[}\sigma^{{\cal L}}\big{(}x,\vec{h}\big{)}\big{]}^{2}<1 for any x\in\bar{{\cal X}}\big{(}\vec{h}\big{)}. Hence, choosing \lambda=\ln 2-2\ln{\big{\{}n^{3/2}{\cal L}^{-1}_{\infty}\big{(}\vec{h}\big{)}\sigma^{{\cal L}}\big{(}x,\vec{h}\big{)}\big{\}}} we have
[TABLE]
It yields, together with (3.13), (3.15) and (3.21) and for any
[TABLE]
where . Putting and noting that we obtain from (3.19) and (3.22) for any
[TABLE]
Choosing and we get from (3.23) and the definition of
[TABLE]
The first assertion of the proposition follows from (3.24) with
Second step
Denoting $\chi\big(x,\vec{h}\big)=\big\{\big|\widehat{\sigma}^{2}\big(x,\vec{h}\big)-\sigma^{2}\big(x,\vec{h}\big)\big|-\mathfrak{U}_{n}\big(x,\vec{h}\big)\big\}_{+}$, where
[TABLE]
and choosing and , we get from (3.23)
[TABLE]
Note that $\sigma^{M^{2}}\big(x,\vec{h}\big)\leq\mathcal{M}_{\infty}\big(\vec{h}\big)\sigma\big(x,\vec{h}\big)$ and, therefore, for any and any
[TABLE]
This implies,
[TABLE]
where we have denoted $\chi^{*}\big(x,\vec{h}\big)=\mathcal{M}^{-1}_{\infty}\big(\vec{h}\big)\chi\big(x,\vec{h}\big)$. Hence
[TABLE]
For the same reason
[TABLE]
Note that the definitions of $\widehat{U}_{n}\big(x,\vec{h}\big)$ and $U_{n}\big(x,\vec{h}\big)$ imply that
[TABLE]
Using the inequality , we get from (3.27), (3.28) and (3.29)
[TABLE]
Choosing in the first inequality and in the second we get for any and
[TABLE]
Remembering that we obtain from (3.30), (3.31), (3.25) and (3.13) for any
[TABLE]
The second and third assertions follow from (3.32) and (3.33) with
3.2 Proof of Theorem 2
Let . Introduce the following notations:
[TABLE]
where , and $c_{3}=2\max\big\{4\ln(M_{\infty}),(8p+26)\max_{j=1,\ldots,d}[1+\boldsymbol{\mu}_{j}(\alpha)]\big\}$.
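For readers who want to evaluate the constant $c_{3}$ for concrete inputs, here is a minimal sketch that transcribes the formula above. The function name is hypothetical, and the values $M_{\infty}$, $p$ and $\boldsymbol{\mu}_{j}(\alpha)$ are treated as given inputs (their definitions lie outside this section).

```python
import math


def c3_constant(m_inf, p, mu_alpha):
    """c_3 = 2 * max{4 ln(M_inf), (8p + 26) * max_j [1 + mu_j(alpha)]}.

    m_inf    : the constant M_infinity (must be positive)
    p        : the index of the L_p loss
    mu_alpha : sequence of the values mu_j(alpha), j = 1, ..., d
    """
    first_term = 4.0 * math.log(m_inf)
    second_term = (8.0 * p + 26.0) * max(1.0 + mu for mu in mu_alpha)
    return 2.0 * max(first_term, second_term)


# Illustrative inputs only (assumptions of this sketch).
example = c3_constant(math.e, 2, [0.0, 0.0])
print(example)  # 84.0
```

In this example the second term dominates: $(8\cdot 2+26)\cdot 1=42$ exceeds $4\ln e=4$, so the result is $2\cdot 42=84$.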
3.2.1 Preliminaries
Recall that for any locally integrable function its strong maximal function is defined as
[TABLE]
where the supremum is taken over all possible rectangles in with sides parallel to the coordinate axes, containing point .
It is well known that the strong maximal operator is of the strong –type for all , i.e., if then and there exists a constant depending on only such that
[TABLE]
Let be defined by (3.34), where, instead of rectangles, the supremum is taken over all possible cubes in with sides parallel to the coordinate axes, containing point . Then, it is known that is of the weak -type, i.e. there exists depending on only such that for any
[TABLE]
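To make the difference between the two maximal functions concrete, here is a minimal brute-force sketch on a finite two-dimensional grid. It is a discrete analogue only: cell averages over index rectangles inside the grid stand in for Lebesgue averages over $\mathbb{R}^{d}$, and the function names are hypothetical.

```python
from itertools import product


def prefix_sums(a):
    """Two-dimensional prefix sums of |a|, used to get rectangle sums in O(1)."""
    rows, cols = len(a), len(a[0])
    s = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            s[i + 1][j + 1] = abs(a[i][j]) + s[i][j + 1] + s[i + 1][j] - s[i][j]
    return s


def maximal_function(a, squares_only=False):
    """Discrete maximal function of a on its grid.

    For every cell, take the supremum of the mean of |a| over all axis-parallel
    index rectangles (or only squares, if squares_only is True) containing it.
    """
    rows, cols = len(a), len(a[0])
    s = prefix_sums(a)
    out = [[0.0] * cols for _ in range(rows)]
    for i0, j0 in product(range(rows), range(cols)):
        for i1, j1 in product(range(i0, rows), range(j0, cols)):
            height, width = i1 - i0 + 1, j1 - j0 + 1
            if squares_only and height != width:
                continue
            total = s[i1 + 1][j1 + 1] - s[i0][j1 + 1] - s[i1 + 1][j0] + s[i0][j0]
            mean = total / (height * width)
            # Every cell of this rectangle is a point "contained" in it.
            for i in range(i0, i1 + 1):
                for j in range(j0, j1 + 1):
                    out[i][j] = max(out[i][j], mean)
    return out


# A unit mass at the centre of a 5 x 5 grid (illustrative data).
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0
strong = maximal_function(grid)                      # rectangles
cubes = maximal_function(grid, squares_only=True)    # squares ("cubes")
print(strong[2][4], cubes[2][4])
```

At the cell two steps to the right of the mass, the best rectangle is the thin $1\times 3$ strip (mean $1/3$), while the smallest admissible square is $3\times 3$ (mean $1/9$). This illustrates why the strong maximal function can be much larger than the one built on cubes.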
The results presented below deal with the weak property of the strong maximal function. The following inequality can be found in Guzman (1975). There exists a constant depending on only such that
[TABLE]
where for all , .
Lemma 1.
For any given , and there exists such that for any
[TABLE]
The proof of the lemma is an elementary consequence of the aforementioned result and is omitted.
Recall also the particular case of the Young inequality for weak-type spaces, see Grafakos (2008), Theorem 1.2.13. For any there exists such that for any $\lambda_{1}\in\mathbb{L}_{1}\big(\mathbb{R}^{d}\big)$ and $\lambda_{2}\in\mathbb{L}_{\mathbf{u},\infty}\big(\mathbb{R}^{d}\big)$ one has
[TABLE]
Auxiliary results
Let us prove several simple facts. First note that for any for any
[TABLE]
Second it is easy to see that for any any ,
[TABLE]
where . Since implies and if , we have
[TABLE]
Then by (3.38) and the second inequality in (3.39), we have:
[TABLE]
Now let us establish two bounds for $\big\|U^{*}_{n}\big(\cdot,\vec{h}\big)\big\|_{\infty}$.
Let . We have in view of the second inequality in (3.11) for any
[TABLE]
It yields for any in view of the first inequality in (3.39)
[TABLE]
Then gathering (3.40), (3.41) and by definition of , we have
[TABLE]
Another bound for $\big\|U^{*}_{n}\big(\cdot,\vec{h}\big)\big\|_{\infty}$ is available regardless of the value of . Indeed for any in view of the first inequality in (3.11)
[TABLE]
It yields for any and any
[TABLE]
Combining this with (3.40) again, we have $\big\|U^{*}_{n}\big(\cdot,\vec{h}\big)\big\|_{\infty}\leq\big[(\sqrt{2c_{3}n}M_{\infty})\vee(c_{2}c_{3})\big]G_{n}\big(\vec{h}\big)$ for any and, therefore,
[TABLE]
To get this it suffices to choose and to make tend to infinity.
Let now . Let us prove that for any , and any
[TABLE]
where we have put $\mathcal{U}^{2}_{n}\big(\cdot,\vec{\eta},f\big)=2n^{-1}\lambda_{n}\big(\vec{\eta}\big)\sigma^{2}\big(\cdot,\vec{\eta}\big)$ and if and if .
Indeed, if , applying the Markov inequality, we obtain in view of the second inequality in (3.11) for any
[TABLE]
Here we have put and to get the last inequality we have used (3.38).
To get a similar result if we remark that $\sigma^{2}\big(\cdot,\vec{\eta}\big)=M^{2}\big(\cdot,\vec{\eta}\big)\star\mathfrak{p}(\cdot)$ and that $M^{2}\big(\cdot,\vec{\eta}\big)\in\mathbb{L}_{1}\big(\mathbb{R}^{d}\big)$ in view of the second inequality in (3.11). It remains to note that implies and to apply the inequality (3.37).
It yields together with the second inequality in (3.11) for any
[TABLE]
Thus, denoting if and if , we get from (3.45) and (3.46)
[TABLE]
It remains to note that since and we can write with for any . It yields together with the first inequality in (3.39)
[TABLE]
Hence, (3.44) with follows from (3.47).
Let be such that . We have
[TABLE]
If , the latter inequality holds with instead of . Thus,
[TABLE]
where we have denoted if and if .
Moreover, putting $T_{\vec{h}}(x,f)=\mathcal{B}_{\vec{h}}(x,f)+49U^{*}_{n}\big(x,\vec{h}\big)$, we deduce from (3.48) and (3.43) that
[TABLE]
3.2.2 Proof of the theorem
For any set $\mathcal{C}_{v}(f)=\big\{x\in\mathbb{R}^{d}:\;\mathbf{T}(x,f)\geq v\big\}$, where we have put . For any given one obviously has
[TABLE]
Denoting $\mathcal{W}_{v}(\vec{h},f)=\big\{x\in\mathbb{R}^{d}:\;49U^{*}_{n}\big(x,\vec{h}\big)\geq 2^{-1}v\big\}$ we obviously have for any and
[TABLE]
The last inequality follows from (3.49). Set $\mathcal{U}^{*}_{n}\big(x,\vec{h},f\big)=\sup_{\vec{\eta}\in\mathcal{H}^{d}:\;\vec{\eta}\geq\vec{h}}\;\mathcal{U}_{n}\big(x,\vec{\eta},f\big)$.
Noting that $U^{*}_{n}\big(x,\vec{h}\big)\leq\mathcal{U}^{*}_{n}\big(x,\vec{h},f\big)+(196a)^{-1}G_{n}\big(\vec{h}\big)$ in view of (3.40), we get
[TABLE]
Applying (3.44) with we deduce from (3.51) that
[TABLE]
Noting that the left hand side of the latter inequality is independent of we get
[TABLE]
Let us establish the following bounds, where is given in the paragraph below.
For any ,
[TABLE]
and for any ,
[TABLE]
Let . We remark that for any in view of (3.42). Thus, we deduce from (3.51), (3.52) and (2.6), taking into account that the left hand sides of both inequalities are independent of
[TABLE]
This inequality and (3.55) ensure that (3.56) and (3.57) hold if .
Let . Applying (3.44) with , we obtain in view of (3.54)
[TABLE]
It yields together with (3.51)
[TABLE]
This inequality and (3.55) ensure that (3.56) holds if .
What is more, we have in view of (3.40) and (3.54) for any
[TABLE]
Moreover, applying (3.44) with , we have for any and
[TABLE]
Hence, if additionally , we have for any
[TABLE]
This yields together with (3.52)
[TABLE]
This inequality ensures that (3.57) holds if .
Recall that implies that . Since additionally , , Lemma 1 as well as (3.36) is applicable and we obtain in view of (3.53)
[TABLE]
It yields for any and
[TABLE]
In the case of the last inequality is obvious and if it follows by integration by parts. The assertion of the theorem follows now from (3.50), where the bound (3.61) is used for any , the estimate (3.56) for any and the bound (3.57) with .
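The proof above integrates bounds on the measure of the level sets $\mathcal{C}_{v}(f)$ over $v$. As a hedged illustration of why level-set bounds control an $\mathbb{L}_{p}$-norm, the sketch below checks the standard layer-cake identity $\int|T|^{p}=p\int_{0}^{\infty}v^{p-1}\big|\{|T|\geq v\}\big|\,dv$ for a piecewise-constant function. Whether (3.50) has exactly this form cannot be recovered from the text as it stands here; the function names and sample values are hypothetical.

```python
def lp_norm_p_direct(values, cell_measure, p):
    """Integral of |T|^p for a function equal to values[k] on cells of equal measure."""
    return sum(abs(t) ** p for t in values) * cell_measure


def lp_norm_p_layer_cake(values, cell_measure, p):
    """Same integral via level sets: p * int_0^inf v^(p-1) * |{|T| >= v}| dv.

    The level-set measure is constant between consecutive distinct values of |T|,
    so the integral in v is computed exactly, interval by interval.
    """
    levels = sorted(set(abs(t) for t in values))
    total, previous = 0.0, 0.0
    for v in levels:
        if v == 0.0:
            continue
        measure = cell_measure * sum(1 for t in values if abs(t) >= v)
        # int_{previous}^{v} p * u^(p-1) du = v^p - previous^p
        total += measure * (v ** p - previous ** p)
        previous = v
    return total


# Illustrative data only (assumptions of this sketch).
values = [0.0, 0.5, -2.0, 2.0, 3.0]
direct = lp_norm_p_direct(values, 0.25, 2.5)
layered = lp_norm_p_layer_cake(values, 0.25, 2.5)
print(direct, layered)
```

The two numbers agree up to rounding, so any bound on the measure of the level sets, integrated against $p\,v^{p-1}$, translates into a bound on the $p$-th power of the $\mathbb{L}_{p}$-norm.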
3.3 Proof of Theorem 3
The proof of the theorem is based essentially on some auxiliary statements formulated in Section 3.3.1 below.
Some properties related to the kernel approximation of the underlying function are summarized in Lemma 2 and in formulae (3.62). The results presented in Lemma 1 and in formulae (3.63) deal with the properties of the strong maximal function. In the subsequent proof , stand for constants depending only on , and .
3.3.1 Auxiliary results
Let denote the set of all the subsets of endowed with the empty set . For any and set and we will write $y=\big(y_{J},y_{\bar{J}}\big)$, where as usual .
For any introduce the matrix where, recall, denotes the canonical basis of . Set also . Later on denotes the matrix with zero entries.
To any and any associate the function
[TABLE]
with the obvious agreement if , which is always the case if .
For any and set $K_{\vec{h},J}(u_{J})=\prod_{j\in J}h^{-1}_{j}\mathcal{K}\big(u_{j}/h_{j}\big)$ and define for any
[TABLE]
where is the Lebesgue measure on . For any set
[TABLE]
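The partial product kernel $K_{\vec{h},J}$ introduced above can be sketched in a few lines. This is an illustration under stated assumptions: the univariate kernel $\mathcal{K}$ is replaced by a hypothetical uniform kernel on $[-1/2,1/2]$, coordinates are indexed from 0 rather than from 1, and the empty product (for $J=\emptyset$) is taken to be 1. None of these choices comes from the paper.

```python
def uniform_kernel(t):
    """Hypothetical univariate kernel: indicator of [-1/2, 1/2], integrates to 1."""
    return 1.0 if -0.5 <= t <= 0.5 else 0.0


def partial_product_kernel(u, h, subset, kernel=uniform_kernel):
    """K_{h,J}(u_J) = prod_{j in J} h_j^{-1} * K(u_j / h_j).

    u, h   : sequences of length d (point and bandwidth vector)
    subset : the index set J, using 0-based indices (an assumption of this sketch)
    """
    value = 1.0
    for j in subset:
        value *= kernel(u[j] / h[j]) / h[j]
    return value


# Illustrative values only: d = 3, J = {0, 1}; the third coordinate is ignored.
u = (0.1, 0.5, 99.0)
h = (0.5, 2.0, 1.0)
partial = partial_product_kernel(u, h, {0, 1})
full = partial_product_kernel(u, h, {0, 1, 2})
print(partial, full)  # 1.0 0.0
```

Only the coordinates indexed by $J$ enter the product, which is why the large third coordinate has no effect on `partial` but forces `full` to vanish.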
Lemma 2.
Let Assumption 3 hold. One can find and a collection of indexes $\big\{j_{1}<j_{2}<\cdots<j_{k}\big\}\subseteq\{1,\ldots,d\}$ such that for any and any
[TABLE]
The proof of the lemma can be found in Lepski (2015), Lemma 2.
Also, let us mention the following bound which is a trivial consequence of the Young inequality and the Fubini theorem. If then for any
[TABLE]
To any and any locally integrable function we associate the operator
[TABLE]
where the supremum is taken over all hyper-rectangles in containing and with sides parallel to the axes.
As we see is the strong maximal operator applied to the function obtained from by fixing the coordinates whose indices belong to . It is obvious that and .
The following result is a direct consequence of (3.35) and of the Fubini theorem. For any there exists such that for any $\lambda\in\mathbb{L}_{\mathbf{t}}\big(\mathbb{R}^{d}\big)$
[TABLE]
Obviously this inequality holds if with .
3.3.2 Proof of the theorem
We start with the following obvious observation. For any , and
[TABLE]
Putting we get for any and in view of (3.65) and assertions of Lemma 2 that
[TABLE]
Thus noting that the right hand side of the first inequality above is independent of , we obtain
[TABLE]
Applying (3.64) with , we have for any in view of the definition of
[TABLE]
We obtain for any , and , applying consecutively the Markov inequality and (3.64) with ,
[TABLE]
Noting that the right hand side of the latter inequality is independent of and the left hand side is independent of , we get
[TABLE]
Note also that in view of (3.67), we have for any
[TABLE]
For any and introduce
[TABLE]
Noting that in view of (3.67) for any and any
[TABLE]
we deduce from (3.3.2) that for any
[TABLE]
It remains to note that, similarly to (3.68), for any
[TABLE]
and to apply (3.64) with to each term of the sum appearing in (3.3.2). All of this together with (3.68), applied with yields for any and
[TABLE]
Noting that the right hand side of the latter inequality is independent of and the left hand side is independent of , we get
[TABLE]
The first assertion of the theorem follows from (3.69), (3.72) and Theorem 2.
Remark that in view of (3.48) and (3.35) implies
[TABLE]
where is the constant which appeared in (3.35). Hence for any and
[TABLE]
Recall that , whatever and , see Remark 2. Hence, in view of (3.74) for any
[TABLE]
It remains to note that the right hand side of the obtained inequality is independent of and the second assertion of the theorem follows from this inequality, (3.69) and Theorem 2.
Since we obtain in view of (3.73) for all
[TABLE]
It yields for any in view of (3.68) if
[TABLE]
Since the right hand side of the obtained inequality is independent of and the left hand side is independent of we conclude that
[TABLE]
The third assertion of the theorem follows now from (3.69), (3.75) and Theorem 2.
We have already seen (Corollary 1) that if . Therefore, by definition of :
[TABLE]
where, recall, . We remark that (3.76) is similar to (3.66) but the maximal operator is not involved in this bound. This, in turn, allows us to consider .
Indeed, similarly to (3.67) we have for any , applying (3.62) with
[TABLE]
We obtain for any , and applying consecutively the Markov inequality and (3.62) with
[TABLE]
We note that the obtained inequality coincides with (3.68) if one replaces by . It remains to remark that . Indeed,
[TABLE]
Therefore, by the monotone convergence theorem and the triangle inequality for any
[TABLE]
The fourth statement of the theorem follows now from (3.69), (3.72), (3.74) and Theorem 2.
3.4 Proof of Assertion 1
Obviously $\mathbb{F}_{K}\big(\vec{\beta},\vec{L}\big)\subset\mathbb{B}_{\infty,d}(L_{\infty})$. Thus, we can choose and , which implies . For any let $\vec{\boldsymbol{h}}(v)=\big(\boldsymbol{h}_{1}(v),\ldots,\boldsymbol{h}_{d}(v)\big)$, where
[TABLE]
and is chosen to satisfy . This in turn implies .
This choice of together with the definition of the class $\mathbb{F}_{K}\big(\vec{\beta},\vec{L}\big)$ implies that
[TABLE]
Moreover, there exists $T_{1}:=T_{1}\big(\vec{\beta}\big)<\infty$ independent of such that
[TABLE]
Then set .
We have in view of (3.79) and (3.80) for all large enough and any
[TABLE]
Setting and we obtain in view of (3.81) and (3.82) for all large enough
[TABLE]
It is worth noting that , which implies , and . Choose
[TABLE]
Since for any in view of (3.83), we deduce from (3.78) and (3.81) for any
[TABLE]
This, in its turn, yields for any
[TABLE]
where we have denoted $T_{5}=T_{2}^{2}\big\{1\vee|p-2-1/\beta(\alpha)|^{-1}\big\}$ and
[TABLE]
Moreover, since in view of (3.83), we deduce from (3.78) that for any
[TABLE]
Finally, putting , we obtain
[TABLE]
Applying the third assertion of Theorem 3, we deduce from (3.84), (3.85) and (3.86) that
[TABLE]
After elementary computations we come to the statement of Assertion 1.
References
- Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113, 301–413.
- Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. (JEMS) 3, 3, 203–268.
- Cai, T. T. (1999). Adaptive wavelet estimation: a block thresholding and oracle inequality approach. Ann. Statist. 27, 3, 898–924.
- Cavalier, L. and Golubev, G.K. (2006). Risk hull method and regularization by projections of ill-posed inverse problems. Ann. Statist. 34, 1653–1677.
- Comte, F. and Lacour, C. (2013). Anisotropic adaptive kernel deconvolution. Ann. Inst. H. Poincaré Probab. Statist. 49, 2, 569–609.
- Dalalyan, A. and Tsybakov, A.B. (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning 72, 39–61.
- Donoho, D. L., Johnstone, I. M., Kerkyacharian, G. and Picard, D. (1996). Density estimation by wavelet thresholding. Ann. Statist. 24, 508–539.
