A combined strategy for multivariate density estimation
Alejandro Cholaquidis, Ricardo Fraiman, Badih Ghattas, Juan Kalemkerian

TL;DR
This paper introduces a novel non-linear aggregation method for multivariate density estimation that improves accuracy by considering neighborhoods of estimated level sets, supported by theoretical and simulation results.
Contribution
It proposes a new density estimation strategy based on level set neighborhoods, addressing computational challenges of existing methods and demonstrating improved mean squared error.
Findings
Lower mean squared error compared to traditional aggregation methods
Theoretical proof of a Central Limit Theorem for the estimator
Validated effectiveness through simulation studies
Abstract
Non-linear aggregation strategies have recently been proposed in response to the problem of how to combine, in a non-linear way, estimators of the regression function (see for instance Biau et al (2016)), classification rules (see Cholaquidis et al (2016)), among others. Although there are several linear strategies to aggregate density estimators, most of them are hard to compute (even in moderate dimensions). Our approach aims to overcome this problem by estimating the density at a point using not just the sample points close to it but those in a neighborhood of the (estimated) level set through that point. We show, both theoretically and through a simulation study, that the mean squared error of our proposal is smaller than that of the aggregated densities. A Central Limit Theorem is also proven.
| 2000 | 4000 | 2000 | 4000 | |
|---|---|---|---|---|
| Kernel | G | G | G | G |
| 0.090 | 0.185 | 0.118 | 0.269 | |
| 0.111 | 0.240 | 0.122 | 0.305 | |
| 0.110 | 0.233 | 0.123 | 0.300 | |
| 0.111 | 0.231 | 0.125 | 0.301 | |
| 0.113 | 0.232 | 0.128 | 0.307 | |
| 0.115 | 0.235 | 0.131 | 0.316 | |
| 0.093 | 0.200 | 0.100 | 0.256 | |
| 0.095 | 0.201 | 0.103 | 0.261 | |
| 0.092 | 0.200 | 0.098 | 0.255 | |
| 0.101 | 0.211 | 0.112 | 0.282 | |
| 0.104 | 0.218 | 0.117 | 0.297 | |
| , | |||
|---|---|---|---|
| 2000 | 2000 | 40000 | |
| Kernel | E | E | E |
| 0.069 | 0.653 | 0.009 | |
| 0.119 | 0.678 | 0.026 | |
| 0.113 | 0.677 | 0.024 | |
| 0.108 | 0.676 | 0.021 | |
| 0.103 | 0.675 | 0.019 | |
| 0.098 | 0.674 | 0.018 | |
| 0.086 | 0.672 | 0.018 | |
| 0.082 | 0.671 | 0.017 | |
| 0.078 | 0.694 | 0.020 | |
| 0.074 | 0.670 | 0.014 | |
| 0.072 | 0.670 | 0.012 |
| d=2 | d=2 | ||
| , | |||
| 2000 | 2000 | 4000 | |
| Kernel | E | E | E |
| 0.023 | 0.082 | 0.008 | |
| 0.034 | 0.126 | 0.050 | |
| 0.033 | 0.120 | 0.045 | |
| 0.032 | 0.114 | 0.041 | |
| 0.030 | 0.108 | 0.037 | |
| 0.029 | 0.104 | 0.034 | |
| 0.027 | 0.094 | 0.036 | |
| 0.026 | 0.089 | 0.032 | |
| 0.029 | 0.085 | 0.035 | |
| 0.024 | 0.086 | 0.026 | |
| 0.024 | 0.089 | 0.024 |
| 0.071 | 0.063 | |
| 0.105 | 0.113 | |
| 0.103 | 0.109 | |
| 0.102 | 0.106 | |
| 0.100 | 0.103 | |
| 0.099 | 0.101 | |
| 0.096 | 0.094 | |
| 0.095 | 0.092 | |
| 0.094 | 0.107 | |
| 0.093 | 0.088 | |
| 0.093 | 0.086 | |

| Min. | 1st Qu. | Median | Mean | Var | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| -0.946 | -0.223 | 0.012 | 0.010 | 0.113 | 0.236 | 1.156 |
**A combined strategy for multivariate density estimation**
Alejandro Cholaquidis$^{a}$, Ricardo Fraiman$^{a}$, Badih Ghattas$^{b}$ and Juan Kalemkerian$^{a}$
$^{a}$ Universidad de la República, Uruguay
$^{b}$ Aix Marseille Université, CNRS, Marseille, France
1 Introduction
Density estimation is still an important and active area of research with many statistical applications, particularly in supervised and unsupervised learning; see for instance the recent book by Chacón and Duong (2018). Although this is a well-studied subject, when the data lie in high or even moderate dimensions, such as or , it becomes a difficult problem due to the well-known curse of dimensionality. This is also the case for non-parametric regression. For this last problem, Biau et al (2016) propose a non-linear aggregation method that is very close in spirit to our approach. In Cholaquidis et al (2016), the authors propose a similar idea for classification. To tackle the density estimation problem, we introduce a new non-linear aggregation method that is well suited to moderate dimensions. Our approach is based on two main ideas:
- 1)
The first idea is to compute the estimator of at the point using an estimator of a -neighborhood of the level set, i.e., instead of a neighborhood of the point ; see the right-hand panel of Figure 1 and also Figure 2. Roughly speaking, under the unrealistic case where is known, the estimator that we propose behaves as if the data were in one dimension. In general is unknown; consequently a loss of efficiency will appear, which is related to the estimation of the -neighborhood.
- 2)
The second idea is to perform a nonlinear aggregation method to combine several estimators. This will improve the behavior when, for instance, the underlying true density is not unimodal and the concentration of mass varies significantly within its support, see Figure 1.
Similar ideas have previously been considered for density estimation and non-parametric regression. With respect to 1), a related approach can be found in Fraiman et al (1997), where it is assumed that the density has a particular shape given by the composition of a univariate density with a depth. The particular case of ellipsoidal densities has also been considered in Stute and Werner (1991). In our setup, no particular structure is required of the multivariate density.
Starting from the seminal work by Breiman (1996), many linear aggregation methods have been developed, see for instance Lecué (2006); Rigollet and Tsybakov (2007); Bourel and Ghattas (2013); Bellec (2017) and the references therein.
The rest of this paper is organised as follows. In Section 2, we introduce the notation and main definitions used throughout the manuscript. In Section 3, we define a nonlinear aggregation estimator that is based on a family of density estimators , which requires a perfect match of the sets for , see (1). This requirement can also be relaxed to a partial matching, as will be defined later. In Subsection 3.3, we prove that the aggregated strategy is asymptotically optimal in the sense that it behaves as well as the best density estimator within the family. In Subsection 3.4 we prove consistency in , under mild regularity conditions on . A Central Limit Theorem is proven in Subsection 3.5 for the case . Lastly, in Section 4 we perform some simulations in dimensions and , which illustrate the good performance of our approach.
2 Notation
Let us consider endowed with the -dimensional Lebesgue measure . For , denotes the open ball of radius , and . Given , we will denote by the parallel set of radius of , that is where , and denotes the Euclidean norm. Given a kernel function we say that is regular if there exists such that , where stands for the indicator function of the set . We will denote and .
3 The combined estimator
Throughout this manuscript, we will assume that is a density, bounded from above, such that . Let be iid random vectors with the same distribution as . We split into two disjoint subsets, namely and with . Let be density estimators computed with the first sample .
For , we define the combined neighborhood of radius , , of a given point to be
[TABLE]
Let us consider the estimator of , given by
[TABLE]
Lastly, the aggregated density estimator is defined as
[TABLE]
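The displayed formulas (1) to (3) are not reproduced in this copy, so the following is only a minimal sketch of one plausible reading of the construction: base estimators are fitted on the first subsample, a point belongs to the combined neighbourhood of `x` when every base estimator takes a value within `eps` of its value at `x`, and the density at `x` is estimated by the fraction of the second subsample in that neighbourhood divided by its Lebesgue measure. The function name `combined_density`, the use of SciPy's Gaussian KDE as base estimator, the equal split, the bounding box and the Monte Carlo volume estimate are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde


def combined_density(x, sample, bw_factors, eps, box_low, box_high,
                     n_mc=10_000, seed=0):
    """Illustrative level-set-neighbourhood density estimate at the point x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    half = len(sample) // 2
    first, second = sample[:half], sample[half:]  # assumed equal split

    # Base estimators: Gaussian KDEs fitted on the first subsample only.
    kdes = [gaussian_kde(first.T, bw_method=b) for b in bw_factors]
    fx = np.array([kde(x[:, None])[0] for kde in kdes])

    def in_neighbourhood(points):
        # True where every base estimator is within eps of its value at x.
        vals = np.stack([kde(points.T) for kde in kdes], axis=1)
        return np.all(np.abs(vals - fx) < eps, axis=1)

    # Fraction of the second subsample falling in the neighbourhood.
    mass = in_neighbourhood(second).mean()

    # Monte Carlo estimate of the Lebesgue measure of the neighbourhood,
    # using uniform points on a box assumed to contain it.
    u = rng.uniform(box_low, box_high, size=(n_mc, x.size))
    volume = np.prod(box_high - box_low) * in_neighbourhood(u).mean()
    return mass / volume if volume > 0 else 0.0


rng = np.random.default_rng(1)
sample = rng.standard_normal((2000, 2))  # bivariate standard normal data
box_low, box_high = np.full(2, -4.0), np.full(2, 4.0)
estimate = combined_density([0.5, 0.5], sample, (0.3, 0.4, 0.5), 0.02,
                            box_low, box_high)
true_value = np.exp(-0.25) / (2 * np.pi)  # standard normal density at x
print(estimate, true_value)
```

In this toy run the neighbourhood is roughly a ring around the level set through `x`, so many more second-sample points contribute than would fall in a small ball around `x`.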
3.1 A smoothed approach
Instead of the indicator function used in (2), we can use a one dimensional kernel . Define,
[TABLE]
where fulfils,
[TABLE]
Then the alternative aggregated estimator is defined as,
[TABLE]
3.2 An alternative approach
Let and , define the -neighborhood of radius , , of a given point to be
[TABLE]
Observe that for we get . We define the -density estimator as in (4), replacing with . Regarding (6) we can define
[TABLE]
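The relaxed neighbourhood of this subsection is not legible in this copy. As a hedged illustration, the sketch below assumes that partial matching means "at least `min_agree` of the base estimators agree within `eps`", in the spirit of the partial consensus used by Biau et al (2016); the function name `partial_match` and the parameter `min_agree` are hypothetical.

```python
import numpy as np


def partial_match(values_at_points, values_at_x, eps, min_agree):
    """Flag points where at least min_agree estimators are within eps of x."""
    # values_at_points has shape (m, k): k base estimators at m points.
    close = np.abs(values_at_points - values_at_x) < eps
    return close.sum(axis=1) >= min_agree


vals = np.array([[0.10, 0.20, 0.30],
                 [0.10, 0.20, 0.90],
                 [0.50, 0.60, 0.90]])
at_x = np.array([0.11, 0.21, 0.31])
all_match = partial_match(vals, at_x, 0.05, 3)  # full agreement required
two_match = partial_match(vals, at_x, 0.05, 2)  # two of three suffice
```

Requiring all estimators to agree corresponds to the perfect-match neighbourhood; lowering `min_agree` enlarges the neighbourhood.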
3.3 Optimality
The following proposition (which is the analogue, in our setup, of Proposition 2.1 in Biau et al (2016)) states that the combined estimator behaves as well as the best density estimator, except for the second term, which will be proven to converge to 0 (see Theorem 1).
Proposition 1
With the notation introduced previously
[TABLE]
where is independent of and
Remark 1
It is easy to see that Proposition 1 holds for .
We now state two lemmas, whose proofs are given in the Appendix. The first proves that the theoretical estimator converges in to , as . The second proves that, under point-wise consistency of the density estimators, for all , with probability one as for almost all . To do this, let us introduce the following condition,
- K1
A random variable with distribution and density fulfils , if for all .
Lemma 1
Under , if is any sequence of functions (possibly random) such that a.s., then,
[TABLE]
Lemma 2
Let be a random variable with distribution whose density is continuous. Let be an iid sample of and be continuous density estimators (built from ), such that for all , a.s., as for almost all w.r.t. . Let , then for all such that
- •
for , a.s., as .
- •
as .
- •
is compact, and is compact a.s.
we have
[TABLE]
and
[TABLE]
3.4 Consistency
Because the first term on the right-hand side of (8) does not depend on and converges to $0$ if at least one of the density estimators is mean square error consistent, to prove the consistency (taking the limit first in and second in ) of the aggregated estimator, we only need to prove that the second term on the right-hand side of (8) converges to $0$ in mean square error. This is done in the following theorem, under mild regularity restrictions on , as well as point-wise convergence of the density estimators and uniform equicontinuity. Recall that a sequence of functions $g_n$ is said to be uniformly equicontinuous if for all $\epsilon>0$ there exists $\delta>0$ such that for all $n$, $|g_n(x)-g_n(y)|<\epsilon$ whenever $\|x-y\|<\delta$. All of the proofs of this section are given in the Appendix.
3.4.1 Assumptions
We will consider the following set of assumptions
- H1
The density estimators based on a sample fulfil H1 if with probability one, the sequences are uniformly equicontinuous and the of the uniform equicontinuity is bounded from below by .
- H2
The density estimators based on a sample fulfil H2 if for almost all w.r.t. , , a.s., for all as .
Theorem 1
Let us assume K1, H1 and H2. We assume also that, for all such that for all , there exists such that for all , the set is compact, the set is compact a.s., and as . Let as , then,
[TABLE]
Theorem 2
Under the hypotheses of Theorem 1. If is a kernel function, bounded from above by , that fulfils (5), then
[TABLE]
Remark 2
- 1)
Corollary 1 in Einmahl and Mason (2005) proves that if is uniformly continuous (with some regularity conditions on the kernel ), then the multidimensional kernel density estimator converges almost surely and uniformly, provided a suitable bandwidth is chosen. It is easy to see that this entails the required uniform equicontinuity of the estimators.
- 2)
Following the same ideas used to prove Theorem 1, it can be proven that (see Theorem 2 in the Appendix).
If the density is bounded from below by a positive constant, we have the following direct corollary.
Corollary 1
Under the hypotheses of Theorem 1, if in addition the density fulfils that there exists and such that for all , then,
[TABLE]
3.5 A central limit theorem
The following theorem states that a central limit theorem for holds, when the limit is taken first as and second as .
Theorem 3
Let such that . Then, for all such that and
- •
**
- •
There exists , such that is compact, and is compact a.s. for all .
- •
There exists , such that as for all .
- •
as .
- •
for all as .
We have,
[TABLE]
Remark 3
The previous theorem depends on the computation of , which is in general unknown. However, in some cases it can be estimated by means of a Monte Carlo method, using a uniformly consistent estimator of and a sample of uniformly distributed random variables on a box containing the set .
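The Monte Carlo recipe in this remark can be sketched as follows. The quantity being approximated is assumed here to be the Lebesgue measure of a level-set neighbourhood, since the symbol is not legible in this copy. To make the result checkable, the sketch plugs in the true bivariate standard normal density in place of a uniformly consistent estimator; the helper name `mc_measure`, the box and the value of `eps` are illustrative choices.

```python
import numpy as np


def mc_measure(indicator, box_low, box_high, n_points=200_000, seed=0):
    """Monte Carlo estimate of the Lebesgue measure of {y : indicator(y)}."""
    rng = np.random.default_rng(seed)
    box_low = np.asarray(box_low, dtype=float)
    box_high = np.asarray(box_high, dtype=float)
    u = rng.uniform(box_low, box_high, size=(n_points, box_low.size))
    # Box volume times the fraction of uniform points that land in the set.
    return np.prod(box_high - box_low) * indicator(u).mean()


def normal_density(y):
    # Bivariate standard normal density, standing in for a density estimate.
    return np.exp(-0.5 * (y ** 2).sum(axis=1)) / (2 * np.pi)


eps = 0.02
level = normal_density(np.array([[1.0, 0.0]]))[0]
approx = mc_measure(lambda y: np.abs(normal_density(y) - level) < eps,
                    [-4.0, -4.0], [4.0, 4.0])

# For this density the set is an annulus, so its area is known exactly.
r_in_sq = -2 * np.log(2 * np.pi * (level + eps))
r_out_sq = -2 * np.log(2 * np.pi * (level - eps))
exact = np.pi * (r_out_sq - r_in_sq)
print(approx, exact)
```

The accuracy depends on the fraction of the box occupied by the set, so a tight box reduces the number of uniform points needed.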
For the special case of spherical densities (i.e., for some ), the limit of can easily be derived, as is proven in the following proposition.
Proposition 2
Let be a spherical density such that is strictly decreasing and is continuous on a neighbourhood containing , then, for all such that , and ,
[TABLE]
where is Euler’s gamma function.
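For orientation, the gamma function usually enters expressions of this kind through the surface measure of a sphere. The identity below is the standard one; whether the limit in the proposition is written in exactly this form cannot be recovered from this copy, so it is given only as background, using the notation $\mathcal{H}_{d-1}$ and $\mathcal{B}(0,r)$ that appears in the proof.

```latex
% Surface measure of the sphere of radius r in R^d (standard identity).
\mathcal{H}_{d-1}\big(\partial\mathcal{B}(0,r)\big)
  = \frac{2\pi^{d/2}}{\Gamma(d/2)}\, r^{d-1},
\qquad\text{e.g. } 2\pi r \text{ for } d=2
\text{ and } 4\pi r^{2} \text{ for } d=3 .
```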
4 Models used for the simulations
First, we performed a simulation study to assess, in terms of the mean square error, the proposed aggregation strategy. Second, we evaluate the departure from normality in Theorem 3. Five different distributions were considered:
- 1
Beta, with density
- 2
Normal, with mean [math] and a diagonal covariance matrix.
- 3
Weibull, with density $\left(\frac{k}{\lambda^{k}}\right)^{d}(x_{1}\cdots x_{d})^{d(k-1)}\exp\Big(-\sum_{i=1}^{d}(x_{i}/\lambda)^{k}\Big)$.
- 4
Convex combination of two bi-variate normal distributions with the same covariance matrix : where given below.
- 5
Convex combination of two bi-variate normal distributions: where
[TABLE]
To build the estimator we considered five kernel-based density estimators computed with different bandwidths . The bandwidths were chosen as follows: first we compute the leave-one-out cross-validation bandwidth based on a sample of size . This value is kept fixed across the replicates. Then, we fix , , , and . We choose for and for . Let us denote the leave-one-out cross-validation bandwidth based on the whole sample. The parameter was selected as follows: first, we compute the five kernel-based density estimators based on the whole sample , with bandwidths , , , and ; we then compute their average, i.e.,
Finally, is the value that minimizes (where denotes the norm).
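The text does not say which leave-one-out criterion is used to pick the bandwidth. As a hedged illustration, the sketch below assumes leave-one-out likelihood cross-validation with a Gaussian product kernel and a single scalar bandwidth, searched over a grid; the function names, the grid and the toy data are hypothetical, and a least-squares criterion would be an equally reasonable reading.

```python
import numpy as np


def loo_log_likelihood(data, h):
    """Leave-one-out log-likelihood of a Gaussian KDE with bandwidth h."""
    n, d = data.shape
    sq_dist = ((data[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-sq_dist / (2 * h * h)) / ((2 * np.pi) ** (d / 2) * h ** d)
    np.fill_diagonal(kernel, 0.0)        # leave each point out of its own fit
    loo_density = kernel.sum(axis=1) / (n - 1)
    return np.log(np.maximum(loo_density, 1e-300)).sum()


def select_bandwidth(data, grid):
    """Return the grid value with the largest leave-one-out log-likelihood."""
    scores = [loo_log_likelihood(data, h) for h in grid]
    return float(grid[int(np.argmax(scores))])


rng = np.random.default_rng(0)
data = rng.standard_normal((300, 2))     # toy bivariate normal sample
grid = np.linspace(0.05, 1.5, 30)
h_cv = select_bandwidth(data, grid)
print(h_cv)
```

Once a reference bandwidth of this kind is available, a family of base estimators can be obtained by scaling it, which is the role the five bandwidths play in the procedure described above.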
The measures are computed by a Monte Carlo method using and uniformly distributed random variables in dimensions and , respectively. Two different kernels were considered: the Epanechnikov kernel (denoted by E), and the Gaussian kernel (denoted by G). The whole procedure is repeated 100 times. We report , estimated from a test sample, uniformly distributed over a rectangle in , or .
Figure 3 shows the level sets for the density of model 4 (left panel) and model 5 (right panel).
The results in Tables 1 to 4 show that, except for some results in Table 1, the best performance is obtained by the aggregated estimator. Moreover, Table 1 also shows that in 5 out of 8 models, this is also the case.
To illustrate Theorem 3, we have considered a bi-variate normal distribution with variance and mean . We fixed the point as for the normal distribution. The measure is computed exactly from the density, in this case and for , . We have chosen , , computed and repeated 1000 times. The estimator was built using (with Gaussian kernel). The density of the was estimated using a kernel density estimator, with a univariate Gaussian kernel with bandwidth 0.15. The result is shown in Figure 4 and the summary is given in Table 5. The p-value of the Shapiro-Wilk test is and for the Lilliefors test of normality.
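A normality check of this kind can be sketched as follows. The replicates below are not the paper's: they are synthetic standardized binomial counts, used as a stand-in because the proof of Theorem 1 notes that the number of sample points in a fixed set is binomial. The sample size, the probability `p` and the number of replicates are illustrative assumptions, and only the Shapiro-Wilk test from SciPy is shown.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_points, p, n_rep = 2000, 0.25, 1000    # illustrative values only

# Number of sample points falling in a fixed set, over n_rep replications.
counts = rng.binomial(n_points, p, size=n_rep)

# Centre and scale each count so that the limit law is standard normal.
z = (counts - n_points * p) / np.sqrt(n_points * p * (1 - p))

statistic, p_value = stats.shapiro(z)
summary = {
    "min": z.min(),
    "median": np.median(z),
    "mean": z.mean(),
    "var": z.var(ddof=1),
    "max": z.max(),
}
print(summary, statistic, p_value)
```

The summary dictionary mirrors the kind of five-number description reported in Table 5, and a kernel density plot of `z` against the standard normal density would play the role of Figure 4.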
5 Final Remarks
- 1)
We have proposed a new non-linear aggregation method for density estimation and we have studied its asymptotic properties and limit distribution under quite mild assumptions.
- 2)
The aggregated estimator behaves better than all of the density estimators used for the aggregation.
- 3)
We performed a small simulation study, which shows that in all cases the aggregation outperforms the kernel rules built with the sample . In addition, in most of them, it outperforms the kernel rules built with the whole sample .
- 4)
Our simulations suggest that the second term in (8) is negligible with respect to the first term, but we were not able to prove this point theoretically.
- 5)
The aggregation method is quite sensitive to the choice of the parameter ; however, as shown in the tables, our recipe seems to work well.
6 Appendix
Proof of Proposition 1
We start by decomposing the objective function,
[TABLE]
Conditionally on and , is constant, then
[TABLE]
From it follows $\mathbb{E}\big[\mathbb{E}[f(X)|\mathbf{f_{k}}(X)]\,\big|\,\mathbf{f_{k}}(X),\mathcal{D}_{n}\big]=\mathbb{E}\big[f(X)|\mathbf{f_{k}}(X),\mathcal{D}_{n}\big]$, then $\mathbb{E}\big[T(\mathbf{f_{k}}(X))-f(X)|\mathbf{f_{k}}(X),\mathcal{D}_{n}\big]=0$, which implies that
[TABLE]
Lastly since , where the minimum is taken over the functions such that , (8) follows.
Proof of Theorem 1
First, let us bound the second term in (8),
[TABLE]
If we denote
[TABLE]
then
[TABLE]
Conditionally on and , the random variable is binomial with probability . Then
[TABLE]
We can bound a.s., where . Since are uniformly equicontinuous, for all there exists such that for all , and all , if . By hypothesis we can assume that . Then, , from where it follows,
[TABLE]
Regarding , observe that,
[TABLE]
To prove this , by Lemma 1, it is enough to show that
[TABLE]
Because is bounded due to the dominated convergence theorem, it is enough to prove that
[TABLE]
Let such that for all , . For such there exists such that for all , the sets are compact a.s, and is compact, then by Lemma 2 a.s. By using again the dominated convergence theorem, we obtain that, with probability one,
[TABLE]
Lastly, (12) follows from the fact that for all and for all , .
Proof of Lemma 1
By Lemma 1.3 in Alonso and Brambila-Paz (1998), it is enough to prove that the sequence of -algebras , -approaches ; i.e., for all there exists such that as . Since is enough to consider with . Let us consider, for , and . Let , clearly . For all ,
[TABLE]
[TABLE]
Because the sequence of sets is increasing as increases,
[TABLE]
if ,
[TABLE]
Because the sequence of sets decreases as ,
[TABLE]
if , then
[TABLE]
Because , for all there exists such that . By (14) and (13), for all , $\mathbb{P}\big((\cap_{r=1}^{\infty}B_{i}(1/r))\triangle B\big)\leq\delta+\mathbb{P}(H_{i}(\epsilon_{1})^{c})$. For all , as , from where it follows that $\lim_{i\rightarrow\infty}\mathbb{P}\big((\cap_{r=1}^{\infty}B_{i}(1/r))\triangle B\big)=0$.
Proof of Lemma 2
Let us fix such that as , and for all . First we will prove that, for all , with probability one, for large enough . Since and are uniformly continuous on we can take such that for all there exists such that . Let and such that . Then,
[TABLE]
Meanwhile,
[TABLE]
Let large enough such that for all , , and . Then,
Now let us prove that for all such that , a.s, as . Proceeding as before, let us consider such that for all there exists such that for all . Let and , such that for all . Then, for all ,
[TABLE]
Let be large enough such that for all , and . Because , , from where it follows that . Lastly implies (9). To prove (10) let and small enough such that , for that , with probability one, we can take large enough such that,
Proof of Theorem 2
Let us denote , then,
[TABLE]
Observe that
[TABLE]
If we bound , then we get
[TABLE]
Proceeding as in Theorem 1, it is proved that a.s.
Regarding observe that,
[TABLE]
To prove that , by Lemma 1, it is enough to show that
[TABLE]
Because and are bounded, by the dominated convergence theorem it is enough to prove that
[TABLE]
Indeed, by using the dominated convergence theorem again, together with Lemma 2, we obtain that,
[TABLE]
Meanwhile,
[TABLE]
Lastly, (15) follows from (5) and the fact that for all .
Proof of Theorem 3
First, let us prove that
[TABLE]
By Lemma 2 we get that for all fixed ,
[TABLE]
and
[TABLE]
Lastly, from
[TABLE]
(16) follows. Let us write
[TABLE]
Since then for small enough. From (17) and (18) together with (19) and , it follows that a.s.
Let us denote the random sample in (i.e for ). From (4) together with (17) and (18),
[TABLE]
Let us denote
[TABLE]
We will use the following version of the central limit theorem for triangular arrays.
Theorem (Lindeberg). Let be independent random variables such that for all , and . Let us denote If for all
[TABLE]
then
[TABLE]
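Since the displays of this statement are missing from this copy, the standard textbook form of the Lindeberg condition for a triangular array is recalled below for reference. The normalisation and the indicator symbol $\mathbb{I}$ are assumptions and may differ from those used by the authors.

```latex
% Row-wise independent X_{n,1},...,X_{n,n} with E[X_{n,i}] = 0
% and s_n^2 = \sum_{i=1}^{n} E[X_{n,i}^2].
\text{If, for all } \epsilon>0,\quad
\frac{1}{s_n^{2}}\sum_{i=1}^{n}
  \mathbb{E}\big[X_{n,i}^{2}\,\mathbb{I}_{\{|X_{n,i}|>\epsilon s_n\}}\big]
  \longrightarrow 0 \quad\text{as } n\to\infty,
\qquad\text{then}\qquad
\frac{1}{s_n}\sum_{i=1}^{n}X_{n,i}\ \xrightarrow{\ d\ }\ N(0,1).
```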
Let us consider , define for ,
[TABLE]
[TABLE]
so
[TABLE]
From we get as , and then using (19) it follows,
[TABLE]
To prove (20)
[TABLE]
Since is bounded and it follows that . Then, with probability one, for large enough, and (20) follows.
Now from (21), as ,
[TABLE]
so,
[TABLE]
where . Then, from (22),
[TABLE]
Proof of Proposition 2
Let us denote the level set of , since is spherical for all being the support of , and then, using that is a continuous function on a neighbourhood containing , for small enough, is a continuous function on . Since is bounded and then is Lipschitz on for small enough. By Theorem 3.1 in Federer (1959) we can write,
[TABLE]
where denotes the -dimensional Hausdorff measure. Let us prove that is continuous for all on a neighbourhood of . Observe that implies $\mathcal{H}_{d-1}(\partial L(\gamma))=\mathcal{H}_{d-1}\big(\partial\mathcal{B}(0,\sqrt{h^{-1}(\gamma)})\big)$. Since is strictly decreasing there exists (which is continuous on a neighbourhood of because is differentiable) and as . By the Mean Value Theorem
[TABLE]
Let us denote and , then from (23)
[TABLE]
Since is decreasing we get that decreases (to ) as decreases. From the continuity of at it follows that as . Analogously, . Lastly, from the continuity of and (24) we get that
[TABLE]
where we have used that
[TABLE]
References
- Alonso, A. and Brambila-Paz, F. (1998). $L^{p}$-continuity of conditional expectations. Journal of Mathematical Analysis and Applications, Vol. 221, pp. 161–176.
- Biau, G., Fischer, A., Guedj, B. and Malley, J. (2016). COBRA: A combined regression strategy. Journal of Multivariate Analysis, Vol. 146, pp. 18–28.
- Bellec, P. C. (2017). Optimal exponential bounds for aggregation of density estimators. Bernoulli, Vol. 23(1), pp. 219–248.
- Bourel, M. and Ghattas, B. (2013). Aggregating density estimators: an empirical study. Open Journal of Statistics, Vol. 3, pp. 334–355.
- Breiman, L. (1996). Bagging predictors. Machine Learning, Vol. 24, No. 2, pp. 123–140.
- Chacón, J. E. and Duong, T. (2018). Multivariate Kernel Smoothing and Its Applications. Chapman and Hall/CRC. ISBN 9781498763011.
- Cholaquidis, A., Fraiman, R., Kalemkerian, J. and Llop, P. (2016). A nonlinear aggregation type classifier. Journal of Multivariate Analysis, Vol. 146, pp. 269–281.
- Federer, H. (1959). Curvature measures. Trans. Amer. Math. Soc., Vol. 93, pp. 418–491.
