A combined strategy for multivariate density estimation

Alejandro Cholaquidis; Ricardo Fraiman; Badih Ghattas; Juan Kalemkerian

arXiv:1812.04343·math.ST·December 24, 2018

TL;DR

This paper introduces a novel non-linear aggregation method for multivariate density estimation that improves accuracy by considering neighborhoods of estimated level sets, supported by theoretical and simulation results.

Contribution

It proposes a new density estimation strategy based on level set neighborhoods, addressing computational challenges of existing methods and demonstrating improved mean squared error.

Findings

01

Lower mean squared error compared to traditional aggregation methods

02

Theoretical proof of a Central Limit Theorem for the estimator

03

Validated effectiveness through simulation studies

Abstract

Non-linear aggregation strategies have recently been proposed in response to the problem of how to combine, in a non-linear way, estimators of the regression function (see for instance Biau et al (2016)), classification rules (see Cholaquidis et al (2016)), among others. Although there are several linear strategies to aggregate density estimators, most of them are hard to compute (even in moderate dimensions). Our approach aims to overcome this problem by estimating the density at a point $x$ using not just sample points close to $x$ but in a neighborhood of the (estimated) level set $f(x)$. We show, both theoretically and through a simulation study, that the mean squared error of our proposal is smaller than that of the aggregated densities. A Central Limit Theorem is also proven.

Tables

Table 1: $L_{2}$ error over $100$ replicates for model 1. For the test sample, we used $2000$ uniformly distributed points on $[0,1]^{2}$. The measure of $B(\epsilon,x)$ is estimated using $20000$ uniformly distributed points on $[0,1]^{2}$ for $d=2$, and $40000$ for $d=4$.

	$α = 1.5 = β$		$α = 2.5 = β$
	$d = 2$	$d = 4$	$d = 2$	$d = 4$
$n, k$	2000	4000	2000	4000
Kernel	G	G	G	G
${\hat{f}}_{agg}$	0.090	0.185	0.118	0.269
$f_{k, γ_{1}}$	0.111	0.240	0.122	0.305
$f_{k, γ_{2}}$	0.110	0.233	0.123	0.300
$f_{k, γ_{3}}$	0.111	0.231	0.125	0.301
$f_{k, γ_{4}}$	0.113	0.232	0.128	0.307
$f_{k, γ_{5}}$	0.115	0.235	0.131	0.316
$f_{n, γ_{1}}$	0.093	0.200	0.100	0.256
$f_{n, γ_{2}}$	0.095	0.201	0.103	0.261
$f_{n, hcvu}$	0.092	0.200	0.098	0.255
$f_{n, γ_{4}}$	0.101	0.211	0.112	0.282
$f_{n, γ_{5}}$	0.104	0.218	0.117	0.297

Table 2: $L_{2}$ error over $100$ replicates for model 3 (Weibull). The test sample consists of 2000 uniformly distributed points on $[0,4]^{2}$ in dimension 2 and 4000 uniformly distributed points on $[0,5]^{4}$ in dimension 4.

	$λ = 1, k = 1$	$λ = 1, k = 0.5$	$λ = 1, k = 1$
	$d = 2$	$d = 2$	$d = 4$
$n, k$	2000	2000	40000
Kernel	E	E	E
${\hat{f}}_{agg}$	0.069	0.653	0.009
$f_{k, γ_{1}}$	0.119	0.678	0.026
$f_{k, γ_{2}}$	0.113	0.677	0.024
$f_{k, γ_{3}}$	0.108	0.676	0.021
$f_{k, γ_{4}}$	0.103	0.675	0.019
$f_{k, γ_{5}}$	0.098	0.674	0.018
$f_{n, γ_{1}}$	0.086	0.672	0.018
$f_{n, γ_{2}}$	0.082	0.671	0.017
$f_{n, hcvu}$	0.078	0.694	0.020
$f_{n, γ_{4}}$	0.074	0.670	0.014
$f_{n, γ_{5}}$	0.072	0.670	0.012

Table 3: $L_{2}$ error for model 2 over $100$ replicates using Epanechnikov’s kernel. In $\mathbb{R}^{2}$, $\Sigma=\text{diag}(\sigma_{1}^{2},\sigma_{2}^{2})$, and in $\mathbb{R}^{4}$, $\Sigma=\text{diag}(\sigma_{1}^{2},\sigma_{2}^{2},\sigma_{3}^{2},\sigma_{4}^{2})$. The test sample consists of 2000 uniformly distributed points on $[-3,3]\times[-2,1.5]$ for the first column and on $[-2.5,2.5]^{2}$ for the second column. For dimension $4$, the test sample is uniformly distributed on $[-3,3]\times[-1.5,1.5]^{2}\times[-3,3]$.

	$d = 2$	$d = 2$	$d = 4$
	$σ_{1} = 1, σ_{2} = 0.25$	$σ_{1} = 1, σ_{2} = 0.1$	$σ_{1} = 1 = σ_{4}$, $σ_{2} = 0.5 = σ_{3}$
$n = k$	2000	2000	4000
Kernel	E	E	E
${\hat{f}}_{agg}$	0.023	0.082	0.008
$f_{k, γ_{1}}$	0.034	0.126	0.050
$f_{k, γ_{2}}$	0.033	0.120	0.045
$f_{k, γ_{3}}$	0.032	0.114	0.041
$f_{k, γ_{4}}$	0.030	0.108	0.037
$f_{k, γ_{5}}$	0.029	0.104	0.034
$f_{n, γ_{1}}$	0.027	0.094	0.036
$f_{n, γ_{2}}$	0.026	0.089	0.032
$f_{n, hcvu}$	0.029	0.085	0.035
$f_{n, γ_{4}}$	0.024	0.086	0.026
$f_{n, γ_{5}}$	0.024	0.089	0.024

Table 4: $L_{2}$ error over $100$ replicates using Epanechnikov’s kernel for model 4 (first column) and model 5 (second column) in $\mathbb{R}^{2}$ with $\mu_{1}=(-1,1)$ and $\mu_{2}=(1,1)$. In both models, we used $2000$ uniformly distributed points for the test sample, in model 4 on $[-1,1]^{2}$ and in model 5 on $[-2,2]^{2}$. In both models, $k=l=2000$ and the measures of the sets $B(\epsilon,x)$ are estimated using $20000$ uniformly distributed points in $[-2,2]^{2}$.

	$σ_{1}^{2} = .5, σ_{2}^{2} = .3, ρ = .2$
${\hat{f}}_{agg}$	0.071	0.063
$f_{k, γ_{1}}$	0.105	0.113
$f_{k, γ_{2}}$	0.103	0.109
$f_{k, γ_{3}}$	0.102	0.106
$f_{k, γ_{4}}$	0.100	0.103
$f_{k, γ_{5}}$	0.099	0.101
$f_{n, γ_{1}}$	0.096	0.094
$f_{n, γ_{2}}$	0.095	0.092
$f_{n, hcvu}$	0.094	0.107
$f_{n, γ_{4}}$	0.093	0.088
$f_{n, γ_{5}}$	0.093	0.086

Table 5: Summary of the simulations for Theorem 3.

Min.	1st. Qu	Median	Mean	Var	3rd Qu	Max
-0.946	-0.223	0.012	0.010	0.113	0.236	1.156

Equations

B(\epsilon,x)=\bigcap_{m=1}^{M}\big\{y\in\mathbb{R}^{d}:|f_{m}(y)-f_{m}(x)|<\epsilon\big\}.

N(\epsilon,x)=\frac{1}{l}\sum_{j=1}^{l}\mathbb{I}_{B(\epsilon,x)}(X_{k+j}).

\hat{f}_{\text{agg}}(x)=\frac{N(\epsilon,x)}{\mu(B(\epsilon,x))}=\frac{\sum_{j=1}^{l}\mathbb{I}_{B(\epsilon,x)}(X_{k+j})}{l\,\mu(B(\epsilon,x))}.

\tilde{N}(\epsilon,x)=\frac{1}{l}\sum_{j=1}^{l}\prod_{m=1}^{M}K\Big[\frac{f_{m}(X_{k+j})-f_{m}(x)}{\epsilon}\Big],

\lim_{\epsilon\to 0}\frac{1}{\mu(B^{*}(\epsilon,X))}\int_{B^{*}(\epsilon,X)}K\Big[\frac{f(t)-f(X)}{\epsilon}\Big]^{M}dt=1\quad a.s.

\tilde{f}_{\text{agg}}(x)=\frac{\tilde{N}(\epsilon,x)}{\mu(B(\epsilon,x))}.

B^{\eta}(\epsilon,x)=\Big\{y\in\mathbb{R}^{d}:\frac{1}{M}\sum_{m=1}^{M}\mathbb{I}_{\{|f_{m}(y)-f_{m}(x)|<\epsilon\}}\geq 1-\eta\Big\}.

\tilde{f}_{\text{agg},\eta}(x)=\frac{1}{l}\sum_{j=1}^{l}\mathbb{I}_{\{X_{k+j}\in B^{\eta}(\epsilon,x)\}}\prod_{m=1}^{M}K\Big[\frac{f_{m}(X_{k+j})-f_{m}(x)}{\epsilon}\Big].

\mathbb{E}|\hat{f}_{\text{agg}}(X)-f(X)|^{2}\leq\min_{m=1,\dots,M}\mathbb{E}|f_{m}(X)-f(X)|^{2}+\mathbb{E}|\hat{f}_{\text{agg}}(X)-T(\mathbf{f_{k}}(X))|^{2},

\lim_{i\rightarrow\infty}\mathbb{E}\big|\mathbb{E}[f(X)|f_{i}(X)]-f(X)\big|^{2}=0.

\mu(B(\epsilon,x))\rightarrow\mu(B^{*}(\epsilon,x))\quad a.s.,\text{ as }k\rightarrow\infty,

P_{X}(B(\epsilon,x))\rightarrow P_{X}(B^{*}(\epsilon,x))\quad a.s.,\text{ as }k\rightarrow\infty.

\lim_{\epsilon\rightarrow 0}\lim_{l\rightarrow\infty}\mathbb{E}|\hat{f}_{\text{agg}}(X)-T(\mathbf{f_{k}}(X))|^{2}=0.

\lim_{\epsilon\rightarrow 0}\lim_{l\rightarrow\infty}\mathbb{E}|\tilde{f}_{\text{agg}}(X)-T(\mathbf{f_{k}}(X))|^{2}=0.

\lim_{\epsilon\rightarrow 0}\lim_{l\rightarrow\infty}\mathbb{E}\int_{\mathbb{R}^{d}}|\hat{f}_{\text{agg}}(x)-f(x)|^{2}dx=0.

\lim_{l\rightarrow\infty}\lim_{k\rightarrow\infty}\sqrt{\mu(B^{*}(\epsilon,x))\,l}\,\big[\hat{f}_{\text{agg}}(x)-f(x)\big]\stackrel{d}{=}N(0,f(x)).

\lim_{l\rightarrow\infty}\frac{\mu(B^{*}(\epsilon,x))}{2\epsilon}=\frac{2\pi^{d/2}\|x\|^{d-1}}{\Gamma(\frac{d}{2})\|\nabla f(x)\|},

\Sigma_{1}=\begin{pmatrix}\sigma_{1}^{2}&\rho\\ \rho&\sigma_{2}^{2}\end{pmatrix}\quad\text{and}\quad\Sigma_{2}=\begin{pmatrix}\sigma_{2}^{2}&-\rho\\ -\rho&\sigma_{1}^{2}\end{pmatrix}.

\mathbb{E}|\hat{f}_{\text{agg}}(X)-f(X)|^{2}=\mathbb{E}|\hat{f}_{\text{agg}}(X)-T(\mathbf{f_{k}}(X))|^{2}+\mathbb{E}|T(\mathbf{f_{k}}(X))-f(X)|^{2}+2\mathbb{E}\Big[\big[\hat{f}_{\text{agg}}(X)-T(\mathbf{f_{k}}(X))\big]\big[T(\mathbf{f_{k}}(X))-f(X)\big]\Big].

\mathbb{E}\Big[\big[\hat{f}_{\text{agg}}(X)-T(\mathbf{f_{k}}(X))\big]\big[T(\mathbf{f_{k}}(X))-f(X)\big]\Big]=\mathbb{E}\Big[\big[\hat{f}_{\text{agg}}(X)-T(\mathbf{f_{k}}(X))\big]\mathbb{E}\big[T(\mathbf{f_{k}}(X))-f(X)|\mathbf{f_{k}}(X),\mathcal{D}_{n}\big]\Big].

\mathbb{E}\Big[\big[\hat{f}_{\text{agg}}(X)-T(\mathbf{f_{k}}(X))\big]\big[T(\mathbf{f_{k}}(X))-f(X)\big]\Big]=0.

\mathbb{E}|\hat{f}_{\text{agg}}(X)-T(\mathbf{f_{k}}(X))|^{2}\leq 2\mathbb{E}\Bigg[\frac{1}{l\mu(B(\epsilon,X))}\sum_{j=1}^{l}\Big(\mathbb{I}_{B(\epsilon,X)}(X_{k+j})-\mathbb{E}[\mathbb{I}_{B(\epsilon,X)}(X_{k+1})|\mathcal{D}_{k},\mathbf{f_{k}}(X)]\Big)\Bigg]^{2}+2\mathbb{E}\Bigg[\frac{1}{\mu(B(\epsilon,X))}\mathbb{E}\Big[\mathbb{I}_{B(\epsilon,X)}(X_{k+1})|\mathcal{D}_{k},\mathbf{f_{k}}(X)\Big]-T(\mathbf{f_{k}}(X))\Bigg]^{2}=I_{1}+I_{2}.

h(\mathbf{f_{k}}(X),\mathcal{D}_{k})=\mathbb{E}\Bigg[\Big[\sum_{j=1}^{l}\mathbb{I}_{B(\epsilon,X)}(X_{k+j})-l\,\mathbb{E}[\mathbb{I}_{B(\epsilon,X)}(X_{k+1})|\mathbf{f_{k}}(X),\mathcal{D}_{k}]\Big]^{2}\Big|\mathbf{f_{k}}(X),\mathcal{D}_{k}\Bigg],

I_{1}=2\mathbb{E}\Bigg[\frac{1}{(l\mu(B(\epsilon,X)))^{2}}h(\mathbf{f_{k}}(X),\mathcal{D}_{k})\Bigg].

I_{1}\leq 2\mathbb{E}\Bigg[\frac{C}{l\mu(B(\epsilon,X))}\Bigg]\leq 2\frac{C}{\omega_{d}l\delta_{0}^{d}}\rightarrow 0\quad\text{ as }l\rightarrow\infty.

I_{2}=2\mathbb{E}\Bigg\{\frac{1}{\mu(B(\epsilon,X))}\int_{B(\epsilon,X)}f(t)dt-T(\mathbf{f_{k}}(X))\Bigg\}^{2}.

\lim_{\epsilon\rightarrow 0}\lim_{l\rightarrow\infty}\mathbb{E}\Bigg(\frac{1}{\mu(B(\epsilon,X))}\int_{B(\epsilon,X)}f(t)dt-f(X)\Bigg)^{2}=0.

\lim_{\epsilon\rightarrow 0}\lim_{l\rightarrow\infty}\frac{1}{\mu(B(\epsilon,X))}\int_{B(\epsilon,X)}f(t)dt=f(X)\quad a.s.

Full text

**A combined strategy for multivariate density estimation**

Alejandro Cholaquidis$^{a}$, Ricardo Fraiman$^{a}$, Badih Ghattas$^{b}$ and Juan Kalemkerian$^{a}$

$^{a}$ Universidad de la República, Uruguay

$^{b}$ Aix Marseille Université, CNRS, Marseille, France

Abstract

Non-linear aggregation strategies have recently been proposed in response to the problem of how to combine, in a non-linear way, estimators of the regression function (see for instance Biau et al (2016)), classification rules (see Cholaquidis et al (2016)), among others. Although there are several linear strategies to aggregate density estimators, most of them are hard to compute (even in moderate dimensions). Our approach aims to overcome this problem by estimating the density at a point $x$ using not just sample points close to $x$ but in a neighborhood of the (estimated) level set $f(x)$ . We show, both theoretically and through a simulation study, that the mean squared error of our proposal is smaller than that of the aggregated densities. A Central Limit Theorem is also proven.

1 Introduction

Density estimation is still an important and active area of research with many statistical applications, particularly in supervised and unsupervised learning; see for instance the recent book by Chacón and Duong (2018). Although this is a well-studied subject, the problem becomes difficult when the data live in high or even moderate dimensions, such as $\mathbb{R}^{2}$ or $\mathbb{R}^{4}$, due to the well-known curse of dimensionality. This is also the case for non-parametric regression. For this last problem, Biau et al (2016) propose a non-linear aggregation method that is very close in spirit to our approach. In Cholaquidis et al (2016), the authors propose a similar idea for classification. To tackle the density estimation problem, we introduce a new non-linear aggregation method that is well designed for moderate dimensions. Our approach is based on two main ideas:

1)

The first idea is to compute the estimator of $f$ at the point $x$ using an estimator of an $\epsilon$-neighborhood of the level set, i.e., $\{y:|f(y)-f(x)|\leq\epsilon\}\equiv B^{*}(\epsilon,x)$, instead of a neighborhood of the point $x$; see the right-hand panel of Figure 1 and also Figure 2. Roughly speaking, in the unrealistic case where $B^{*}(\epsilon,x)$ is known, the estimator that we propose behaves as if the data were one-dimensional. In general $B^{*}(\epsilon,x)$ is unknown, and estimating the $\epsilon$-neighborhood causes a loss of efficiency.

2)

The second idea is to perform a nonlinear aggregation method to combine several estimators. This will improve the behavior when, for instance, the underlying true density $f$ is not unimodal and the concentration of mass varies significantly within its support, see Figure 1.

Similar ideas have previously been considered for density estimation and non-parametric regression. With respect to 1), a related approach can be found in Fraiman et al (1997), where it is assumed that the density has a particular shape given by the composition of a univariate density with a depth. The particular case of ellipsoidal densities has also been considered in Stute and Werner (1991). In our setup, no particular structure is required of the multivariate density.

Starting from the seminal work by Breiman (1996), many linear aggregation methods have been developed, see for instance Lecué (2006); Rigollet and Tsybakov (2007); Bourel and Ghattas (2013); Bellec (2017) and the references therein.

The rest of this paper is organised as follows. In Section 2, we introduce the notation and main definitions used throughout the manuscript. In Section 3, we define a nonlinear aggregation estimator $\hat{f}_{\text{agg}}$ that is based on a family of density estimators $f_{1},\ldots,f_{M}$ and requires a perfect match of the sets $\{y:|f_{j}(y)-f_{j}(x)|\leq\epsilon\}$ for $j=1,\ldots,M$, see (1). This requirement can be relaxed to a partial matching, as will be defined later. In Subsection 3.3, we prove that the aggregated strategy is asymptotically optimal in the sense that it behaves as well as the best density estimator within the family. In Subsection 3.4 we prove consistency in $L^{2}$, under mild regularity conditions on $f$. A Central Limit Theorem is proven in Subsection 3.5 for the case $M=1$. Lastly, in Section 4 we perform some simulations in dimensions $2$ and $4$, which illustrate the good performance of our approach.

2 Notation

Let us consider $\mathbb{R}^{d}$ endowed with the $d$-dimensional Lebesgue measure $\mu$. For $r>0$, $\mathcal{B}(x,r)$ denotes the open ball of radius $r$ centred at $x$, and $\omega_{d}=\mu(\mathcal{B}(0,1))$. Given $A\subset\mathbb{R}^{d}$, we will denote by $B(A,\epsilon)$ the parallel set of radius $\epsilon$ of $A$, that is, $B(A,\epsilon)=\{y\in\mathbb{R}^{d}:d(y,A)<\epsilon\}$ where $d(y,A)=\inf_{a\in A}\|y-a\|$ and $\|\cdot\|$ denotes the Euclidean norm. Given a kernel function $K:\mathbb{R}^{d}\rightarrow\mathbb{R}^{+}$, we say that $K$ is regular if there exist $0<c_{1}<c_{2}<\infty$ such that $c_{1}\mathbb{I}_{\mathcal{B}(0,1)}(x)\leq K(x)\leq c_{2}\mathbb{I}_{\mathcal{B}(0,1)}(x)$, where $\mathbb{I}_{A}$ stands for the indicator function of the set $A$. We will denote $K_{h}(x)=K(x/h)$ and $B^{*}(\epsilon,x)=\{y:|f(x)-f(y)|<\epsilon\}$.

3 The combined estimator

Throughout this manuscript, we will assume that $f$ is a density, bounded from above, such that $f(X)\in L^{2}$ . Let $\mathcal{D}_{n}=\{X_{1},\dots,X_{n}\}$ be iid random vectors with the same distribution $f$ as $X$ . We split $\mathcal{D}_{n}$ into two disjoint subsets, namely $\mathcal{D}_{k}=\{X_{1},\dots,X_{k}\}$ and $\mathcal{E}_{l}=\{X_{k+1},\dots,X_{n}\}$ with $l=n-k$ . Let $\mathbf{f_{k}}(x)=(f_{1}(x),\dots,f_{M}(x))$ be $M$ density estimators computed with the first sample $\mathcal{D}_{k}$ .

For $\epsilon>0$ , we define the combined neighborhood of radius $\epsilon$ , $B(\epsilon,x)$ , of a given point $x$ to be

$$B(\epsilon,x)=\bigcap_{m=1}^{M}\big\{y\in\mathbb{R}^{d}:|f_{m}(y)-f_{m}(x)|<\epsilon\big\}. \tag{1}$$

Let us consider the estimator of $P_{X}(B(\epsilon,x))$ , given by

$$N(\epsilon,x)=\frac{1}{l}\sum_{j=1}^{l}\mathbb{I}_{B(\epsilon,x)}(X_{k+j}). \tag{2}$$

Lastly, the aggregated density estimator is defined as

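$$\hat{f}_{\text{agg}}(x)=\frac{N(\epsilon,x)}{\mu(B(\epsilon,x))}=\frac{\sum_{j=1}^{l}\mathbb{I}_{B(\epsilon,x)}(X_{k+j})}{l\,\mu(B(\epsilon,x))}. \tag{4}$$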
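The construction above (split the sample, build $M$ estimators on the first part, count points of the second part in the combined neighborhood, divide by its Lebesgue measure) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `gaussian_kde` and `f_agg`, the product-Gaussian kernel, the bandwidths 0.4 and 0.5, the sample sizes, the value of `eps` and the bounding box are all hypothetical choices, and the measure of the neighborhood is approximated by Monte Carlo with uniform draws on that box.

```python
import math
import random


def gaussian_kde(sample, h):
    """Product-Gaussian kernel density estimator built from `sample` (a list of points)."""
    d = len(sample[0])
    norm = len(sample) * (h * math.sqrt(2.0 * math.pi)) ** d

    def estimate(x):
        total = 0.0
        for p in sample:
            sq = sum((a - b) ** 2 for a, b in zip(x, p))
            total += math.exp(-0.5 * sq / (h * h))
        return total / norm

    return estimate


def f_agg(x, estimators, second_sample, eps, box, n_mc, rng):
    """Indicator-based aggregated estimate N(eps, x) / mu(B(eps, x))."""
    fx = [f(x) for f in estimators]

    def in_B(y):
        # y belongs to B(eps, x) when every estimator is within eps of its value at x.
        return all(abs(f(y) - v) < eps for f, v in zip(estimators, fx))

    # N(eps, x): fraction of the second sample falling in B(eps, x).
    n_hat = sum(in_B(y) for y in second_sample) / len(second_sample)

    # Monte Carlo approximation of the Lebesgue measure of B(eps, x) inside `box`.
    volume = math.prod(hi - lo for lo, hi in box)
    hits = sum(in_B([rng.uniform(lo, hi) for lo, hi in box]) for _ in range(n_mc))
    mu_hat = volume * hits / n_mc
    return n_hat / mu_hat if mu_hat > 0 else float("nan")


# Toy run on a standard bivariate normal (all settings are illustrative assumptions).
rng = random.Random(0)
draw = lambda: [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
first = [draw() for _ in range(200)]    # D_k, used to build the estimators
second = [draw() for _ in range(200)]   # E_l, used to count points in B(eps, x)
estimators = [gaussian_kde(first, h) for h in (0.4, 0.5)]
estimate = f_agg([0.5, 0.5], estimators, second, eps=0.02,
                 box=[(-4.0, 4.0), (-4.0, 4.0)], n_mc=2000, rng=rng)
```

The box must contain the combined neighborhood for the Monte Carlo step to approximate its measure; here that is plausible because the neighborhood only contains points where the estimators are close to their value at $x$.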

3.1 A smoothed approach

Instead of the indicator function used in (2), we can use a one dimensional kernel $K$ . Define,

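$$\tilde{N}(\epsilon,x)=\frac{1}{l}\sum_{j=1}^{l}\prod_{m=1}^{M}K\Big[\frac{f_{m}(X_{k+j})-f_{m}(x)}{\epsilon}\Big],$$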

where $K$ fulfils,

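$$\lim_{\epsilon\to 0}\frac{1}{\mu(B^{*}(\epsilon,X))}\int_{B^{*}(\epsilon,X)}K\Big[\frac{f(t)-f(X)}{\epsilon}\Big]^{M}dt=1\quad a.s. \tag{5}$$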

Then the alternative aggregated estimator is defined as,

$$\tilde{f}_{\text{agg}}(x)=\frac{\tilde{N}(\epsilon,x)}{\mu(B(\epsilon,x))}. \tag{6}$$
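As a small illustration of the smoothed count, the sketch below replaces the indicator by a product of one-dimensional kernels evaluated at the scaled differences of the estimators. The function and kernel names are hypothetical, the toy "estimators" are simple functions chosen only to make the arithmetic easy to follow, and the code does not check whether a given kernel satisfies the condition required of $K$; with the indicator of $(-1,1)$ as kernel, the smoothed count reduces to the plain count.

```python
import math


def smoothed_count(x, estimators, second_sample, eps, kernel):
    """Smoothed analogue of N(eps, x): average of products of one-dimensional kernels."""
    fx = [f(x) for f in estimators]
    total = 0.0
    for y in second_sample:
        total += math.prod(kernel((f(y) - v) / eps) for f, v in zip(estimators, fx))
    return total / len(second_sample)


def box_kernel(u):
    # Indicator of (-1, 1): recovers the plain count N(eps, x).
    return 1.0 if abs(u) < 1.0 else 0.0


def quadratic_kernel(u):
    # Hypothetical compactly supported kernel with K(0) = 1, for illustration only.
    return max(0.0, 1.0 - u * u)


# Toy "estimators" on the real line (points are one-element lists).
toy_estimators = [lambda y: y[0], lambda y: 2.0 * y[0]]
toy_sample = [[0.0], [0.1], [0.5]]
n_box = smoothed_count([0.0], toy_estimators, toy_sample, 0.3, box_kernel)
n_quad = smoothed_count([0.0], toy_estimators, toy_sample, 0.3, quadratic_kernel)
```

In this toy case two of the three points lie in the combined neighborhood, so the box kernel gives 2/3, while the quadratic kernel down-weights the point at 0.1.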

3.2 An alternative approach

Let $\epsilon>0$ and $0\leq\eta<1$ , define the $\eta$ -neighborhood of radius $\epsilon$ , $B^{\eta}(\epsilon,x)$ , of a given point $x$ to be

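$$B^{\eta}(\epsilon,x)=\Big\{y\in\mathbb{R}^{d}:\frac{1}{M}\sum_{m=1}^{M}\mathbb{I}_{\{|f_{m}(y)-f_{m}(x)|<\epsilon\}}\geq 1-\eta\Big\}.$$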

Observe that for $\eta=0$ we get $B^{\eta}(\epsilon,x)=B(\epsilon,x)$. We define the $\eta$-density estimator, $\hat{f}_{\text{agg},\eta}(x)$, as in (4), replacing $B(\epsilon,x)$ with $B^{\eta}(\epsilon,x)$. Regarding (6), we can define

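$$\tilde{f}_{\text{agg},\eta}(x)=\frac{1}{l}\sum_{j=1}^{l}\mathbb{I}_{\{X_{k+j}\in B^{\eta}(\epsilon,x)\}}\prod_{m=1}^{M}K\Big[\frac{f_{m}(X_{k+j})-f_{m}(x)}{\epsilon}\Big].$$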
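The partial-matching rule can be made concrete with a short membership test: a point belongs to the $\eta$-neighborhood when at least a fraction $1-\eta$ of the estimators agree within $\epsilon$. The function name and the four toy "estimators" below are hypothetical and serve only to show how $\eta=0$ (perfect match) differs from $\eta>0$.

```python
def in_eta_neighborhood(y, x, estimators, eps, eta):
    """True when at least a fraction 1 - eta of the estimators agree within eps."""
    agree = sum(abs(f(y) - f(x)) < eps for f in estimators)
    return agree / len(estimators) >= 1.0 - eta


# Four toy "estimators"; the last one disagrees strongly between 0.0 and 0.1.
toy = [lambda t: t, lambda t: t + 1.0, lambda t: 2.0 * t, lambda t: 10.0 * t]
strict = in_eta_neighborhood(0.1, 0.0, toy, eps=0.5, eta=0.0)    # perfect match required
relaxed = in_eta_neighborhood(0.1, 0.0, toy, eps=0.5, eta=0.25)  # one disagreement allowed
```

Here three of the four functions agree within `eps`, so the point is rejected under perfect matching and accepted once a quarter of the estimators is allowed to disagree.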

3.3 Optimality

The following proposition (the analogue, in our setup, of Proposition 2.1 in Biau et al (2016)) states that the combined estimator behaves as well as the best density estimator, up to the second term, which will be proven to converge to 0 (see Theorem 1).

Proposition 1

With the notation introduced previously

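$$\mathbb{E}|\hat{f}_{\text{agg}}(X)-f(X)|^{2}\leq\min_{m=1,\dots,M}\mathbb{E}|f_{m}(X)-f(X)|^{2}+\mathbb{E}|\hat{f}_{\text{agg}}(X)-T(\mathbf{f_{k}}(X))|^{2}, \tag{8}$$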

where $X$ is independent of $\mathcal{D}_{n}$ and $T(\mathbf{f_{k}}(X))=\mathbb{E}(f(X)|\mathbf{f_{k}}(X)).$

Remark 1

It is easy to see that Proposition 1 holds for $\tilde{f}_{\text{agg}}(X)$ .

We now state two lemmas, whose proofs are given in the Appendix. The first proves that the theoretical estimator $T(\mathbf{f_{k}}(X))$ converges in $L^{2}$ to $f(X)$ as $k\rightarrow\infty$. The second proves that, under point-wise consistency of the density estimators, for all $\epsilon>0$, with probability one $\mu(B(\epsilon,x))\rightarrow\mu(B^{*}(\epsilon,x))$ as $k\rightarrow\infty$ for almost all $x$. To do this, let us introduce the following condition.

K1

A random variable $X$ with distribution $P_{X}$ and density $f$ fulfils $K1$ , if $\mathbb{P}(f(X)=a)=0$ for all $a\in\mathbb{R}$ .

Lemma 1

Under $K1$, if $\{f_{i}\}_{i}$ is any sequence of functions (possibly random) such that $\lim_{i\rightarrow\infty}f_{i}(X)=f(X)$ a.s., then

$$\lim_{i\rightarrow\infty}\mathbb{E}\big|\mathbb{E}[f(X)|f_{i}(X)]-f(X)\big|^{2}=0.$$

Lemma 2

Let $X$ be a random variable with distribution $P_{X}$ whose density $f$ is continuous. Let $\mathcal{D}_{k}=\{X_{1},\dots,X_{k}\}$ be an iid sample of $X$ and $f_{1},\dots,f_{M}$ be continuous density estimators (built from $\mathcal{D}_{k}$), such that for all $m=1,\ldots,M$, $|f_{m}(x)-f(x)|\rightarrow 0$ a.s., as $k\rightarrow\infty$, for almost all $x$ w.r.t. $\mu$. Let $\epsilon>0$; then for all $x$ such that

• $f_{m}(x)\rightarrow f(x)$ for $m=1,\ldots,M$, a.s., as $k\rightarrow\infty$,

• $\mu[B^{*}(\epsilon+\gamma,x)\setminus B^{*}(\epsilon-\gamma,x)]\rightarrow 0$ as $\gamma\rightarrow 0$,

• $\overline{B^{*}(\epsilon,x)}$ is compact, and $\overline{B(\epsilon,x)}$ is compact a.s.,

we have

$$\mu(B(\epsilon,x))\rightarrow\mu(B^{*}(\epsilon,x))\quad a.s.,\text{ as }k\rightarrow\infty,$$

and

$$P_{X}(B(\epsilon,x))\rightarrow P_{X}(B^{*}(\epsilon,x))\quad a.s.,\text{ as }k\rightarrow\infty.$$

3.4 Consistency

The first term on the right-hand side of (8) does not depend on $l$ and converges to $0$ if at least one of the density estimators is mean square error consistent. Hence, to prove the consistency of the aggregated estimator (taking the limit first in $l$ and then in $\epsilon$), we only need to prove that the second term on the right-hand side of (8) converges to $0$. This is done in the following theorem, under mild regularity restrictions on $P_{X}$, point-wise convergence of the density estimators, and uniform equicontinuity. Recall that a sequence of functions $\{g_{k}\}_{k}$ is said to be uniformly equicontinuous if for all $\epsilon>0$ there exists $\delta=\delta(\epsilon)$ such that for all $k$, $|g_{k}(x)-g_{k}(y)|<\epsilon$ whenever $\|x-y\|<\delta$. All of the proofs of this section are given in the Appendix.

3.4.1 Assumptions

We will consider the following set of assumptions

H1

The density estimators $f_{1},\dots,f_{M}$ based on a sample $\mathcal{D}_{k}$ fulfil H1 if, with probability one, the sequences $\{f_{1}\}_{k},\ldots,\{f_{M}\}_{k}$ are uniformly equicontinuous and the $\delta=\delta(\epsilon)$ of the uniform equicontinuity is bounded from below by $\delta_{0}(\epsilon)>0$.

H2

The density estimators $f_{1},\dots,f_{M}$ based on a sample $\mathcal{D}_{k}$ fulfil H2 if, for almost all $x$ w.r.t. $\mu$, $f_{j}(x)\rightarrow f(x)$ a.s. for all $j=1,\dots,M$ as $k\rightarrow\infty$.

Theorem 1

Let us assume K1, H1 and H2. We assume also that, for all $x$ such that $f_{m}(x)\rightarrow f(x)$ for all $m=1,\ldots,M$ , there exists $\epsilon_{0}(x)$ such that for all $0<\epsilon<\epsilon_{0}(x)$ , the set $\overline{B^{*}(\epsilon,x)}$ is compact, the set $\overline{B(\epsilon,x)}$ is compact a.s., and $\mu[B^{*}(\epsilon+\gamma,x)\setminus B^{*}(\epsilon-\gamma,x)]\rightarrow 0$ as $\gamma\rightarrow 0$ . Let $k=k(l)\rightarrow\infty$ as $l\rightarrow\infty$ , then,

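$$\lim_{\epsilon\rightarrow 0}\lim_{l\rightarrow\infty}\mathbb{E}|\hat{f}_{\text{agg}}(X)-T(\mathbf{f_{k}}(X))|^{2}=0.$$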

Theorem 2

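Under the hypotheses of Theorem 1, if $K$ is a kernel function, bounded from above by $c_{2}<\infty$, that fulfils (5), then

$$\lim_{\epsilon\rightarrow 0}\lim_{l\rightarrow\infty}\mathbb{E}|\tilde{f}_{\text{agg}}(X)-T(\mathbf{f_{k}}(X))|^{2}=0.$$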

Remark 2

1)

Corollary 1 in Einmahl and Mason (2005) proves that if $f$ is uniformly continuous (with some regularity conditions on the kernel $K$), then the multidimensional kernel density estimator converges almost surely, uniformly, by choosing a suitable bandwidth. It is easy to see that this entails the required uniform equicontinuity of the estimators.

2)

Following the same ideas used to prove Theorem 1, it can be proven that $\lim_{\epsilon\rightarrow 0}\lim_{l\rightarrow\infty}\mathbb{E}(\tilde{f}_{\text{agg}}(X)-f(X))^{2}=0$ (see Theorem 2 in Appendix).

If the density $f$ is bounded from below by a positive constant, we have the following direct corollary.

Corollary 1

Under the hypotheses of Theorem 1, if in addition the density $f$ fulfils that there exists $C$ and $A$ such that $0<A\leq f(x)\leq C<\infty$ for all $x$ , then,

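$$\lim_{\epsilon\rightarrow 0}\lim_{l\rightarrow\infty}\mathbb{E}\int_{\mathbb{R}^{d}}|\hat{f}_{\text{agg}}(x)-f(x)|^{2}dx=0.$$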

3.5 A central limit theorem

The following theorem states that a central limit theorem for $\hat{f}_{\text{agg}}(x)$ holds, when the limit is taken first as $k\rightarrow\infty$ and second as $l\rightarrow\infty$ .

Theorem 3

Let $\epsilon=\epsilon_{l}\rightarrow 0$ such that $l\epsilon_{l}^{2}\rightarrow 0$ . Then, for all $x$ such that $f(x)>0$ and

• $\mu(\{y:f(x)=f(y)\})=0$.

• There exists $\epsilon_{0}>0$ such that $\overline{B^{*}(\epsilon^{\prime},x)}$ is compact, and $\overline{B(\epsilon^{\prime},x)}$ is compact a.s., for all $\epsilon^{\prime}<\epsilon_{0}$.

• There exists $\epsilon_{0}>0$ such that $\mu[B^{*}(\epsilon^{\prime}+\gamma,x)\setminus B^{*}(\epsilon^{\prime}-\gamma,x)]\rightarrow 0$ as $\gamma\rightarrow 0$ for all $\epsilon^{\prime}<\epsilon_{0}$.

• $\mu(B^{*}(\epsilon,x))l\rightarrow\infty$ as $l\rightarrow\infty$.

• $f_{m}(x)\rightarrow f(x)$ for all $m=1,\dots,M$ as $k\rightarrow\infty$.

We have,

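$$\lim_{l\rightarrow\infty}\lim_{k\rightarrow\infty}\sqrt{\mu(B^{*}(\epsilon,x))\,l}\,\big[\hat{f}_{\text{agg}}(x)-f(x)\big]\stackrel{d}{=}N(0,f(x)).$$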

Remark 3

The previous theorem depends on the computation of $\mu(B^{*}(\epsilon,x))$, which is in general unknown. However, in some cases it can be estimated by means of a Monte Carlo method, using a uniformly consistent estimator $f_{n}$ of $f$ and a sample of uniformly distributed random variables on a box containing the set $B^{*}(\epsilon,x)$.

For the special case of spherical densities (i.e., $f(x)=h(\|x\|^{2})$ for some $h:\mathbb{R}\rightarrow\mathbb{R}$ ), the limit of $\mu(B^{*}(\epsilon,x))/\epsilon$ can easily be derived, as is proven in the following proposition.

Proposition 2

Let $f$ be a spherical density such that $h$ is strictly decreasing and $h^{\prime}$ is continuous on a neighbourhood containing $\|x\|^{2}$ , then, for all $x$ such that $f(x)>0$ , and $\|\nabla f(x)\|>0$ ,

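$$\lim_{l\rightarrow\infty}\frac{\mu(B^{*}(\epsilon,x))}{2\epsilon}=\frac{2\pi^{d/2}\|x\|^{d-1}}{\Gamma(\frac{d}{2})\|\nabla f(x)\|},$$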

where $\Gamma$ is Euler’s gamma function.
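A quick numerical check of Proposition 2 is possible for the standard bivariate normal, where the set $B^{*}(\epsilon,x)$ is an annulus whose area has a closed form. The sketch below rests on that closed form, which is derived here and is not taken from the text: since $f(y)=(2\pi)^{-1}e^{-\|y\|^{2}/2}$, the band $f(x)-\epsilon<f(y)<f(x)+\epsilon$ has area $2\pi\log\big((f(x)+\epsilon)/(f(x)-\epsilon)\big)$ whenever $0<f(x)-\epsilon$ and $f(x)+\epsilon<1/(2\pi)$. The function names and the choice of point and $\epsilon$ are illustrative.

```python
import math


def normal_density_2d(x):
    """Density of the standard bivariate normal at x."""
    return math.exp(-0.5 * (x[0] ** 2 + x[1] ** 2)) / (2.0 * math.pi)


def exact_band_measure(x, eps):
    """Area of {y : |f(y) - f(x)| < eps} for the standard bivariate normal.

    Assumes 0 < f(x) - eps and f(x) + eps < 1 / (2 pi), so that the set is an annulus.
    """
    f0 = normal_density_2d(x)
    return 2.0 * math.pi * math.log((f0 + eps) / (f0 - eps))


def proposition2_limit(x, grad_norm):
    """Right-hand side of Proposition 2 for a point x in R^d."""
    d = len(x)
    r = math.sqrt(sum(c * c for c in x))
    return 2.0 * math.pi ** (d / 2) * r ** (d - 1) / (math.gamma(d / 2) * grad_norm)


x0 = [0.5, 0.5]
eps = 0.005
# For the standard normal, the gradient norm equals f(x) * ||x||.
grad = normal_density_2d(x0) * math.sqrt(sum(c * c for c in x0))
ratio = exact_band_measure(x0, eps) / (2.0 * eps)
limit = proposition2_limit(x0, grad)
```

For small `eps` the two quantities `ratio` and `limit` should be close, which is what the proposition predicts as $\epsilon\rightarrow 0$.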

4 Models used for the simulations

First, we performed a simulation study to assess, in terms of the mean square error, the proposed aggregation strategy. Second, we evaluate the departure from normality in Theorem 3. Five different distributions were considered:

1

Beta, with density $\left(\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\right)^{d}(x_{1}\cdots x_{d})^{\alpha-1}(1-x_{1})^{\beta-1}\cdots(1-x_{d})^{\beta-1}.$

2

Normal, with mean [math] and diagonal covariance matrix $\Sigma=\text{diag}(\sigma_{1}^{2},\dots,\sigma_{d}^{2})$.

3

Weibull, with density $\left(\frac{k}{\lambda^{k}}\right)^{d}(x_{1}\cdots x_{d})^{k-1}\exp\Big(-\sum^{d}_{i=1}(x_{i}/\lambda)^{k}\Big)$.

4

Convex combination of two bi-variate normal distributions with the same covariance matrix $\Sigma$ : $(1/2)N(\mu_{1},\Sigma)+(1/2)N(\mu_{2},\Sigma)$ where $\Sigma=\Sigma_{1}$ given below.

5

Convex combination of two bi-variate normal distributions: $(1/2)N(\mu_{1},\Sigma_{1})+(1/2)N(\mu_{2},\Sigma_{2})$ where

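$$\Sigma_{1}=\begin{pmatrix}\sigma_{1}^{2}&\rho\\ \rho&\sigma_{2}^{2}\end{pmatrix}\quad\text{and}\quad\Sigma_{2}=\begin{pmatrix}\sigma_{2}^{2}&-\rho\\ -\rho&\sigma_{1}^{2}\end{pmatrix}.$$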

To build the estimator $\hat{f}_{\text{agg}}$ we considered five kernel-based density estimators $f_{k,\gamma_{1}},\dots,f_{k,\gamma_{5}}$ computed with different bandwidths $\gamma_{1},\dots,\gamma_{5}$. The bandwidths were chosen as follows: first we compute the leave-one-out cross validation bandwidth $hcv$ based on a sample of size $k$. This value is kept fixed across the replicates. Then, we fix $\gamma_{1}=0.9\times hcv$, $\gamma_{2}=0.95\times hcv$, $\gamma_{3}=hcv$, $\gamma_{4}=1.05\times hcv$ and $\gamma_{5}=1.1\times hcv$. We choose $k=l=2000$ for $d=2$ and $k=l=4000$ for $d=4$. Let us denote by $hcvu$ the leave-one-out cross validation bandwidth based on the whole sample. The parameter $\epsilon_{l}$ was selected as follows: first, we compute the five kernel-based density estimators $f_{k+l,\tilde{h}_{1}},\dots,f_{k+l,\tilde{h}_{5}}$ based on the whole sample $\mathcal{D}_{k}\cup\mathcal{E}_{l}$, with bandwidths $\tilde{h}_{1}=0.9\times hcvu$, $\tilde{h}_{2}=0.95\times hcvu$, $\tilde{h}_{3}=hcvu$, $\tilde{h}_{4}=1.05\times hcvu$ and $\tilde{h}_{5}=1.1\times hcvu$; we then compute their average, i.e., $\overline{f}(x)=(f_{k+l,\tilde{h}_{1}}+\ldots+f_{k+l,\tilde{h}_{5}})/5.$

Finally, $\epsilon_{l}$ is the value that minimizes $\|\hat{f}_{\text{agg}}-\overline{f}\|_{2}$ (where $\|\cdot\|_{2}$ denotes the $L^{2}$ norm).
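The tuning recipe just described has two mechanical parts: the grid of five bandwidths around a reference value, and the choice of $\epsilon_{l}$ by minimising a distance to the averaged pilot estimator. A minimal sketch follows. The names `bandwidth_grid` and `select_epsilon` are hypothetical, the search over a finite list of candidate values is an assumption (the text does not say how the minimisation is carried out), and the $L^{2}$ norm is replaced by an empirical mean of squared differences over a set of test points. The toy stand-ins only exercise the selection logic.

```python
def bandwidth_grid(h):
    """Five bandwidths around a reference value h: 0.9h, 0.95h, h, 1.05h, 1.1h."""
    return [m * h for m in (0.9, 0.95, 1.0, 1.05, 1.1)]


def select_epsilon(candidates, agg_at, pilot, test_points):
    """Pick the candidate eps minimising the empirical squared L2 distance to the pilot.

    agg_at(eps, x) returns the aggregated estimate at x for a given eps;
    pilot(x) returns the averaged pilot estimate at x.
    """
    def sq_dist(eps):
        return sum((agg_at(eps, x) - pilot(x)) ** 2 for x in test_points) / len(test_points)

    return min(candidates, key=sq_dist)


# Toy stand-ins (not density estimators) so the selection can be run in isolation.
toy_agg = lambda eps, x: x + eps
toy_pilot = lambda x: x + 0.2
best = select_epsilon([0.1, 0.2, 0.3], toy_agg, toy_pilot, [0.0, 0.5, 1.0])
grid = bandwidth_grid(2.0)
```

In practice `agg_at` would wrap the aggregated estimator built from the first sample and `pilot` the average of the five whole-sample kernel estimators.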

The measures $\mu(B(\epsilon_{l},x))$ are computed by a Monte Carlo method using $20000$ and $40000$ uniformly distributed random variables in dimensions $2$ and $4$, respectively. Two different kernels were considered: the Epanechnikov kernel (denoted by E) and the Gaussian kernel (denoted by G). The whole procedure is repeated 100 times. We report $\|\hat{f}_{\text{agg}}-f\|_{2}$, estimated from a test sample uniformly distributed over a rectangle in $\mathbb{R}^{2}$ or $\mathbb{R}^{4}$.

Figure 3 shows the level sets for the density of model 4 (left panel) and model 5 (right panel).

The results in Tables 1 to 4 show that, except for some results in Table 1, the best performance is obtained by the aggregated estimator. Moreover, Table 1 shows that this is also the case in 5 out of 8 models.

To illustrate Theorem 3, we have considered a bi-variate normal distribution with variance $\Sigma=Id$ and mean $(0,0)$. We fixed the point $x$ as $(1/2,1/2)$. The measure $\mu(B^{*}(\epsilon,x))$ is computed exactly from the density; in this case $f(1/2,1/2)=0.1239$ and, for $\epsilon=0.005$, $\mu(B^{*}(\epsilon,x))=0.5088$. We have chosen $l=1000$ and $k=6000$, computed $\sqrt{\mu(B^{*}(\epsilon,x))l}(\hat{f}_{\text{agg}}(x)-f(x))$, and repeated this 1000 times. The estimator $\hat{f}_{\text{agg}}$ was built using $f_{k,hcv}$ (with Gaussian kernel). The density of the $N(0,f(x))$ was estimated using a kernel density estimator, with a univariate Gaussian kernel with bandwidth 0.15. The result is shown in Figure 4 and the summary is given in Table 5. The p-value of the Shapiro-Wilk test is $0.2772$, and that of the Lilliefors test of normality is $0.9978$.

5 Final Remarks

1)

We have proposed a new non-linear aggregation method for density estimation and we have studied its asymptotic properties and limit distribution under quite mild assumptions.

2)

The aggregated estimator behaves better than all of the density estimators used for the aggregation.

3)

We performed a small simulation study, which shows that in all cases the aggregation outperforms the kernel rules built with the sample $\mathcal{D}_{k}$ . In addition, in most of them, it outperforms the kernel rules built with the whole sample $\mathcal{D}_{n}$ .

4)

Our simulations suggest that the second term in (8) is negligible with respect to the first term, but we were not able to prove this point theoretically.

5)

The aggregation method is quite sensitive to the choice of the parameter $\epsilon$; however, as shown in the tables, our recipe seems to work well.

6 Appendix

Proof of Proposition 1

We start by decomposing the objective function,

[TABLE]

Conditionally on $\mathbf{f_{k}}(X)$ and $\mathcal{D}_{n}$, $\hat{f}_{\text{agg}}(X)$ is constant; hence

[TABLE]

From $\sigma(\mathbf{f_{k}}(X),\mathcal{D}_{n})\subset\sigma(\mathbf{f_{k}}(X))$ it follows that $\mathbb{E}\big{[}\mathbb{E}[f(X)|\mathbf{f_{k}}(X)]|\mathbf{f_{k}}(X),\mathcal{D}_{n}\big{]}=\mathbb{E}\big{[}f(X)|\mathbf{f_{k}}(X),\mathcal{D}_{n}\big{]}$; then $\mathbb{E}\big{[}T(\mathbf{f_{k}}(X))-f(X)|\mathbf{f_{k}}(X),\mathcal{D}_{n}\big{]}=0$, which implies that

[TABLE]

Lastly since $\mathbb{E}|T(\mathbf{f_{k}}(X))-f(X)|^{2}=\min_{g}\mathbb{E}[g(\mathbf{f_{k}}(X))-f(X)]^{2}$ , where the minimum is taken over the functions $g$ such that $g(\mathbf{f}_{k}(X))\in L^{2}$ , (8) follows.

Proof of Theorem 1

First, let us bound the second term in (8),

[TABLE]

If we denote

[TABLE]

then

[TABLE]

Conditionally on $\mathbf{f_{k}}(X)$ and $\mathcal{D}_{k}$, the random variable $\sum_{j=1}^{l}\mathbb{I}_{B(\epsilon,X)}(X_{k+j})$ is binomial with $l$ trials and success probability $\mathbb{P}[{X_{k+1}\in B(\epsilon,X)}|\mathbf{f_{k}}(X),\mathcal{D}_{k}]$. Then

[TABLE]

We can bound $P_{X}[{B(\epsilon,X)}|\mathbf{f_{k}}(X),\mathcal{D}_{k}]\leq C\mu(B(\epsilon,X)),$ a.s., where $C=\sup f(x)$ . Since $\{f_{1}\}_{k},\dots,\{f_{M}\}_{k}$ are uniformly equicontinuous, for all $\epsilon>0$ there exists $\delta=\delta(\epsilon)>0$ such that for all $m=1,\ldots,M$ , and all $k$ , $|f_{m}(x)-f_{m}(y)|<\epsilon$ if $\|x-y\|<\delta$ . By hypothesis we can assume that $\delta>\delta_{0}=\delta_{0}(\epsilon)$ . Then, $\mathcal{B}(x,\delta_{0})\subset B(\epsilon,x)$ , from where it follows,

[TABLE]

Regarding $I_{2}$ , observe that,

[TABLE]

To prove that $\lim_{\epsilon\rightarrow 0}\lim_{l\rightarrow\infty}I_{2}=0$, by Lemma 1, it is enough to show that

[TABLE]

Because $f$ is bounded, by the dominated convergence theorem it is enough to prove that

[TABLE]

Let $x$ be such that for all $m=1,\dots,M$, $f_{m}(x)\rightarrow f(x)$. For such $x$ there exists $\epsilon_{0}=\epsilon_{0}(x)$ such that for all $\epsilon<\epsilon_{0}$, the sets $\overline{B(\epsilon,x)}$ are compact a.s., and $\overline{B^{*}(\epsilon,x)}$ is compact; then, by Lemma 2, $\mu(B(\epsilon,x))\rightarrow\mu(B^{*}(\epsilon,x))$ a.s. Using the dominated convergence theorem again, we obtain that, with probability one,

[TABLE]

Lastly, (12) follows from the fact that for all $\epsilon>0$ and for all $t\in B^{*}(\epsilon,X)$ , $|f(t)-f(X)|<\epsilon$ .

Proof of Lemma 1

By Lemma 1.3 in Alonso and Brambila-Paz (1998), it is enough to prove that the sequence of $\sigma$-algebras $\{\sigma(f_{i}(X))\}_{i}$ $\mathbb{P}$-approaches $\sigma(f(X))$; i.e., for all $B\in\sigma(f(X))$ there exists $A_{i}\in\sigma(f_{i}(X))$ such that $\mathbb{P}(A_{i}\triangle B)\rightarrow 0$ as $i\rightarrow\infty$. Since $\sigma(f(X))=\sigma(\{f(X)^{-1}([a,b]),a,b\in\mathbb{R}\})$, it is enough to consider $B=f(X)^{-1}([a,b])$ with $a<b$. Let us consider, for $\epsilon>0$, $B_{i}(\epsilon)=(f_{i}(X))^{-1}([a-\epsilon,b+\epsilon])$ and $H_{i}(\epsilon)=\cap^{\infty}_{j=i}\{\omega:|f_{j}(X(\omega))-f(X(\omega))|<\epsilon\}$. Let $A_{i}=\cap_{r=1}^{\infty}B_{i}(1/r)$; clearly $A_{i}\in\sigma(f_{i}(X))$. For all $\epsilon_{1}>0$,

[TABLE]

Because the sequence of sets $\{B_{i}(1/r)^{c}\cap B\cap H_{i}(\epsilon_{1})\}_{r}$ is increasing as $r$ increases,

[TABLE]

if $1/r<\epsilon_{1}$ ,

[TABLE]

Because the sequence of sets $\{B_{i}(1/r)\cap B^{c}\cap H_{i}(\epsilon_{1})\}$ decreases as $r\rightarrow\infty$ ,

[TABLE]

if $1/r<\epsilon_{1}$ , $B_{i}(1/r)\cap H_{i}(\epsilon_{1})\subset f(X)^{-1}([a-2\epsilon_{1},b+2\epsilon_{1}]),$ then

[TABLE]

Because $\mathbb{P}(f(X)=a)=0=\mathbb{P}(f(X)=b)$ , for all $\delta>0$ there exists $\epsilon_{1}>0$ such that $\mathbb{P}(a-2\epsilon_{1}\leq f(X)<a+\epsilon_{1})+\mathbb{P}(b-\epsilon_{1}<f(X)\leq b+2\epsilon_{1})<\delta.$ By (14) and (13) for all $\delta>0$ , $\mathbb{P}\big{(}(\cap_{r=1}^{\infty}B_{i}(1/r))\triangle B\big{)}\leq\delta+\mathbb{P}(H_{i}(\epsilon_{1})^{c}).$ For all $\epsilon>0$ , $\mathbb{P}(H_{i}(\epsilon)^{c})\rightarrow 0$ as $i\rightarrow\infty$ , from where it follows $\lim_{i\rightarrow\infty}\mathbb{P}\big{(}(\cap_{r=1}^{\infty}B_{i}(1/r))\triangle B\big{)}=0.$

Proof of Lemma 2

Let us fix $x$ such that $\mu[B^{*}(\epsilon+\delta,x)\setminus B^{*}(\epsilon-\delta,x)]\rightarrow 0$ as $\delta\rightarrow 0$ , and $f_{m}(x)\rightarrow f(x)$ for all $m$ . First we will prove that, for all $\delta>0$ , with probability one, for $k$ large enough $B(\epsilon,x)\subset B^{*}(\epsilon+\delta,x)$ . Since $f$ and $f_{m}$ are uniformly continuous on $\overline{B(\epsilon,x)}$ we can take $\mathcal{Q}=\{q_{1},\dots,q_{s}\}\subset B(\epsilon,x)$ such that for all $y\in B(\epsilon,x)$ there exists $q=q(y)\in\mathcal{Q}$ such that $|f(y)-f(q)|<\delta/3$ . Let $y\in B(\epsilon,x)$ and $q=q(y)\in\mathcal{Q}$ such that $|f(y)-f(q)|<\delta/3$ . Then,

[TABLE]

Meanwhile,

[TABLE]

Let $k$ be large enough such that for all $q\in\mathcal{Q}$, $|f(q)-f_{m}(q)|<\delta/3$, and $|f_{m}(x)-f(x)|<\delta/3$. Then, $|f(y)-f(x)|\leq\delta+\epsilon.$

Now let us prove that for all $\delta>0$ such that $\epsilon-3\delta>0$, $B^{*}(\epsilon-\delta,x)\subset B(\epsilon,x)$ a.s., as $k\rightarrow\infty$. Proceeding as before, let us consider $\mathcal{Q^{\prime}}=\{q^{\prime}_{1},\dots,q^{\prime}_{r}\}\subset B^{*}(\epsilon-\delta,x)$ such that for all $y\in B^{*}(\epsilon-\delta,x)$ there exists $q^{\prime}=q^{\prime}(y)\in\mathcal{Q^{\prime}}$ such that $|f_{m}(y)-f_{m}(q^{\prime})|<\delta/3$ for all $m$. Let $y\in B^{*}(\epsilon-\delta,x)$ and $q^{\prime}=q^{\prime}(y)\in\mathcal{Q^{\prime}}$ be such that $|f_{m}(y)-f_{m}(q^{\prime})|<\delta/3$ for all $m$. Then, for all $m\in\{1,\dots,M\}$,

[TABLE]

Let $k$ be large enough such that for all $q^{\prime}\in\mathcal{Q}^{\prime}$, $|f(q^{\prime})-f_{m}(q^{\prime})|<\delta/3$, and $|f_{m}(x)-f(x)|<\delta/3$. Because $q^{\prime}\in B^{*}(\epsilon-\delta,x)$, $|f(x)-f(q^{\prime})|<\epsilon-\delta$, from which it follows that $y\in B(\epsilon,x)$. Lastly, $\mu[B^{*}(\epsilon+\delta,x)\setminus B^{*}(\epsilon-\delta,x)]\rightarrow 0$ implies (9). To prove (10), let $\gamma>0$ and $\kappa$ be small enough such that $P_{X}(B^{*}(\epsilon+\gamma,x))-P_{X}(B^{*}(\epsilon-\gamma,x))<\kappa$. For that $\gamma$, with probability one, we can take $k$ large enough such that $P_{X}(B^{*}(\epsilon-\gamma,x))\leq P_{X}(B(\epsilon,x))\leq P_{X}(B^{*}(\epsilon+\gamma,x)).$

Proof of Theorem 2

Let us denote $K_{\epsilon}(x)=K(x/\epsilon)$ , then,

[TABLE]

Observe that

[TABLE]

If we bound $K_{\epsilon}(x)\leq c_{2}$ , then we get

[TABLE]

Proceeding as in Theorem 1, it is proved that $\lim_{\epsilon}\lim_{l}I_{1}=0$ a.s.

Regarding $I_{2}$ observe that,

[TABLE]

To prove that $\lim_{\epsilon\rightarrow 0}\lim_{l\rightarrow\infty}I_{2}=0$ , by Lemma 1, it is enough to show that

[TABLE]

Because $f$ and $K$ are bounded, by the dominated convergence theorem it is enough to prove that

[TABLE]

Indeed, using the dominated convergence theorem again, together with Lemma 2, we obtain that,

[TABLE]

Meanwhile,

[TABLE]

Lastly, (15) follows from (5) and the fact that for all $t\in B^{*}(\epsilon,X)$, $|f(t)-f(X)|\leq\epsilon$.

Proof of Theorem 3

First, let us prove that

[TABLE]

By Lemma 2 we get that for all fixed $l$ ,

[TABLE]

and

[TABLE]

Lastly, from

[TABLE]

we obtain (16). Let us write

[TABLE]

Since $f(x)>0$ then $\mu(B^{*}(\epsilon,x))<\infty$ for $\epsilon$ small enough. From (17) and (18) together with (19) and $l\epsilon^{2}\rightarrow 0$ , it follows that $\lim_{l}\lim_{k}I_{2}=0$ a.s.

Let us denote by $Y_{1},\dots,Y_{l}$ the random sample in $\mathcal{E}_{l}$ (i.e., $X_{i+k}=Y_{i}$ for $i=1,\dots,l$). From (4) together with (17) and (18),

[TABLE]

Let us denote

[TABLE]

We will use the following version of the central limit theorem for triangular arrays.

Theorem (Lindeberg). Let $Z_{l1},\dots,Z_{ll}$ be independent random variables such that, for all $r=1,\dots,l$, $\mathbb{E}(Z_{lr})=m_{lr}$ and $\text{Var}(Z_{lr})=\sigma_{lr}^{2}<\infty$. Let us denote $V_{l}^{2}=\sum_{j=1}^{l}\sigma_{lj}^{2}.$ If for all $\alpha>0$

[TABLE]

then

[TABLE]
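For reference, the classical Lindeberg-Feller central limit theorem for triangular arrays reads as follows in the notation of the statement above. This is the textbook form; that the paper's displays coincide with it exactly is an assumption, although the indicator appearing later in the proof is consistent with it.

```latex
\frac{1}{V_{l}^{2}}\sum_{j=1}^{l}
  \mathbb{E}\Big[(Z_{lj}-m_{lj})^{2}\,
  \mathbb{I}_{\{|Z_{lj}-m_{lj}|\geq\alpha V_{l}\}}\Big]
  \longrightarrow 0 \quad\text{as } l\rightarrow\infty
\qquad\Longrightarrow\qquad
\frac{1}{V_{l}}\sum_{j=1}^{l}(Z_{lj}-m_{lj})
  \xrightarrow{\;d\;} N(0,1).
```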

Let us consider $\epsilon=\epsilon_{l}\rightarrow 0$ , define for $j=1,\dots,l$ ,

[TABLE]

so

[TABLE]

From $\mu\{y:f(x)=f(y)\}=0$ we get $P_{X}(B^{*}(\epsilon,x))\rightarrow 0$ as $\epsilon\rightarrow 0$ , and then using (19) it follows,

[TABLE]

To prove (20)

[TABLE]

Since $f$ is bounded and $\mu(B^{*}(\epsilon,x))l\rightarrow\infty$, it follows that $\mu(B^{*}(\epsilon,x))V_{l}\rightarrow\infty$. Then, with probability one, for $l$ large enough, $\mathbb{I}_{|\mathbb{I}_{B^{*}(\epsilon,x)}(Y_{j})-P_{X}(B^{*}(\epsilon,x))|\geq\alpha\mu(B^{*}(\epsilon,x))V_{l}}=0$, and (20) follows.

Now from (21), as $l\rightarrow\infty$ ,

[TABLE]

so,

[TABLE]

where $Z\sim N(0,1)$. Then, from (22),

[TABLE]

Proof of Proposition 2

Let us denote by $L(\lambda)=\{y:f(y)>\lambda\}$ the $\lambda$ level set of $f$. Since $f$ is spherical, for all $x\in int(S)$, where $S$ is the support of $f$, we have $\nabla f(x)=2h^{\prime}(\|x\|^{2})x$. Then, using that $h^{\prime}$ is a continuous function on a neighbourhood containing $\|x\|^{2}$, for $\epsilon$ small enough, $\nabla f(x)$ is a continuous function on $B^{*}(\epsilon,x)$. Since $f(x)>0$, $B^{*}(\epsilon,x)$ is bounded and then $f$ is Lipschitz on $B^{*}(\epsilon,x)$ for $\epsilon$ small enough. By Theorem 3.1 in Federer (1959) we can write,

[TABLE]

where $\mathcal{H}_{d-1}$ denotes the $(d-1)$-dimensional Hausdorff measure. Let us prove that $\mathcal{H}_{d-1}(\partial L(\lambda))$ is continuous for all $\lambda$ on a neighbourhood of $f(x)$. Observe that $\partial L(\gamma)=\{y:h(\|y\|^{2})=\gamma\}$ implies $\mathcal{H}_{d-1}(\partial L(\gamma))=\mathcal{H}_{d-1}\big{(}\partial\mathcal{B}(0,\sqrt{h^{-1}(\gamma)})\big{)}$. Since $h$ is strictly decreasing, there exists $h^{-1}$ (which is continuous on a neighbourhood of $\|x\|^{2}$ because $h$ is differentiable) and $\|y\|^{2}=h^{-1}(\gamma)\rightarrow\|x\|^{2}$ as $\gamma\rightarrow f(x)$. By the Mean Value Theorem

[TABLE]

Let us denote $M_{\epsilon}=\sup_{z\in B^{*}(\epsilon,x)}\|\nabla f(z)\|$ and $m_{\epsilon}=\inf_{z\in B^{*}(\epsilon,x)}\|\nabla f(z)\|$ , then from (23)

[TABLE]

Since $h$ is decreasing we get that $B^{*}(\epsilon,x)=\{y:|h(\|y\|^{2})-h(\|x\|^{2})|<\epsilon\}$ decreases (to $\partial L(f(x))$ ) as $\epsilon$ decreases. From the continuity of $h^{\prime}$ at $\|x\|^{2}$ it follows that $M_{\epsilon}=\sup_{z\in B^{*}(\epsilon,x)}\|\nabla f(z)\|=2\sup_{z\in B^{*}(\epsilon,x)}h^{\prime}(\|z\|^{2})\|z\|\rightarrow 2h^{\prime}(\|x\|^{2})\|x\|$ as $l\rightarrow\infty$ . Analogously, $m_{\epsilon}\rightarrow 2h^{\prime}(\|x\|^{2})\|x\|$ . Lastly, from the continuity of $\mathcal{H}_{d-1}(\partial L(\theta))$ and (24) we get that

[TABLE]

where we have used that

[TABLE]
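The continuity of $\lambda\mapsto\mathcal{H}_{d-1}(\partial L(\lambda))$ used in this proof can be made explicit with the standard formula for the surface area of a sphere. The display below is a supplementary remark rather than part of the original argument, and it assumes, as in the proof, that $f(y)=h(\|y\|^{2})$ with $h$ strictly decreasing:

```latex
\mathcal{H}_{d-1}\big(\partial\mathcal{B}(0,r)\big)
  = \frac{2\pi^{d/2}}{\Gamma(d/2)}\,r^{d-1}
\quad\Longrightarrow\quad
\mathcal{H}_{d-1}\big(\partial L(\gamma)\big)
  = \frac{2\pi^{d/2}}{\Gamma(d/2)}\,\big(h^{-1}(\gamma)\big)^{(d-1)/2}.
```

The right-hand side is continuous in $\gamma$ wherever $h^{-1}$ is continuous. For $d=2$ it reduces to the circumference $2\pi\sqrt{h^{-1}(\gamma)}$.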

Bibliography

1. Alonso, A. and Brambila-Paz, F. (1998). $L^{p}$-Continuity of conditional expectations. Journal of Mathematical Analysis and Applications, Vol. 221, pp. 161–176.
2. Biau, G., Fischer, A., Guedj, B. and Malley, J. (2016). COBRA: A combined regression strategy. Journal of Multivariate Analysis, Vol. 146, pp. 18–28.
3. Bellec, P. C. (2017). Optimal exponential bounds for aggregation of density estimators. Bernoulli, Vol. 23(1), pp. 219–248.
4. Bourel, M. and Ghattas, B. (2013). Aggregating density estimators: an empirical study. Open Journal of Statistics, Vol. 3, pp. 334–355.
5. Breiman, L. (1996). Bagging Predictors. Machine Learning, Vol. 24, No. 2, pp. 123–140.
6. Chacón, J. E. and Duong, T. (2018). Multivariate Kernel Smoothing and Its Applications. Chapman and Hall/CRC. ISBN 9781498763011.
7. Cholaquidis, A., Fraiman, R., Kalemkerian, J. and Llop, P. (2016). A nonlinear aggregation type classifier. Journal of Multivariate Analysis, Vol. 146, pp. 269–281.
8. Federer, H. (1959). Curvature measures. Trans. Amer. Math. Soc., Vol. 93, pp. 418–491.