Kernel Density Estimation Bias under Minimal Assumptions

Maciej Skorski

arXiv:1901.00331·math.ST·January 3, 2019

TL;DR

This paper rigorously analyzes the bias in Kernel Density Estimation under minimal assumptions, highlighting the importance of kernel decay and bandwidth eigenvalues for accurate density approximation.

Contribution

It establishes necessary conditions relating kernel decay and bandwidth eigenvalues, and derives explicit bias bounds without overly restrictive assumptions.

Findings

01. Bias bounds depend on kernel decay and bandwidth eigenvalues.

02. Insufficient kernel decay can lead to unbounded estimates.

03. Minimal assumptions suffice for rigorous bias analysis.

Abstract

Kernel Density Estimation is a very popular technique of approximating a density function from samples. The accuracy is generally well-understood and depends, roughly speaking, on the kernel decay and local smoothness of the true density. However, concrete statements in the literature are often invoked in very specific settings (simplified or overly conservative assumptions) or miss important but subtle points (e.g. it is common to heuristically apply Taylor's expansion globally without referring to compactness). The contribution of this paper is twofold: (a) we demonstrate that, when the bandwidth is an arbitrary invertible matrix going to zero, it is necessary to keep a certain balance between the kernel decay and the magnitudes of the bandwidth eigenvalues; in fact, without sufficient decay the estimates may not even be bounded; (b) we give a rigorous derivation of bounds with…

Equations

$$\hat{f}(\mathbf{x}^{\prime})=K_{h}\star f_{\mathcal{D}}(\mathbf{x}^{\prime})=\int K_{h}(\mathbf{x}^{\prime}-\mathbf{x})f_{\mathcal{D}}(\mathbf{x})\,\mbox{d}\mathbf{x}$$

$$K_{h}(\mathbf{u})=|h|^{-1}\cdot K(h^{-1}\mathbf{u})$$

$$\mathsf{Var}(\hat{f}(\mathbf{x}^{\prime}))=|\mathcal{D}|^{-1}\cdot\mathsf{Var}_{\mathbf{x}\sim f}\left[K_{h}(\mathbf{x}^{\prime}-\mathbf{x})\right]$$

$$\mathsf{bias}(\hat{f}(\mathbf{x}^{\prime}))=\mathbb{E}_{\mathcal{D}}\left[K_{h}\star f_{\mathcal{D}}(\mathbf{x}^{\prime})-f(\mathbf{x}^{\prime})\right]=K_{h}\star f(\mathbf{x}^{\prime})-f(\mathbf{x}^{\prime})$$

$$\mu_{k}(\mathbf{x}^{\prime},h)=\begin{cases}f(\mathbf{x}^{\prime})\cdot\left(\int K(\mathbf{u})\,\mbox{d}\mathbf{u}-1\right)&k=0\\ \frac{(-1)^{k}}{k!}\int K(\mathbf{u})D^{k}f(\mathbf{x}^{\prime})((h\mathbf{u})^{(k)})\,\mbox{d}\mathbf{u}&k=1,2,\ldots\end{cases}$$

$$\mathrm{MSE}=O\left(n^{-\frac{4}{4+d}}\right),\quad\|h\|=\Theta\left(n^{-\frac{1}{4+d}}\right)$$

$$\mathcal{F}=\left\{f:\ f(\mathbf{u})=f_{0}(\mathbf{u})\ \forall\mathbf{u}:|\mathbf{u}|\leqslant 1\right\},\quad\text{for any }f_{0}:\ \int_{|\mathbf{u}|\leqslant 1}f_{0}(\mathbf{u})\,\mbox{d}\mathbf{u}<1$$

$$\max_{f\in\mathcal{F}}\left[K_{h}\star f(\mathbf{0})\right]=\Omega(1)\cdot|\lambda_{1}|^{p}\Big/\prod_{i=1}^{d}|\lambda_{i}|$$

$$\sup_{r\in\mathbb{Z}}r^{p}k(r)=\Omega(1)$$

$$k(r)=c\cdot\sum_{n\in\mathbb{Z},\,|n|\geqslant 2}|n|^{-p}\cdot\Psi\left(2|n|^{p+\ell+d+1}(r-n)\right)$$

$$\int|r^{j}k(r)|\,\mbox{d}r\leqslant\sum_{n\in\mathbb{Z},\,|n|\geqslant 2}n^{p+j}\cdot n^{-p-d-\ell-1}<\infty,\quad j=0,\ldots,d+\ell-1$$

$$K_{h}\star f(\mathbf{0})=|h|^{-1}\int K(-h^{-1}\mathbf{u})f(\mathbf{u})\,\mbox{d}\mathbf{u}=\Omega(1)\cdot|h|^{-1}\int_{|\mathbf{u}|>1}|h^{-1}\mathbf{u}|^{-p}f(\mathbf{u})\,\mbox{d}\mathbf{u}$$

$$\max_{f\in\mathcal{F}}|h|^{-1}\int_{|\mathbf{u}|>1}|h^{-1}\mathbf{u}|^{-p}f(\mathbf{u})\,\mbox{d}\mathbf{u}=|h|^{-1}\sup_{|\mathbf{u}|>1}|h^{-1}\mathbf{u}|^{-p}$$

$$K_{h}\star f^{\prime}(\mathbf{0})=\Omega(1)\cdot|h|^{-1}\sup_{|\mathbf{u}|=1}|h^{-1}\mathbf{u}|^{-p}$$

$$K_{h}\star f^{\prime}(\mathbf{0})=\Omega(1)\cdot|h|^{-1}\left(\inf_{|\mathbf{u}|=1}|h^{-1}\mathbf{u}|\right)^{-p}$$

$$K_{h}\star f^{\prime}(\mathbf{0})=\Omega(1)\cdot|\lambda_{1}|^{p}\prod_{i=1}^{d}|\lambda_{i}|^{-1}$$

$$\mathsf{bias}(\hat{f}(\mathbf{x}^{\prime}))-\sum_{i=0}^{k}\mu_{i}(\mathbf{x}^{\prime},h)=R_{k}(\mathbf{x}^{\prime},h)\cdot\|h\|^{2}$$

$$|R_{k}(\mathbf{x}^{\prime},h)|\leqslant 2|h|^{-1}\sup_{|\mathbf{u}|>\delta/\|h\|}K(\mathbf{u})+C\cdot\mu_{K}(k)\cdot\frac{1}{k!}\sup_{|\mathbf{u}|\leqslant\delta}\left\|D^{k}f(\mathbf{x}^{\prime}+\mathbf{u})\right\|$$

$$g(\mathbf{x}+h)=g(\mathbf{x})+\sum_{j=1}^{k}\frac{D^{j}g(\mathbf{x})(h^{(j)})}{j!}+R_{\mathbf{x}}(h)$$

$$R_{\mathbf{x}}(h)=\int_{0}^{1}\frac{(1-t)^{k-1}}{(k-1)!}\left(D^{k}g(\mathbf{x}+t\cdot h)-D^{k}g(\mathbf{x})\right)(h^{(k)})\,\mbox{d}t$$

$$K_{h}\star f(\mathbf{x}^{\prime})-f(\mathbf{x}^{\prime})=\int K(\mathbf{u})\left(f(\mathbf{x}^{\prime}-h\mathbf{u})-f(\mathbf{x}^{\prime})\right)\mbox{d}\mathbf{u}$$

$$I_{1}=|h|^{-1}\int_{|\mathbf{u}|>\delta}K(h^{-1}\mathbf{u})\left(f(\mathbf{x}^{\prime}-\mathbf{u})-f(\mathbf{x}^{\prime})\right)\mbox{d}\mathbf{u}$$

$$I_{1}\leqslant 2|h|^{-1}\psi\left(\delta\|h\|^{-1}\right)$$

$$I_{2}=|h|^{-1}\int_{|\mathbf{u}|\leqslant\delta}K(h^{-1}\mathbf{u})D^{k}f(\mathbf{x}^{\prime}-t\mathbf{u})(\mathbf{u}^{(k)})\,\mbox{d}\mathbf{u}$$

$$I_{2}=t^{-k-d}|h|^{-1}\int_{|\mathbf{v}|\leqslant\delta t}K(t^{-1}h^{-1}\mathbf{v})D^{k}f(\mathbf{x}^{\prime}-\mathbf{v})(\mathbf{v}^{(k)})\,\mbox{d}\mathbf{v}$$

$$D^{k}f(\mathbf{x}^{\prime}-\mathbf{v})(\mathbf{v}^{(k)})=O(1)\cdot\left\|D^{k}f(\mathbf{x}^{\prime}-\mathbf{v})\right\|\cdot|\mathbf{v}|^{k}$$

$$I_{2}\leqslant O(1)\cdot t^{-k-d}|h|^{-1}\int_{|\mathbf{v}|\leqslant\delta t}|\mathbf{v}|^{k}K(t^{-1}h^{-1}\mathbf{v})\left\|D^{k}f(\mathbf{x}^{\prime}-\mathbf{v})\right\|\mbox{d}\mathbf{v}$$

$$I_{2}\leqslant O(1)\cdot B(\delta)\cdot\int_{|h\mathbf{u}|\leqslant\delta}|h\mathbf{u}|^{k}K(\mathbf{u})\,\mbox{d}\mathbf{u}\leqslant O(1)\cdot B(\delta)\cdot\|h\|^{k}\int_{|h\mathbf{u}|\leqslant\delta}|\mathbf{u}|^{k}K(\mathbf{u})\,\mbox{d}\mathbf{u}$$

$$I_{2}\leqslant O(1)\cdot B(\delta)\cdot\mu_{K}(k)\cdot\|h\|^{k}$$

Taxonomy

Topics: Advanced Image Processing Techniques · Sparse and Compressive Sensing Techniques · Image and Signal Denoising Methods

Full text

Kernel Density Estimation Bias under Minimal Assumptions

Maciej Skorski

DELL

Abstract

Kernel Density Estimation is a very popular technique of approximating a density function from samples. The accuracy is generally well-understood and depends, roughly speaking, on the kernel decay and local smoothness of the true density. However, concrete statements in the literature are often invoked in very specific settings (simplified or overly conservative assumptions) or miss important but subtle points (e.g. it is common to heuristically apply Taylor's expansion globally without referring to compactness).

The contribution of this paper is twofold:

(a) we demonstrate that it is necessary to keep a certain balance between the kernel decay and magnitudes of bandwidth eigenvalues; otherwise, regardless of kernel smoothness and moments (!), the estimates are not bounded.

(b) we give a rigorous derivation of bounds with explicit constants for the bias, under possibly minimal assumptions. This connects the kernel decay, bandwidth norm, bandwidth determinant and (local) density smoothness.

It has been folklore that the issue with Taylor's formula can be fixed with more complicated assumptions on the density (for example p. 95 of "Kernel Smoothing" by Wand and Jones); we show that this is actually not necessary and can be handled by the kernel decay alone.

Keywords: Statistical Learning, Kernel Density Estimation

1 Introduction

1.1 Kernel Density Estimation

Density estimation by convolutions

Density estimation is the fundamental problem of approximating a probability density function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ given a set of $n$ iid samples $\mathcal{D}\subset\mathbb{R}^{d}$ . The popular approach, called Kernel Density Estimation, uses a convolution of a suitable filter $K_{h}$ (called kernel) with the sample distribution $f_{\mathcal{D}}=|\mathcal{D}|^{-1}\sum_{\mathbf{x}\in\mathcal{D}}\delta_{\mathbf{x}}$ . Formally, the KDE estimator is defined by

$$\hat{f}(\mathbf{x}^{\prime})=K_{h}\star f_{\mathcal{D}}(\mathbf{x}^{\prime})=\int K_{h}(\mathbf{x}^{\prime}-\mathbf{x})f_{\mathcal{D}}(\mathbf{x})\,\mbox{d}\mathbf{x}$$

and in this form is credited to Rosenblatt and Parzen [Ros56, Par62]. Usually one uses rescaled versions of a base kernel

$$K_{h}(\mathbf{u})=|h|^{-1}\cdot K(h^{-1}\mathbf{u})$$

where the scale parameter $h$ is a $d\times d$ invertible matrix called the bandwidth and $|h|$ is the matrix determinant (for simplicity one often considers diagonal $h$). Under certain assumptions on the kernel (rapid decay, moments) and the density (smoothness), the KDE estimator is asymptotically consistent, that is, when $h\to 0$. Intuitively, the convergence follows because for $\mathbf{x}$ close to $\mathbf{x}^{\prime}$ we have $f(\mathbf{x})\approx f(\mathbf{x}^{\prime})$ by the smoothness of $f$, and for $\mathbf{x}$ farther away the possible bias is penalized by the scaled kernel, as $|h^{-1}(\mathbf{x}^{\prime}-\mathbf{x})|$ is big for small $h$. Specific bounds depend on the kernel and the local smoothness of $f$.
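The estimator with a matrix bandwidth can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes a Gaussian base kernel, and the names `gaussian_kernel` and `kde` are hypothetical. It evaluates the sample average of $K_{h}(\mathbf{x}^{\prime}-\mathbf{x})$ with $K_{h}(\mathbf{u})=|h|^{-1}K(h^{-1}\mathbf{u})$.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian base kernel K on R^d (an assumed choice); u has shape (..., d)."""
    d = u.shape[-1]
    return np.exp(-0.5 * np.sum(u * u, axis=-1)) / (2.0 * np.pi) ** (d / 2.0)

def kde(x_query, samples, h, kernel=gaussian_kernel):
    """Average of K_h(x' - x) over the samples, with K_h(u) = |det h|^{-1} K(h^{-1} u)."""
    x_query = np.atleast_2d(x_query)                      # shape (m, d)
    diffs = x_query[:, None, :] - samples[None, :, :]     # shape (m, n, d)
    flat = diffs.reshape(-1, diffs.shape[-1])             # shape (m*n, d)
    # Apply h^{-1} to every difference vector by solving h z = diff.
    z = np.linalg.solve(h, flat.T).T.reshape(diffs.shape)
    return kernel(z).mean(axis=1) / abs(np.linalg.det(h))

# Usage: standard normal samples in R^2 and a non-diagonal bandwidth matrix.
rng = np.random.default_rng(0)
samples = rng.standard_normal((5000, 2))
h = np.array([[0.4, 0.1], [0.0, 0.3]])
print(kde(np.zeros(2), samples, h)[0])
```

Solving the linear system instead of forming $h^{-1}$ explicitly is a numerical convenience; any invertible $h$ is accepted, diagonal or not.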

Estimator Accuracy

The variance of the estimator is quite easy to compute

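$$\mathsf{Var}(\hat{f}(\mathbf{x}^{\prime}))=|\mathcal{D}|^{-1}\cdot\mathsf{Var}_{\mathbf{x}\sim f}\left[K_{h}(\mathbf{x}^{\prime}-\mathbf{x})\right]$$

and (under some assumptions on $f$) is of order $O(|\mathcal{D}|^{-1}\|h\|^{-d})$ with the hidden constant dependent on $K$. In turn, the bias is obtained by exchanging expectation and the convolution integral

$$\mathsf{bias}(\hat{f}(\mathbf{x}^{\prime}))=\mathbb{E}_{\mathcal{D}}\left[K_{h}\star f_{\mathcal{D}}(\mathbf{x}^{\prime})-f(\mathbf{x}^{\prime})\right]=K_{h}\star f(\mathbf{x}^{\prime})-f(\mathbf{x}^{\prime})$$

Intuitively, it captures by how much the convolution perturbs the density; this in turn depends on how the kernel interacts with the local series expansion of $f$. Expanding $f$ around $\mathbf{x}^{\prime}$ and parametrizing $\mathbf{x}=\mathbf{x}^{\prime}-h\mathbf{u}$ one obtains a series $\mathsf{bias}(\hat{f}(\mathbf{x}^{\prime}))=\sum_{k}\mu_{k}(\mathbf{x}^{\prime},h)$ where

$$\mu_{k}(\mathbf{x}^{\prime},h)=\begin{cases}f(\mathbf{x}^{\prime})\cdot\left(\int K(\mathbf{u})\,\mbox{d}\mathbf{u}-1\right)&k=0\\ \frac{(-1)^{k}}{k!}\int K(\mathbf{u})D^{k}f(\mathbf{x}^{\prime})((h\mathbf{u})^{(k)})\,\mbox{d}\mathbf{u}&k=1,2,\ldots\end{cases}$$

where the $k$-th derivative $D^{k}$ is understood as a $k$-linear map from $(\mathbb{R}^{d})^{k}$ to $\mathbb{R}$ and $\mathbf{v}^{(k)}$ denotes the vector $\mathbf{v}$ stacked $k$ times; here one needs some assumptions on the kernel $K$ and the derivatives $D^{j}f$ to guarantee that the integrals exist.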

In general $\mu_{k}(\mathbf{x}^{\prime},h)=O(\mu_{K}(k)\|h\|^{k})=O(\|h\|^{k})$ , so one designs the filter to eliminate low-order terms:

(a) (unit mass) when $\int K(\mathbf{u})\,\mbox{d}\mathbf{u}=1$, the bias is of order $O(\|h\|)$

(b) (symmetry) if in addition $K(\mathbf{u})=K(-\mathbf{u})$, the bias improves to $O(\|h\|^{2})$. (Footnote 1: $\mu_{1}$ is a weighted sum of terms $\int K(\mathbf{u})\mathbf{u}_{i}\,\mbox{d}\mathbf{u}$, which are zero when $K$ is symmetric.)
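The $O(\|h\|^{2})$ rate for symmetric unit-mass kernels can be checked in a case where the bias has a closed form. The sketch below assumes $d=1$, $f$ the standard normal density and a Gaussian kernel with scalar bandwidth $h$ (these choices, and the function names, are illustrative and not from the paper). In that case $K_{h}\star f$ is the $N(0,1+h^{2})$ density, so the bias is a difference of two normal densities.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gaussian_kde_bias(x, h):
    """Exact bias K_h * f (x) - f(x) for f = N(0,1) and a Gaussian kernel of bandwidth h."""
    return normal_pdf(x, 1.0 + h * h) - normal_pdf(x, 1.0)

# The ratio bias / h^2 should stabilise as h shrinks, which is the O(h^2) behaviour.
for h in (0.2, 0.1, 0.05):
    print(h, gaussian_kde_bias(0.0, h) / h**2)
```

At $x=0$ the ratio approaches $f^{\prime\prime}(0)/2=-f(0)/2\approx-0.199$, the leading second-order term for a kernel with unit second moment.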

The best, over the choice of $h$ , MSE error equals then (pointwise, for fixed $\mathbf{x}^{\prime}$ )

$$\mathrm{MSE}=O\left(n^{-\frac{4}{4+d}}\right),\quad\|h\|=\Theta\left(n^{-\frac{1}{4+d}}\right)$$

This improves upon histograms (they have error $O\left(n^{-\frac{2}{2+d}}\right)$ ). Cacoullos [Cac64] gives a rigorous derivation of the bias and variance formulas for diagonal $h$ .
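To see how the two rates compare as the dimension grows, the exponents can be tabulated directly. This is only arithmetic on the rates quoted above; the function names are hypothetical.

```python
def kde_mse_exponent(d):
    """Exponent a with optimal pointwise MSE = O(n^{-a}) for a second-order kernel."""
    return 4.0 / (4.0 + d)

def histogram_mse_exponent(d):
    """Exponent a with histogram error O(n^{-a})."""
    return 2.0 / (2.0 + d)

def bandwidth_exponent(d):
    """Exponent b with optimal bandwidth norm Theta(n^{-b})."""
    return 1.0 / (4.0 + d)

for d in (1, 2, 5, 10):
    print(d, kde_mse_exponent(d), histogram_mse_exponent(d), bandwidth_exponent(d))
```

Both exponents shrink with $d$, but the KDE exponent stays larger in every dimension.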

Better accuracy with higher-order kernels

One can further reduce the bias by eliminating more of the expansion terms. Such kernels are called higher-order kernels and compensate for the negative impact of the dimension $d$ on the variance (curse of dimensionality). If the property $\mu_{k}=0$ holds for $k=1,\ldots,v-1$ one says the kernel is of order $v$; the bias is then of order $\|h\|^{v}$, which (for the optimized bandwidth) gives an MSE of order $O\left(n^{-\frac{2v}{2v+d}}\right)$ [EH09]. Higher-order kernels can be built as products of one-dimensional higher-order kernels; the problem of developing one-dimensional filters from Taylor expansions was studied in [MMMY97].
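As a concrete one-dimensional example, the kernel $\frac{1}{2}(3-u^{2})\varphi(u)$, with $\varphi$ the standard normal density, is a commonly used fourth-order kernel. It is not taken from the paper; the sketch below only checks numerically that its signed moments of orders 1 to 3 vanish while it keeps unit mass. The helper names and the integration grid are assumptions.

```python
import numpy as np

def phi(u):
    """Standard normal density."""
    return np.exp(-0.5 * u * u) / np.sqrt(2.0 * np.pi)

def k4(u):
    """Fourth-order kernel built from the Gaussian: (3 - u^2)/2 * phi(u)."""
    return 0.5 * (3.0 - u * u) * phi(u)

def signed_moment(kernel, j, lo=-12.0, hi=12.0, num=200001):
    """Riemann-sum approximation of the signed moment: integral of u^j * K(u) du."""
    u = np.linspace(lo, hi, num)
    du = u[1] - u[0]
    return float(np.sum(u**j * kernel(u)) * du)

for j in range(5):
    print(j, round(signed_moment(k4, j), 6))
```

Note that such a kernel necessarily takes negative values, and that the order condition concerns signed moments, whereas $\mu_{K}(k)$ in this paper is an absolute moment.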

1.2 Contribution of this paper

The fundamental properties of kernel estimators, including bias and variance, are generally well understood. However, the concrete statements in the existing literature are based on various assumptions; sometimes they are overly simplistic, sometimes too conservative, and sometimes important assumptions are ignored. To be specific, we mention a few prominent examples:

• Bandwidth is scalar or diagonal [Cac64], or is given by rescaling a fixed matrix [Jia17, Cac64]

• For second-order kernels, smoothness of the density of order $3$ or higher is assumed [ZD13, Cac64]

• Taylor's expansion is used globally, which suggests that the kernel decay is not needed [YC18, EH09], without referring to compactness or taking compact arguments to hold globally [DUO05]. In fact, without sufficient decay the estimates are not even bounded (we will discuss a general example).

The purpose of this paper is to give rigorous bounds on the bias, under minimal constraints on the bandwidth matrix and the kernel decay. In particular, we discuss what happens when the bandwidth elements go to zero at different rates.

2 Results

2.1 Necessary kernel decay and bandwidth eigenvalues balance

The $j$-th moment of the kernel $K$ is defined as $\mu_{K}(j)=\int|\mathbf{u}|^{j}|K(\mathbf{u})|\mbox{d}\mathbf{u}$. The following construction shows that, to reconstruct the density from its behavior in a fixed neighborhood, the kernel decay and the discrepancy of the eigenvalues of $h$ must be balanced. This is true regardless of the smoothness and moments of $K$ (note that bounded moments do not imply decay!).

Theorem 2.1 (Lower bound on bias in terms of kernel decay and bandwidth eigenvalues)

For any $\ell\geqslant 0,p\geqslant 0$ there exists a radial kernel $K$ on $\mathbb{R}^{d}$ which is infinitely differentiable, has finite first $\ell$ moments, and decays at infinity no faster than $\|\mathbf{u}\|^{-p}$, with the following property: over the class of densities $f$ with given behavior on the unit ball

$$\mathcal{F}=\left\{f:\ f(\mathbf{u})=f_{0}(\mathbf{u})\ \forall\mathbf{u}:|\mathbf{u}|\leqslant 1\right\},\quad\text{for any }f_{0}:\ \int_{|\mathbf{u}|\leqslant 1}f_{0}(\mathbf{u})\,\mbox{d}\mathbf{u}<1$$

the density estimate at $\mathbf{0}$ is lower-bounded by

$$\max_{f\in\mathcal{F}}\left[K_{h}\star f(\mathbf{0})\right]=\Omega(1)\cdot|\lambda_{1}|^{p}\Big/\prod_{i=1}^{d}|\lambda_{i}|$$

where $\{\lambda_{i}\}_{i}$ are eigenvalues of $h$ ordered so that $|\lambda_{1}|\geqslant\ldots\geqslant|\lambda_{d}|$ .

Proof

Consider a non-negative "radial" kernel $K(\mathbf{u})=k(|\mathbf{u}|)$ on $\mathbb{R}^{d}$ where $k$ is a non-negative real function such that

$$\sup_{r\in\mathbb{Z}}r^{p}k(r)=\Omega(1)$$

for some fixed $p\geqslant 0$, the supremum being over integers. For example, let $\Psi(r)=\exp(-1/(1-r^{2}))\mathbf{1}_{-1\leqslant r\leqslant 1}(r)$ be the standard bump function. Now for some constant $c$ consider

$$k(r)=c\cdot\sum_{n\in\mathbb{Z},\,|n|\geqslant 2}|n|^{-p}\cdot\Psi\left(2|n|^{p+\ell+d+1}(r-n)\right)$$

the sum of shifted and rescaled bump functions: the $n$-th component is centered at $n$, with interval width $|n|^{-p-d-\ell-1}\leqslant 1$ and a spike of magnitude $|n|^{-p}$. Clearly $k$ is smooth because each point is covered by finitely many smooth components (actually by at most one). Moreover, $k$ has integrable moments up to order $d+\ell-1$

$$\int|r^{j}k(r)|\,\mbox{d}r\leqslant\sum_{n\in\mathbb{Z},\,|n|\geqslant 2}n^{p+j}\cdot n^{-p-d-\ell-1}<\infty,\quad j=0,\ldots,d+\ell-1.$$

It is well-known that for radial functions $K(\mathbf{u})=k(|\mathbf{u}|)$ it holds that $\int|\mathbf{u}|^{j}|K(\mathbf{u})|\mbox{d}\mathbf{u}=O(1)\int_{0}^{\infty}r^{d+j-1}k(r)\mbox{d}r$ (by the spherical parametrization). Therefore $K$ defined from our $k$ is indeed integrable and has all moments up to $\ell$; by manipulating $c$ we can normalize the integral to $1$. Note also that $K(\mathbf{u})=k(|\mathbf{u}|)$ is infinitely differentiable in $\mathbf{u}$, also at $\mathbf{0}$, because $K=0$ in a neighborhood of zero by definition. Now, since $K$ and $f$ are non-negative

$$K_{h}\star f(\mathbf{0})=|h|^{-1}\int K(-h^{-1}\mathbf{u})f(\mathbf{u})\,\mbox{d}\mathbf{u}=\Omega(1)\cdot|h|^{-1}\int_{|\mathbf{u}|>1}|h^{-1}\mathbf{u}|^{-p}f(\mathbf{u})\,\mbox{d}\mathbf{u}$$

The class $\mathcal{F}$ represents all functions with the same behavior on the unit ball as the function $f_{0}$. The maximum of the expression above over this class equals

$$\max_{f\in\mathcal{F}}|h|^{-1}\int_{|\mathbf{u}|>1}|h^{-1}\mathbf{u}|^{-p}f(\mathbf{u})\,\mbox{d}\mathbf{u}=|h|^{-1}\sup_{|\mathbf{u}|>1}|h^{-1}\mathbf{u}|^{-p}$$

Note that the supremum is achieved on the boundary (consider scaling by a scalar $\mathbf{u}:=\lambda\mathbf{u}$). We then have, for some $f^{\prime}\in\mathcal{F}$,

$$K_{h}\star f^{\prime}(\mathbf{0})=\Omega(1)\cdot|h|^{-1}\sup_{|\mathbf{u}|=1}|h^{-1}\mathbf{u}|^{-p}$$

This is equivalent to

$$K_{h}\star f^{\prime}(\mathbf{0})=\Omega(1)\cdot|h|^{-1}\left(\inf_{|\mathbf{u}|=1}|h^{-1}\mathbf{u}|\right)^{-p}$$

We can use the max norm $|\cdot|=|\cdot|_{\infty}$ because of the equivalence of all vector norms. Let $\lambda_{i}$, where $|\lambda_{1}|\geqslant\ldots\geqslant|\lambda_{d}|$, be the eigenvalues of $h$. Then $\lambda_{i}^{-1}$ are the eigenvalues of $h^{-1}$. Let $\mathbf{u}$ be a unit vector such that $h^{-1}\mathbf{u}=\frac{1}{\lambda_{1}}\mathbf{u}$; it follows that $\inf_{|\mathbf{u}|=1}|h^{-1}\mathbf{u}|\leqslant|\lambda_{1}|^{-1}$; since $|h|=\prod_{i=1}^{d}\lambda_{i}$ one obtains

$$K_{h}\star f^{\prime}(\mathbf{0})=\Omega(1)\cdot|\lambda_{1}|^{p}\prod_{i=1}^{d}|\lambda_{i}|^{-1}$$

This finishes the proof.

From Equation 9 (the lower bound of Theorem 2.1) it is clear that when $h\to 0$ one needs not only $K$ to decay at least as fast as $\|\mathbf{u}\|^{-d}$ (with $p<d$ and $\lambda_{1}=\ldots=\lambda_{d}\to 0$ the estimate is unbounded) but also $h$ to keep some balance between the bandwidth eigenvalues. We note that in [Cac64], for the simpler case of product kernels and diagonal bandwidth, one assumes that $h=(\lambda+o(1))\cdot h_{0}$ where $h_{0}$ is a positive diagonal matrix and $\lambda\to 0$; this implies that the eigenvalues are of comparable order.
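The factor $|\lambda_{1}|^{p}/\prod_{i=1}^{d}|\lambda_{i}|$ from Theorem 2.1 is easy to evaluate for concrete bandwidths. The sketch below is illustrative only (the function name is hypothetical, and the $\Omega(1)$ constant is ignored); it shows the two ways the factor can blow up as the bandwidth shrinks.

```python
import numpy as np

def lower_bound_factor(h, p):
    """|lambda_1|^p / prod_i |lambda_i|, with lambda_1 the eigenvalue of h largest in modulus."""
    lam = np.abs(np.linalg.eigvals(h))
    return float(lam.max() ** p / np.prod(lam))

# Isotropic bandwidth in d = 2 with slow decay p = 1 < d: the factor grows like 1/t.
for t in (1e-1, 1e-2, 1e-3):
    print("isotropic", t, lower_bound_factor(t * np.eye(2), 1.0))

# Decay p = 3 > d, but unbalanced eigenvalues t and t^3: the factor is again 1/t.
for t in (1e-1, 1e-2, 1e-3):
    print("unbalanced", t, lower_bound_factor(np.diag([t, t**3]), 3.0))
```

In the first loop the decay is too slow for the dimension; in the second the decay would suffice for balanced eigenvalues, but the discrepancy between $t$ and $t^{3}$ outweighs it.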

Remark 1

Note that the kernel in this argument is non-compact, but has moments up to an arbitrary fixed order.

2.2 Multivariate KDE bias under general bandwidths

We give fairly general bounds on the bias below. Note that formulas often cited in the literature, such as p. 95 in [WJ94], are limited to compact $K$. The authors suggest that this can be fixed at the cost of assuming more on the density $f$. (Footnote 2: p. 95 in [WJ94]: "the assumptions of the compact support of $K$ can be removed by imposing more complicated conditions on $f$".) We show that extra conditions on $f$ are actually not necessary, and kernels with non-compact support can be handled by the decay.

Theorem 2.2 (General bias formula)

Let $K$ be a $k$ -th order kernel with bounded moments up to $k$ . Suppose that $f$ has $k$ -th derivatives bounded in a $\delta$ -neighborhood of $\mathbf{x}^{\prime}$ . Then the remainder in the bias expansion equals

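$$\mathsf{bias}(\hat{f}(\mathbf{x}^{\prime}))-\sum_{i=0}^{k}\mu_{i}(\mathbf{x}^{\prime},h)=R_{k}(\mathbf{x}^{\prime},h)\cdot\|h\|^{2}$$

where $\mu_{k}(\mathbf{x}^{\prime},h)$ are defined in Equation 7 and for any $\delta$

$$|R_{k}(\mathbf{x}^{\prime},h)|\leqslant 2|h|^{-1}\sup_{|\mathbf{u}|>\delta/\|h\|}K(\mathbf{u})+C\cdot\mu_{K}(k)\cdot\frac{1}{k!}\sup_{|\mathbf{u}|\leqslant\delta}\left\|D^{k}f(\mathbf{x}^{\prime}+\mathbf{u})\right\|$$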

where the constant $C$ depends only on the chosen norm.

Corollary 1 (Bias under $k$ -th order kernels)

If $K$ , $f$ are as in Theorem 2.2, $h\to 0$ and

(a) $K$ decays at infinity faster than the negative power of $d$

(b) $\|h\|^{d}/|h|=O(1)$

then the remainder is $o(\|h\|^{k})$ .

Remark 2 (Balance of eigenvalues)

Note that $\|h\|^{d}/|h|$ can easily be unbounded (consider a diagonal matrix with entries of different magnitudes). In the opposite direction, by Hadamard's Inequality [Lan14], $|h|/\|h\|^{d}$ is bounded. If $\sigma_{i}$ are the eigenvalues of $h$ then $\max_{i=1,\ldots,d}|\sigma_{i}|\leqslant\|h\|$ and $|h|=\prod_{i=1}^{d}\sigma_{i}$; thus $\|h\|^{d}/|h|=O(1)$ implies that all eigenvalues are of the same magnitude.
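The ratio $\|h\|^{d}/|h|$ can be computed directly for small examples. The sketch assumes the spectral norm for $\|\cdot\|$ (the paper leaves the norm generic, with constants depending on the choice), and the function name is hypothetical.

```python
import numpy as np

def norm_det_ratio(h):
    """||h||^d / |det h| with the spectral norm (an assumed choice of matrix norm)."""
    d = h.shape[0]
    return float(np.linalg.norm(h, 2) ** d / abs(np.linalg.det(h)))

for t in (1e-1, 1e-2, 1e-3):
    balanced = np.diag([t, 2.0 * t])   # eigenvalues of the same order
    unbalanced = np.diag([t, t**2])    # eigenvalues shrinking at different rates
    print(t, norm_det_ratio(balanced), norm_det_ratio(unbalanced))
```

For the balanced family the ratio stays constant as $t\to 0$, while for the unbalanced one it grows like $1/t$, so condition (b) of Corollary 1 fails.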

Remark 3

One can allow for larger discrepancy between $\|h\|$ and $|h|$ with faster decay of the kernel.

In the proof we will use the multivariate Taylor formula with the integral form of the remainder. To get terms up to the $k$-th order we assume that the $k$-th derivatives exist and are locally bounded. It might be possible to further weaken the assumptions, e.g. to use the Taylor formula when the $(k-1)$-th derivatives are absolutely continuous [AD01].

Lemma 1 (Multivariate Taylor’s Formula [Con06, AD01])

Let $V$ be a compact convex set in $\mathbb{R}^{d}$ and let $g:V\rightarrow\mathbb{R}$ have absolutely continuous $(k-1)$ -th derivatives. Then for any $\mathbf{x}\in V$ and $h$ such that $\mathbf{x}+h\in V$

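$$g(\mathbf{x}+h)=g(\mathbf{x})+\sum_{j=1}^{k}\frac{D^{j}g(\mathbf{x})(h^{(j)})}{j!}+R_{\mathbf{x}}(h)$$

where

$$R_{\mathbf{x}}(h)=\int_{0}^{1}\frac{(1-t)^{k-1}}{(k-1)!}\left(D^{k}g(\mathbf{x}+t\cdot h)-D^{k}g(\mathbf{x})\right)(h^{(k)})\,\mbox{d}t.$$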

Proof (of Theorem)

We split the convolution integral

$$K_{h}\star f(\mathbf{x}^{\prime})-f(\mathbf{x}^{\prime})=\int K(\mathbf{u})\left(f(\mathbf{x}^{\prime}-h\mathbf{u})-f(\mathbf{x}^{\prime})\right)\mbox{d}\mathbf{u}$$

into two regions: $|\mathbf{u}|>\delta$ and $|\mathbf{u}|\leqslant\delta$, where $\delta=\delta(h)$ will depend on $h$. The general strategy is as follows: "big" values of $\mathbf{u}$ are handled by the decay of $K$, whereas "small" ones are handled by the smoothness of $f$. Consider first the "big" values

$$I_{1}=|h|^{-1}\int_{|\mathbf{u}|>\delta}K(h^{-1}\mathbf{u})\left(f(\mathbf{x}^{\prime}-\mathbf{u})-f(\mathbf{x}^{\prime})\right)\mbox{d}\mathbf{u}$$

Let $\psi(r)=\sup_{|\mathbf{u}|>r}K(\mathbf{u})$. By the properties of the matrix norm, $|\mathbf{u}|\leqslant\|h\|\cdot|h^{-1}\mathbf{u}|$. Therefore in the region of integration $|h^{-1}\mathbf{u}|\geqslant|\mathbf{u}|/\|h\|\geqslant\delta\|h\|^{-1}$ and $K(h^{-1}\mathbf{u})\leqslant\psi(\delta\|h\|^{-1})$ since $\psi$ is decreasing. Since $\int|f|=1$, we obtain

$$I_{1}\leqslant 2|h|^{-1}\psi\left(\delta\|h\|^{-1}\right)$$

Consider now the case of ”small” values of $\mathbf{u}$ . We assume that $\delta=o(1)$ so that we can apply the Taylor formula. The main terms $\mu_{k}(\mathbf{x}^{\prime},h)$ are as in Equation 7 and are well defined provided that $|\mathbf{u}|^{j}K(\mathbf{u})$ is absolutely integrable and that $D^{j}f$ exists at $\mathbf{x}^{\prime}$ . It suffices to consider the remainder. Let

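$$I_{2}=|h|^{-1}\int_{|\mathbf{u}|\leqslant\delta}K(h^{-1}\mathbf{u})D^{k}f(\mathbf{x}^{\prime}-t\mathbf{u})(\mathbf{u}^{(k)})\,\mbox{d}\mathbf{u}$$

for any fixed $t\in[0,1]$. If we bound this integral uniformly in $t$, say $|I_{2}|\leqslant M$, then by Lemma 1 the remainder is bounded by $\frac{1}{k!}\cdot 2M$. Changing variables $\mathbf{v}=t\mathbf{u}$, we have

$$I_{2}=t^{-k-d}|h|^{-1}\int_{|\mathbf{v}|\leqslant\delta t}K(t^{-1}h^{-1}\mathbf{v})D^{k}f(\mathbf{x}^{\prime}-\mathbf{v})(\mathbf{v}^{(k)})\,\mbox{d}\mathbf{v}$$

By the properties of multilinear maps

$$D^{k}f(\mathbf{x}^{\prime}-\mathbf{v})(\mathbf{v}^{(k)})=O(1)\cdot\left\|D^{k}f(\mathbf{x}^{\prime}-\mathbf{v})\right\|\cdot|\mathbf{v}|^{k}$$

where the constant $O(1)$ depends only on the chosen norms. We obtain

$$I_{2}\leqslant O(1)\cdot t^{-k-d}|h|^{-1}\int_{|\mathbf{v}|\leqslant\delta t}|\mathbf{v}|^{k}K(t^{-1}h^{-1}\mathbf{v})\left\|D^{k}f(\mathbf{x}^{\prime}-\mathbf{v})\right\|\mbox{d}\mathbf{v}$$

Now if $\left\|D^{k}f(\mathbf{x}^{\prime}-\mathbf{v})\right\|\leqslant B(\delta)$ for $|\mathbf{v}|\leqslant\delta$, we obtain

$$I_{2}\leqslant O(1)\cdot B(\delta)\cdot\int_{|h\mathbf{u}|\leqslant\delta}|h\mathbf{u}|^{k}K(\mathbf{u})\,\mbox{d}\mathbf{u}\leqslant O(1)\cdot B(\delta)\cdot\|h\|^{k}\int_{|h\mathbf{u}|\leqslant\delta}|\mathbf{u}|^{k}K(\mathbf{u})\,\mbox{d}\mathbf{u}$$

where we changed variables $\mathbf{v}=t\cdot h\mathbf{u}$ and used the norm inequality $|h\mathbf{u}|\leqslant\|h\||\mathbf{u}|$. Since $|\mathbf{u}|^{k}K(\mathbf{u})$ is integrable, we obtain

$$I_{2}\leqslant O(1)\cdot B(\delta)\cdot\mu_{K}(k)\cdot\|h\|^{k}$$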

The result follows by combining Equation 13 and Equation 15.

Bibliography

[AD01] G.A. Anastassiou and S.S. Dragomir, On some estimates of the remainder in Taylor's formula, Journal of Mathematical Analysis and Applications 263 (2001), no. 1, 246–263.
[Cac64] Theophilos Cacoullos, Estimation of a multivariate density, https://www.ism.ac.jp/editsec/aism/pdf/018_2_0179.pdf , 1964.
[Con06] Brian Conrad, Higher derivatives and Taylor's formula via multilinear maps, http://math.stanford.edu/~conrad/diffgeom Page/handouts/taylor.pdf , 2006.
[DUO05] Convergence rates for unconstrained bandwidth matrix selectors in multivariate kernel density estimation, Journal of Multivariate Analysis 93 (2005), no. 2, 417–433.
[EH09] Bruce E. Hansen, https://www.ssc.wisc.edu/~bhansen/718/Non Parametrics 1.pdf , 2009.
[Jia17] Heinrich Jiang, Uniform convergence rates for kernel density estimation, Proceedings of the 34th International Conference on Machine Learning, vol. 70, PMLR, 2017, pp. 1694–1703.
[Lan14] Kenneth Lange, Hadamard's determinant inequality, The American Mathematical Monthly 121 (2014), no. 3, 258–259.
[MMMY97] Torsten Möller, Raghu Machiraju, Klaus Mueller, and Roni Yagel, Evaluation and design of filters using a Taylor series expansion, IEEE Transactions on Visualization and Computer Graphics 3 (1997), no. 2, 184–199.
