Kernel Density Estimation Bias under Minimal Assumptions
Maciej Skorski

TL;DR
This paper rigorously analyzes the bias in Kernel Density Estimation under minimal assumptions, highlighting the importance of kernel decay and bandwidth eigenvalues for accurate density approximation.
Contribution
It establishes necessary conditions relating kernel decay and bandwidth eigenvalues, and derives explicit bias bounds without overly restrictive assumptions.
Findings
Bias bounds depend on kernel decay and bandwidth eigenvalues.
Insufficient kernel decay can lead to unbounded estimates.
Minimal assumptions suffice for rigorous bias analysis.
Abstract
Kernel Density Estimation is a very popular technique of approximating a density function from samples. The accuracy is generally well-understood and depends, roughly speaking, on the kernel decay and local smoothness of the true density. However concrete statements in the literature are often invoked in very specific settings (simplified or overly conservative assumptions) or miss important but subtle points (e.g. it is common to heuristically apply Taylor's expansion globally without referring to compactness). The contribution of this paper is twofold (a) we demonstrate that, when the bandwidth is an arbitrary invertible matrix going to zero, it is necessary to keep a certain balance between the \emph{kernel decay} and \emph{magnitudes of bandwidth eigenvalues}; in fact, without the sufficient decay the estimates may not be even bounded (b) we give a rigorous derivation of bounds with…
Affiliation: DELL
Abstract
Kernel Density Estimation is a very popular technique for approximating a density function from samples. The accuracy is generally well understood and depends, roughly speaking, on the kernel decay and local smoothness of the true density. However, concrete statements in the literature are often invoked in very specific settings (simplified or overly conservative assumptions) or miss important but subtle points (e.g. it is common to heuristically apply Taylor’s expansion globally without referring to compactness).
The contribution of this paper is twofold:
- (a) we demonstrate that it is necessary to keep a certain balance between the kernel decay and magnitudes of bandwidth eigenvalues; otherwise, regardless of kernel smoothness and moments (!), the estimates are not bounded;
- (b) we give a rigorous derivation of bounds with explicit constants for the bias, under possibly minimal assumptions. This connects the kernel decay, bandwidth norm, bandwidth determinant and (local) density smoothness.
It has been folklore that the issue with Taylor’s formula can be fixed with more complicated assumptions on the density (for example, p. 95 of "Kernel Smoothing" by Wand and Jones); we show that this is actually not necessary and can be handled by the kernel decay alone.
Keywords:
Statistical Learning, Kernel Density Estimation
1 Introduction
1.1 Kernel Density Estimation
Density estimation by convolutions
Density estimation is the fundamental problem of approximating a probability density function $f$ given a set of iid samples $x_1,\ldots,x_n$. The popular approach, called Kernel Density Estimation, uses a convolution of a suitable filter $K$ (called kernel) with the sample distribution $\frac{1}{n}\sum_{i=1}^{n}\delta_{x_i}$. Formally, the KDE estimator is defined by
$$ \hat f(x) = \frac{1}{n}\sum_{i=1}^{n} K(x - x_i) $$
and in this form is credited to Rosenblatt and Parzen [Ros56, Par62]. Usually one uses rescaled versions of a base kernel
$$ K_H(x) = |H|^{-1} K(H^{-1}x) $$
where the scale parameter $H$ is a $d\times d$ invertible matrix called bandwidth and $|H|$ is the matrix determinant (for simplicity one often considers diagonal $H$). Under certain assumptions on the kernel (rapid decay, moments) and the density (smoothness), the KDE estimator is asymptotically consistent, that is $\hat f(x)\to f(x)$ when $n\to\infty$ and $H\to 0$ at a suitable rate. Intuitively, the convergence follows as for $y$ close to $x$ we have $f(y)\approx f(x)$ by the smoothness of $f$, and for larger $\|x-y\|$ the possible bias is penalized by the scaled kernel as $\|H^{-1}(x-y)\|$ is big for small $H$. Specific bounds depend on the kernel and local smoothness of $f$.
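As an illustration of the definitions above, the estimator with a matrix bandwidth can be sketched in a few lines. The Gaussian base kernel, the restriction to dimension two, the sample size and the names `gaussian_kernel` and `kde_2d` are assumptions made for this example only; they are not taken from the paper.

```python
import math
import random


def gaussian_kernel(u):
    """Standard Gaussian base kernel K on R^d (integrates to 1)."""
    d = len(u)
    return math.exp(-0.5 * sum(t * t for t in u)) / (2.0 * math.pi) ** (d / 2.0)


def kde_2d(x, samples, H):
    """Evaluate (1/n) * sum_i K_H(x - x_i), with K_H(z) = K(H^{-1} z) / |det H|.

    H is a 2x2 invertible bandwidth matrix given as ((a, b), (c, e)).
    """
    (a, b), (c, e) = H
    det = a * e - b * c
    if det == 0:
        raise ValueError("bandwidth matrix must be invertible")
    total = 0.0
    for s0, s1 in samples:
        z0, z1 = x[0] - s0, x[1] - s1
        # u = H^{-1} z via the explicit inverse of a 2x2 matrix.
        u0 = (e * z0 - b * z1) / det
        u1 = (-c * z0 + a * z1) / det
        total += gaussian_kernel((u0, u1))
    return total / (len(samples) * abs(det))


# Samples from a standard normal density on R^2 (hypothetical test data).
rng = random.Random(0)
samples = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(5000)]
H = ((0.4, 0.1), (0.0, 0.3))  # non-diagonal, invertible bandwidth
estimate = kde_2d((0.0, 0.0), samples, H)
truth = 1.0 / (2.0 * math.pi)  # true density at the origin
print(estimate, truth)
```

The estimate sits slightly below the true value: smoothing by the kernel flattens the peak of the density, which is the bias discussed in the following paragraphs.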
Estimator Accuracy
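The variance of the estimator is quite easy to compute
$$ \mathrm{Var}\big[\hat f(x)\big] = \frac{1}{n}\left(\int K_H(x-y)^2 f(y)\,dy - \Big(\int K_H(x-y) f(y)\,dy\Big)^{2}\right) $$
and (under some assumptions on $K$ and $f$) is of order $O(n^{-1}|H|^{-1})$ with the hidden constant dependent on $K$ and $f$. In turn, the bias is obtained by exchanging expectation and the convolution integral
$$ \mathbf{E}\hat f(x) - f(x) = \int K_H(x-y) f(y)\,dy - f(x). $$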
Intuitively, it captures by how much the convolution perturbs the density; this in turn depends on how the kernel interacts with the local series expansion of $f$. Expanding $f$ around $x$ and parametrizing $y = x - Hu$ one obtains a series $\mathbf{E}\hat f(x) = \sum_{k\geqslant 0} B_k$ where
$$ B_k = \frac{(-1)^k}{k!}\int K(u)\, D^k f(x)\,(Hu)^{(k)}\,du, $$
the $k$-th derivative $D^k f(x)$ is understood as a $k$-linear map from $(\mathbb{R}^d)^k$ to $\mathbb{R}$ and $(Hu)^{(k)}$ denotes the vector $Hu$ stacked $k$ times; here one needs some assumptions on the kernel and derivatives to guarantee that the integrals exist.
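In general the $k$-th term of this series is $O(\|H\|^{k})$, so one designs the filter to eliminate low-order terms:
- (a) (unit mass) when $\int K(u)\,du = 1$, the bias is of order $O(\|H\|)$;
- (b) (symmetry) if in addition $\int K(u)\,u\,du = 0$, the bias improves to $O(\|H\|^{2})$ (the first-order term is a weighted sum of the terms $\int K(u)\,u_i\,du$, which are zero when $K$ is symmetric).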
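The effect of unit mass and symmetry can be checked numerically in one dimension. The sketch below assumes a Gaussian kernel, a standard normal density and a scalar bandwidth $h$; the midpoint-rule quadrature and the function names are choices made for this illustration. It compares the bias $\int K(u) f(x-hu)\,du - f(x)$ with the second-order term $\frac{h^2}{2} f''(x)\int u^2 K(u)\,du$.

```python
import math


def phi(t):
    """Standard normal density, used both as kernel K and as density f."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def smoothed_density(x, h, f, kernel, lim=10.0, steps=20000):
    """Midpoint-rule approximation of the integral of K(u) f(x - h u) over [-lim, lim]."""
    du = 2.0 * lim / steps
    total = 0.0
    for i in range(steps):
        u = -lim + (i + 0.5) * du
        total += kernel(u) * f(x - h * u)
    return total * du


def bias(x, h):
    """Pointwise bias of the KDE: smoothed density minus true density."""
    return smoothed_density(x, h, phi, phi) - phi(x)


def second_order_term(x, h):
    """(h^2 / 2) f''(x) * (second moment of K); here f'' = (x^2 - 1) phi and the moment is 1."""
    return 0.5 * h * h * (x * x - 1.0) * phi(x)


for h in (0.4, 0.2, 0.1):
    print(h, bias(0.0, h), second_order_term(0.0, h))
```

As $h$ decreases the two columns agree to more digits, consistent with a bias of order $h^2$ for a symmetric kernel of unit mass.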
The best, over the choice of $H$, mean squared error (MSE) then equals (pointwise, for fixed $x$)
$$ \mathrm{MSE}\big(\hat f(x)\big) = O\big(n^{-\frac{4}{4+d}}\big). $$
This improves upon histograms (they have error $O(n^{-\frac{2}{2+d}})$). Cacoullos [Cac64] gives a rigorous derivation of the bias and variance formulas for diagonal $H$.
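The standard pointwise rate $O(n^{-4/(4+d)})$ for symmetric kernels of unit mass can be recovered by a short balancing argument. The sketch assumes a scalar bandwidth $H = hI$ and generic positive constants $C_1, C_2$ (both are assumptions of this illustration, not quantities from the paper): the squared bias is of order $h^4$ and the variance of order $1/(nh^d)$.

```latex
\begin{align*}
\mathrm{MSE}(h) &\approx C_1 h^{4} + \frac{C_2}{n h^{d}}, \\
\frac{\partial}{\partial h}\mathrm{MSE}(h) = 0
  &\iff 4 C_1 h^{3} = \frac{d\, C_2}{n h^{d+1}}
  \iff h^{*} = \Big(\frac{d\, C_2}{4 C_1 n}\Big)^{\frac{1}{4+d}}, \\
\mathrm{MSE}(h^{*}) &= O\big(n^{-\frac{4}{4+d}}\big).
\end{align*}
```

For $d = 1$ this gives $n^{-4/5}$, compared with $n^{-2/3}$ for histograms.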
Better accuracy with higher-order kernels
One can further reduce bias by eliminating more of the expansion terms. Such kernels are also called higher-order kernels and compensate the negative impact of dimension on the variance (curse of dimensionality). If the expansion terms of order $k$ vanish for $k = 1,\ldots,\nu-1$ one says the kernel is of order $\nu$; the bias is of order $O(\|H\|^{\nu})$ which (for the optimized bandwidth) gives the MSE of order $O(n^{-\frac{2\nu}{2\nu+d}})$ [EH09]. Higher-order kernels can be built as products of single-dimension higher-order kernels; the problem of developing one-dimensional filters from Taylor expansions was studied in [MMMY97].
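A concrete higher-order kernel makes the moment conditions tangible. The sketch below uses the well-known Gaussian-based fourth-order kernel $K_4(u) = \frac{1}{2}(3-u^2)\varphi(u)$ as an example (this particular kernel is a choice made here, not one discussed in the paper) and checks its moments by a midpoint rule; `product_kernel` shows the product construction in several dimensions.

```python
import math


def phi(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def k4(u):
    """Fourth-order kernel built from the Gaussian: K4(u) = (3 - u^2) / 2 * phi(u)."""
    return 0.5 * (3.0 - u * u) * phi(u)


def moment(kernel, j, lim=12.0, steps=24000):
    """Midpoint-rule approximation of the j-th moment of a one-dimensional kernel."""
    du = 2.0 * lim / steps
    total = 0.0
    for i in range(steps):
        u = -lim + (i + 0.5) * du
        total += kernel(u) * u ** j
    return total * du


def product_kernel(u_vec):
    """Multivariate kernel obtained as a product of one-dimensional K4 factors."""
    out = 1.0
    for u in u_vec:
        out *= k4(u)
    return out


for j in range(5):
    print(j, moment(k4, j))
```

The moments of order 1, 2 and 3 vanish while the fourth does not, so the kernel has order four. The price is that `k4` takes negative values, so the resulting estimate need not be a density.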
1.2 Contribution of this paper
The fundamental properties of kernel estimators, including bias and variance, are generally well understood. However, the concrete statements in the existing literature are based on various assumptions; sometimes they are overly simplistic, sometimes too conservative, and sometimes important assumptions are ignored. To be specific, we mention a few prominent examples:
- Bandwidth is scalar or diagonal [Cac64], or is given by rescaling a fixed matrix [Jia17, Cac64]
- For second-order kernels, smoothness of the density of order or higher is assumed [ZD13, Cac64]
- Taylor’s expansion is used globally, which suggests that the kernel decay is not needed [YC18, EH09], without referring to compactness, or compactness arguments are taken to hold globally [DUO05]. In fact, without sufficient decay the estimates are not even bounded (we will discuss a general example).
The purpose of this paper is to give rigorous bounds on the bias, under minimal constraints on the bandwidth matrix and the kernel decay. In particular, we discuss what happens when the bandwidth elements go to zero at different rates.
2 Results
2.1 Necessary kernel decay and bandwidth eigenvalues balance
The -th moment of the kernel is defined as . The following construction shows that, to reconstruct the density from its behavior in a fixed neighborhood, the kernel decay and discrepancy of eigenvalues of must be balanced. This is true regardless of smoothness and moments of (note that bounded moments do not imply decay!).
Theorem 2.1 (Lower bound on bias in terms of kernel decay and bandwidth eigenvalues)
For any there exists a radial kernel on which is infinitely differentiable, has finite first moments, and decay rate at infinity not faster than with the following property: over the class of densities with given behavior on the unit ball
[TABLE]
the density estimation at is lower-bounded by
[TABLE]
where are eigenvalues of ordered so that .
Proof
Consider a non-negative "radial" kernel on where is a non-negative real function such that
[TABLE]
for some fixed , the supremum being over integers. For example, let be the standard bump function. Now for some constant consider
[TABLE]
the sum of shifted and rescaled bump functions: the -th component is centered at with the interval width and the spike of magnitude . Clearly is smooth (infinitely differentiable) because each point is covered by finitely many smooth components (actually by at most one). Moreover, has integrable moments up to order
[TABLE]
It is well known that for radial functions it holds (by the spherical parametrization). Therefore defined from our is indeed integrable and has all moments up to ; by manipulating we can normalize the integral to . Note also that is infinitely differentiable in , including at the origin, because in the neighborhood of zero by definition. Now, since and are positive
[TABLE]
The class represents all functions with the same behavior on the unit ball as the function . The maximum of the expression above over this class equals
[TABLE]
Note that the supremum is achieved on the boundary (consider scaling by a scalar ). We have then for some
[TABLE]
This is equivalent to
[TABLE]
We can use the max norm because of equivalence of all vector norms. Let where be the eigenvalues of . Then are eigenvectors of . Let be the vector such that ; it follows that ; since one obtains
[TABLE]
this finishes the proof.
From Equation 9 it is clear that when one needs not only to decay at least as fast as the negative power of (with and the estimate is unbounded) but also to keep some balance between the bandwidth eigenvalues. We note that in [Cac64], for the simpler case of product kernels and diagonal bandwidth, one assumes that where is a positive diagonal matrix and ; this implies that eigenvalues are of comparable order.
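The need for balance can be seen in a toy computation in the spirit of the discussion above; it is not the construction from the proof. The sketch assumes a radial kernel on $\mathbb{R}^2$ with polynomial decay, $K(u) = \frac{s-2}{2\pi}(1+\|u\|^2)^{-s/2}$ with $s > 2$, and a diagonal bandwidth $H = \mathrm{diag}(h_1, h_2)$; both, together with the function names, are hypothetical choices. It evaluates the scaled kernel $K_H(y) = K(H^{-1}y)/|\det H|$ at the point $y = (1, 0)$ on the unit sphere, where a density that is only constrained inside the unit ball may concentrate mass.

```python
import math


def heavy_tail_kernel(u0, u1, s):
    """Radial kernel on R^2 with polynomial decay of order s > 2; integrates to 1."""
    c = (s - 2.0) / (2.0 * math.pi)
    return c * (1.0 + u0 * u0 + u1 * u1) ** (-s / 2.0)


def scaled_kernel_at_unit_distance(h1, h2, s):
    """K_H(y) for H = diag(h1, h2) at y = (1, 0): K(H^{-1} y) / |det H|."""
    return heavy_tail_kernel(1.0 / h1, 0.0, s) / (h1 * h2)


s = 4.0
for h1 in (1e-1, 1e-2, 1e-3):
    balanced = scaled_kernel_at_unit_distance(h1, h1, s)         # h2 = h1
    unbalanced = scaled_kernel_at_unit_distance(h1, h1 ** 4, s)  # h2 much smaller than h1
    print(h1, balanced, unbalanced)
```

With comparable eigenvalues the value behaves like $h_1^{s-2}$ and vanishes as $h_1 \to 0$; with $h_2 = h_1^4$ it behaves like $h_1^{s-5}$ and, for $s = 4$, grows without bound even though the bandwidth goes to zero.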
Remark 1
Note that the kernel in this argument is non-compact, but has moments up to an arbitrary fixed order.
2.2 Multivariate KDE bias under general bandwidths
We give fairly general bounds on the bias below. Note that formulas often cited in the literature, such as p. 95 in [WJ94], are limited to kernels $K$ with compact support. The authors suggest that fixing this can be done at the cost of assuming more on the density (p. 95 in [WJ94]: "the assumptions of the compact support of $K$ can be removed by imposing more complicated conditions on $f$"). We show that extra conditions on $f$ are actually not necessary, and kernels with non-compact support can be handled by the decay.
Theorem 2.2 (General bias formula)
Let be a -th order kernel with bounded moments up to . Suppose that has -th derivatives bounded in a -neighborhood of . Then the remainder in the bias expansion equals
[TABLE]
where are defined in Equation 7 and for any
[TABLE]
where the constant depends only on the chosen norm.
Corollary 1 (Bias under -th order kernels)
If , are as in Theorem 2.2, and
- (a) decays at infinity faster than the negative power of
- (b) [math]
then the remainder is .
Remark 2 (Balance of eigenvalues)
Note that can be easily unbounded (consider a diagonal matrix with different entries). In the opposite direction, by Hadamard’s Inequality [Lan14] we have that is bounded. If are eigenvalues of then and ; thus implies that all eigenvalues are of the same magnitude.
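The remark can be illustrated numerically under one assumption about notation: that the quantity of interest is the ratio $\|H\|^d / |\det H|$ with the spectral norm. In dimension two this ratio equals the quotient of the two singular values of $H$, so it is at least 1 (the direction given by Hadamard's inequality) and is close to 1 only when the singular values are comparable. The helper names below are hypothetical.

```python
import math


def spectral_norm_2x2(H):
    """Largest singular value of a 2x2 matrix ((a, b), (c, e))."""
    (a, b), (c, e) = H
    t = a * a + b * b + c * c + e * e   # sum of squared singular values
    det = a * e - b * c                 # product of singular values, up to sign
    disc = max(t * t - 4.0 * det * det, 0.0)
    return math.sqrt(0.5 * (t + math.sqrt(disc)))


def norm_det_ratio(H):
    """||H||^2 / |det H| in dimension d = 2."""
    (a, b), (c, e) = H
    det = a * e - b * c
    return spectral_norm_2x2(H) ** 2 / abs(det)


for h in (1e-1, 1e-2, 1e-3):
    isotropic = ((h, 0.0), (0.0, h))
    anisotropic = ((h, 0.0), (0.0, h * h))
    print(h, norm_det_ratio(isotropic), norm_det_ratio(anisotropic))
```

For the isotropic bandwidth the ratio stays at 1, while for $\mathrm{diag}(h, h^2)$ it equals $1/h$ and diverges as $h \to 0$.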
Remark 3
One can allow for larger discrepancy between and with faster decay of the kernel.
In the proof we will use the multivariate Taylor formula with the integral remainder form. To get terms up to the -th order we assume that -th derivatives exist and are locally bounded. It might be possible to further weaken the assumptions, e.g. to use the Taylor formula when -th derivatives are absolutely continuous [AD01].
Lemma 1 (Multivariate Taylor’s Formula [Con06, AD01])
Let be a compact convex set in and let have absolutely continuous -th derivatives. Then for any and such that
[TABLE]
where
[TABLE]
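The one-dimensional case of Taylor's formula with integral remainder can be verified numerically. The sketch below assumes the standard form $f(x+h) = \sum_{j<k} \frac{f^{(j)}(x)}{j!}h^j + \frac{h^k}{(k-1)!}\int_0^1 (1-t)^{k-1} f^{(k)}(x+th)\,dt$ and uses $f = \sin$ as a test function; the multivariate statement of the lemma is not reproduced here, and the function names are hypothetical.

```python
import math


def sin_derivative(j, t):
    """j-th derivative of sin evaluated at t."""
    return math.sin(t + j * math.pi / 2.0)


def taylor_with_integral_remainder(x, h, k, steps=20000):
    """Return (Taylor polynomial of degree k - 1, integral remainder) for sin at x + h."""
    poly = sum(sin_derivative(j, x) * h ** j / math.factorial(j) for j in range(k))
    dt = 1.0 / steps
    integral = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt  # midpoint rule on [0, 1]
        integral += (1.0 - t) ** (k - 1) * sin_derivative(k, x + t * h)
    remainder = h ** k / math.factorial(k - 1) * integral * dt
    return poly, remainder


poly, remainder = taylor_with_integral_remainder(0.3, 0.5, 3)
print(poly + remainder, math.sin(0.8))
```

Since the derivatives of $\sin$ are bounded by 1, the remainder is at most $|h|^k/k!$ in absolute value, which is the kind of bound used for the "small" region in the proof below.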
Proof (of Theorem)
We split the convolution integral
[TABLE]
into two regions: and where will depend on . The general strategy is as follows: "big" values of are handled by the decay of , whereas "small" are worked out by the smoothness of . Consider first "big"
[TABLE]
Let . By the properties of the matrix norm . Therefore in the region of integration and since is decreasing. Since , we obtain
[TABLE]
Consider now the case of "small" values of . We assume that so that we can apply the Taylor formula. The main terms are as in Equation 7 and are well defined provided that is absolutely integrable and that exists at . It suffices to consider the remainder. Let
[TABLE]
for any fixed . If we bound this integral uniformly in , let’s say then according to Lemma 1 we will get . Let’s change variables . We have
[TABLE]
By the properties of multilinear maps
[TABLE]
where the constant depends only on the chosen norms. We obtain
[TABLE]
Now if for , we obtain
[TABLE]
where we changed variables and used the norm inequality . Since is integrable, we obtain
[TABLE]
The result follows by combining Equation 13 and Equation 15.
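The role of the kernel decay in the "big" region can be illustrated in one dimension with a scalar bandwidth $h$. After the substitution $y = hu$, the region $|y| > \delta$ becomes $|u| > \delta/h$, and the kernel mass there is what the decay has to control. The sketch compares a Gaussian kernel with a Cauchy kernel $K(u) = \frac{1}{\pi(1+u^2)}$; both kernels and the closed-form tail formulas are choices made for this example, not taken from the proof.

```python
import math


def gaussian_tail_mass(delta, h):
    """Mass of the standard Gaussian kernel outside |u| <= delta / h."""
    return math.erfc(delta / (h * math.sqrt(2.0)))


def cauchy_tail_mass(delta, h):
    """Mass of the Cauchy kernel 1 / (pi (1 + u^2)) outside |u| <= delta / h."""
    return 1.0 - (2.0 / math.pi) * math.atan(delta / h)


delta = 1.0
for h in (0.4, 0.2, 0.1):
    print(h, gaussian_tail_mass(delta, h), cauchy_tail_mass(delta, h), h * h)
```

The Gaussian tail is negligible next to $h^2$, whereas the Cauchy tail behaves like $2h/(\pi\delta)$ and dominates $h^2$ for small $h$; with such slow decay the far region, not the local smoothness of the density, would determine the order of the bias.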
References
- [AD01] G.A. Anastassiou and S.S. Dragomir, On some estimates of the remainder in Taylor’s formula, Journal of Mathematical Analysis and Applications 263 (2001), no. 1, 246–263.
- [Cac64] Theophilos Cacoullos, Estimation of a multivariate density, https://www.ism.ac.jp/editsec/aism/pdf/018_2_0179.pdf, 1964.
- [Con06] Brian Conrad, Higher derivatives and Taylor’s formula via multilinear maps, http://math.stanford.edu/~conrad/diffgeomPage/handouts/taylor.pdf, 2006.
- [DUO05] Convergence rates for unconstrained bandwidth matrix selectors in multivariate kernel density estimation, Journal of Multivariate Analysis 93 (2005), no. 2, 417–433.
- [EH09] Bruce E. Hansen, https://www.ssc.wisc.edu/~bhansen/718/NonParametrics1.pdf, 2009.
- [Jia17] Heinrich Jiang, Uniform convergence rates for kernel density estimation, Proceedings of the 34th International Conference on Machine Learning, vol. 70, PMLR, 2017, pp. 1694–1703.
- [Lan14] Kenneth Lange, Hadamard’s determinant inequality, The American Mathematical Monthly 121 (2014), no. 3, 258–259.
- [MMMY97] Torsten Möller, Raghu Machiraju, Klaus Mueller, and Roni Yagel, Evaluation and design of filters using a Taylor series expansion, IEEE Transactions on Visualization and Computer Graphics 3 (1997), no. 2, 184–199.
