Consistent Kernel Density Estimation with Non-Vanishing Bandwidth
Efrén Cruz Cortés, Clayton Scott

TL;DR
This paper introduces a fixed-bandwidth kernel density estimator (fbKDE) that remains consistent for estimating continuous densities, challenging the traditional requirement of bandwidth shrinking with sample size, and demonstrates its effectiveness through theoretical analysis and experiments.
Contribution
The paper proposes the fbKDE, a novel fixed-bandwidth estimator that guarantees consistency and provides convergence rates, expanding the scope of kernel density estimation methods.
Findings
fbKDE is consistent for continuous square-integrable densities.
Provides convergence rates for radial and box kernels.
Experimental results favor fbKDE over standard and variable bandwidth KDEs.
Abstract
Consistency of the kernel density estimator requires that the kernel bandwidth tends to zero as the sample size grows. In this paper we investigate the question of whether consistency is possible when the bandwidth is fixed, if we consider a more general class of weighted KDEs. To answer this question in the affirmative, we introduce the fixed-bandwidth KDE (fbKDE), obtained by solving a quadratic program, and prove that it consistently estimates any continuous square-integrable density. We also establish rates of convergence for the fbKDE with radial kernels and the box kernel under appropriate smoothness assumptions. Furthermore, in an experimental study we demonstrate that the fbKDE compares favorably to the standard KDE and the previously proposed variable bandwidth KDE.
Table 1: Performance of the fbKDE, KDE, and vKDE with parameters chosen by rule of thumb and by cross-validation (CV). Datasets with two rows report two error measures.

| Dataset | fbKDE (rule of thumb) | KDE (rule of thumb) | vKDE (rule of thumb) | fbKDE (CV) | KDE (CV) | vKDE (CV) |
|---|---|---|---|---|---|---|
| Bimodal | -1.0180 | -0.8110 | -0.8765 | -1.0660 | -0.9785 | -1.0413 |
| | 0.2287 | 1.0141 | 0.8468 | 0.1865 | 0.5534 | 0.2782 |
| Triangular | -1.2889 | -1.2897 | -1.2958 | -1.2332 | -1.2095 | -1.2073 |
| | 1.0121 | 1.0200 | 1.0923 | 1.1599 | 1.1437 | 1.2242 |
| Trimodal | -0.3317 | -0.2919 | -0.3025 | -0.3457 | -0.3379 | -0.3456 |
| | 0.2335 | 0.4571 | 0.4156 | 0.1212 | 0.1879 | 0.1032 |
| Kurtotic | -0.5444 | -0.4735 | -0.5271 | -0.5831 | -0.5647 | -0.5894 |
| | 0.2800 | 0.8122 | 0.5540 | 0.1379 | 0.3902 | 0.1181 |
| Banana | -0.0838 | -0.0821 | -0.0839 | -0.0821 | -0.0837 | -0.0853 |
| Ringnorm | 2.4E-09 | -2.3E-10 | -2.7E-10 | -1.7E-10 | -3.2E-10 | -3.5E-10 |
| Thyroid | -0.0932 | -0.0448 | -0.1415 | -0.2765 | -0.2514 | -0.2083 |
| Diabetes | -1.4E-05 | -0.0004 | -9.8E-04 | -0.0010 | -0.0007 | -0.0010 |
| Waveform | 1.5E-09 | -1.2E-11 | 1.25E-11 | -2.1E-12 | -1.2E-11 | -1.25E-11 |
| Iris | 0.0166 | -0.0204 | 0.0058 | -0.0102 | 0.0027 | 0.0777 |
Table 2: Errors for the bimodal density as the sample size grows, with parameters chosen by rule of thumb. Column headers are sample sizes.

| Error | Estimator | 50 | 250 | 450 | 1050 | 1650 | 1850 | 2050 |
|---|---|---|---|---|---|---|---|---|
| error | fbKDE | 0.7046 | 0.5847 | 0.4836 | 0.3549 | 0.1807 | 0.1642 | 0.1761 |
| | KDE | 1.1341 | 1.0567 | 1.1451 | 1.0833 | 0.9684 | 0.9420 | 0.9160 |
| | vKDE | 1.0287 | 0.8811 | 1.0459 | 0.9670 | 0.8106 | 0.7562 | 0.7300 |
| error | fbKDE | -0.8985 | -0.9623 | -0.7487 | -1.0788 | -0.9787 | -1.0493 | -0.9722 |
| | KDE | -0.7099 | -0.7639 | -0.6859 | -0.8220 | -0.8284 | -0.8728 | -0.8277 |
| | vKDE | -0.7763 | -0.8284 | -0.7091 | -0.8793 | -0.8782 | -0.9372 | -0.8839 |
Efrén Cruz Cortés
Electrical Engineering and Computer Science
University of Michigan
Clayton Scott
Electrical Engineering and Computer Science
University of Michigan
Contact first author at [email protected] for further questions about this work.
This work was supported in part by NSF grant 1422157.
1 Introduction
Given an iid sample $X_1, \dots, X_n \in \mathbb{R}^d$ drawn according to a probability density $f$, the kernel density estimator is
$$\hat{f}_{KDE}(x) = \frac{1}{n} \sum_{i=1}^{n} k_\sigma(x, X_i),$$
where $k_\sigma$ is a kernel function with parameter $\sigma$. Examples of kernels are functions of the form , where for all . A common kernel for density estimation is the Gaussian kernel $k_\sigma(x, x') = (2\pi\sigma^2)^{-d/2} \exp\left(-\|x - x'\|^2 / (2\sigma^2)\right)$. Since its inception by Fix and Hodges [1] and development by Rosenblatt and Parzen [2, 3], the KDE has found numerous applications across a broad range of quantitative fields, and has also been the subject of extensive theoretical investigations, spawning several books (see, e.g., [4, 5, 6, 7]) and hundreds of research articles.
A strength of the KDE is that it makes few assumptions about $f$ and that it is consistent, meaning that it converges to $f$ as $n \to \infty$ [8]. Analysis of the KDE stems from the following application of the triangle inequality in some norm of interest, where $*$ denotes convolution: $\|\hat{f}_{KDE} - f\| \le \|\hat{f}_{KDE} - f * k_\sigma\| + \|f * k_\sigma - f\|$. Critical to the analysis of the KDE is the dependence of the bandwidth parameter $\sigma$ on $n$. The first term tends to zero provided $n\sigma^d \to \infty$, i.e., the number of data points per unit volume tends to infinity. This is shown using properties of convolutions (since $\hat{f}_{KDE}$ and $f * k_\sigma$ may be viewed as convolutions of the kernel with the empirical and true distributions, respectively) and concentration of measure. For the latter term to tend to zero, $\sigma \to 0$ is necessary, so that the kernel behaves like a Dirac measure. With additional assumptions on the smoothness of $f$, the optimal growth of $\sigma$ as a function of $n$ can be determined.
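The role of the bandwidth in the standard KDE can be illustrated numerically. The following is a minimal sketch, assuming one-dimensional data, a normalized Gaussian kernel, and a standard normal target density; the names `gaussian_kernel` and `kde` and the sample size are illustrative choices, not part of the paper.

```python
import numpy as np

def gaussian_kernel(x, centers, sigma):
    """Normalized 1-D Gaussian kernel, evaluated for all pairs (x_i, center_j)."""
    diff = x[:, None] - centers[None, :]
    return np.exp(-0.5 * (diff / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def kde(x, sample, sigma):
    """Standard KDE: the average of kernels centered at the sample points."""
    return gaussian_kernel(x, sample, sigma).mean(axis=1)

rng = np.random.default_rng(0)
sample = rng.normal(size=500)            # hypothetical data: standard normal draws
grid = np.linspace(-6.0, 6.0, 1201)
dx = grid[1] - grid[0]
true_density = np.exp(-0.5 * grid ** 2) / np.sqrt(2.0 * np.pi)

for sigma in (1.0, 0.3):
    estimate = kde(grid, sample, sigma)
    sup_error = np.max(np.abs(estimate - true_density))
    print(f"sigma={sigma}: sup error on grid = {sup_error:.4f}")
```

With the bandwidth held fixed, the estimate targets the smoothed density (the convolution of the kernel with the true density) rather than the density itself, which is the bias term in the triangle inequality above.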
The choice of bandwidth determines the performance of the KDE, and automatically selecting the optimal kernel bandwidth remains a difficult problem. Thus, researchers have developed numerous plug-in rules and cross-validation strategies, all of which are successful in some situations. A recent survey counts no fewer than 30 methods in the literature and cites 6 earlier review papers on the topic [9].
As an alternative to the standard KDE, some authors have investigated weighted KDEs, which have the form $\hat{f}_w(x) = \sum_{i=1}^{n} w_i k_\sigma(x, X_i)$. The weight vector $w = (w_1, \dots, w_n)$ is learned according to some criterion, and such weighted KDEs have been shown to yield improved performance over the standard KDE in certain situations [10, 11, 12, 13]. Consistency of such estimators has been investigated, but still under the assumption that $\sigma \to 0$ with $n$ [14, 15].
In this work we consider the question of whether it is possible to learn the weights of a weighted KDE such that the resulting density estimator is consistent, for a broad class of $f$, where the bandwidth $\sigma$ remains fixed as $n \to \infty$. This question is of theoretical interest, given that all prior work establishing consistency of a KDE, to our knowledge, requires that the bandwidth shrinks to zero. The question is also of practical interest, since such a density estimator could potentially be less sensitive to the choice of bandwidth than the standard KDE, which, as mentioned above, is the main factor limiting the successful application of KDEs in practice.
In Section 2 below, we introduce a weighted KDE that we refer to as the fixed-bandwidth KDE (fbKDE). Its connection to related work is given in Section 3. The theoretical properties of this estimator, including consistency and rates of convergence with a fixed bandwidth, are established in Section 4. Our analysis relies on reproducing kernel Hilbert spaces, a common structure in machine learning that has seldom been used to understand KDEs. In Section 5, a simulation study is conducted to compare the fbKDE against the standard KDE and another weighted KDE from the literature. Our results indicate that the fbKDE is a promising alternative to these other methods of density estimation.
2 Fixed bandwidth KDE
We start by assuming access to iid data sampled from an unknown distribution with density , having support contained in the known set , and with dominating measure . The set is either compact, in which case is taken to be the Lebesgue measure, or , in which case is a known finite measure. We study a weighted kernel density estimator of the form
[TABLE]
where , , and is sampled iid from a known distribution with density . Here, is chosen to ensure , but not necessarily in . Note that is defined on and on . Throughout, is taken to be an ball in , that is for . The reason for centering the kernels at instead of is that to accurately approximate with a fixed bandwidth, we might need centers outside the support of .
To measure the error between to we may consider the difference, where is the space of equivalence classes of square integrable functions, and where (using both for the function and its equivalence class) . However, we do not know the set and cannot compute said difference. Hence, we consider the difference from to .
To determine the scaling coefficients we consider minimizing over . Since , the term is independent of , and is zero outside , we can focus on minimizing
[TABLE]
Define . Since we don’t know , we don’t know the true form of and . However, the terms are expectations with respect to so we can estimate the term using the available data . We use the leave-one-out estimator
[TABLE]
With the aid of , we define the function
[TABLE]
to which we have access. Let and
[TABLE]
The estimator
[TABLE]
is referred to as the fixed bandwidth kernel density estimator (fbKDE), and is the subject of our study. In the following sections this estimator is shown to be consistent for a fixed kernel bandwidth under certain conditions on and .
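The construction in this section can be sketched in code. The block below is a rough illustration under several assumptions that are not confirmed by the text as extracted here: one-dimensional data, a normalized Gaussian kernel, Lebesgue measure on the real line (so the Gram integral has a closed form), kernel centers obtained by adding small Gaussian noise to the data, and an $\ell^1$-ball constraint of hypothetical radius `radius`. The solver is plain projected gradient descent with a sort-based $\ell^1$-ball projection, swapped in for simplicity; it is not the ADMM procedure used in the experiments.

```python
import numpy as np

def gauss(a, b, s):
    """Normalized 1-D Gaussian kernel with bandwidth s for all pairs (a_i, b_j)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / s) ** 2) / (np.sqrt(2.0 * np.pi) * s)

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {a : ||a||_1 <= radius} (sort-based)."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * k > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def fit_fbkde(x, z, sigma, radius, n_iter=2000):
    """Minimize a' G a - 2 a' h over the l1 ball by projected gradient descent."""
    n = len(x)
    # Gram matrix: the integral of k(., z_i) k(., z_j) over the real line is a
    # Gaussian kernel with bandwidth sqrt(2) * sigma.
    G = gauss(z, z, np.sqrt(2.0) * sigma)
    # Leave-one-out estimate of E[k(X, z_i)]: drop the sample paired with z_i.
    K = gauss(z, x, sigma)
    h = (K.sum(axis=1) - np.diag(K)) / (n - 1)
    step = 1.0 / (2.0 * np.linalg.eigvalsh(G)[-1])   # 1 / Lipschitz constant
    alpha = np.full(n, 1.0 / n)                      # start from KDE weights
    for _ in range(n_iter):
        grad = 2.0 * (G @ alpha) - 2.0 * h
        alpha = project_l1_ball(alpha - step * grad, radius)
    return alpha, G, h

rng = np.random.default_rng(0)
x = rng.normal(size=200)                  # hypothetical data
z = x + 0.1 * rng.normal(size=200)        # hypothetical perturbed centers
sigma, radius = 0.5, 5.0                  # illustrative values only
alpha, G, h = fit_fbkde(x, z, sigma, radius)

def objective(a):
    return a @ G @ a - 2.0 * (a @ h)

alpha0 = np.full(len(x), 1.0 / len(x))
print("objective at uniform weights:", objective(alpha0))
print("objective at fitted weights: ", objective(alpha))
print("number of negative weights:  ", int((alpha < 0).sum()))
```

The fitted estimator would be evaluated on a grid as `gauss(grid, z, sigma) @ alpha`. Because the weights may be negative, the estimate is not guaranteed to be nonnegative.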
3 Related work
The use of the norm as an objective function for kernel density estimation is not new, and in fact, choosing to minimize , with , is the so-called least-squares leave-one-out cross-validation technique for bandwidth selection. Minimizing subject to the constraints and was proposed by [10], and later rediscovered by [11], who also proposed an efficient procedure for solving the resulting quadratic program, and compared the estimator to the KDE experimentally. This same estimator was later analyzed by [14], who established an oracle inequality and consistency under the usual conditions for consistency of the KDE.
Weighted KDEs have also been developed as a means to enhance the standard KDE in various ways. For example, a weighted KDE was proposed in [12] as a form of multiple kernel learning, where, for every data point, multiple kernels of various bandwidths were assigned, and the weights optimized using . A robust kernel density estimator was proposed in [13], by viewing the standard KDE as a mean in a function space, and estimating the mean robustly. To improve the computational efficiency of evaluating a KDE, several authors have investigated sparse KDEs, learned by various criteria [16, 17, 18, 19, 20].
The one-class SVM has been shown to converge to a truncated version of in the norm [21]. If the truncation level (determined by the SVM regularization parameter) is high enough, and the density is bounded, then is consistently estimated. An ensemble of kernel density estimators is studied by [15], who introduce aggregation procedures such that a weighted combination of standard KDEs of various bandwidths performs as well as the KDE whose bandwidth is chosen by an oracle.
In the above-cited work on weighted KDEs, whenever consistency is shown, it assumes a bandwidth tending to zero. Furthermore, the weights are constrained to be nonnegative. In contrast, we allow the weights on individual kernels to be negative, and this enables our theoretical analysis below. Finally, we remark that the terms “fixed” or “constant” bandwidth have been used previously in the literature to refer to a KDE where each data point is assigned the same bandwidth, as opposed to a “variable bandwidth” KDE where each data point might receive a different bandwidth. We instead use “fixed bandwidth” to mean the bandwidth remains fixed as the sample size grows.
4 Theoretical Properties of the fbKDE
Notation
The space is the set of functions on for which the power of their absolute value is -integrable over . is the set of equivalence classes of , where two functions and are equivalent if . The symbol will denote both the Euclidean norm in as well as the norm; which is used will be clear from the context, as the elements of will be denoted by letters towards the end of the alphabet (, and ). The set denotes the space of continuous functions on . Finally, the support of any function is denoted by .
We call a function a symmetric positive definite kernel (SPD) [22] if is symmetric and satisfies the property that for arbitrary and the matrix is positive semidefinite. Every SPD kernel has an associated Hilbert space of real-valued functions, called the reproducing kernel Hilbert space (RKHS) of , which we will denote by when and are clear from the context, and for which . SPD kernels also exhibit the reproducing property, which states that for any , . By a radial SPD kernel we mean an SPD kernel for which there is a strictly monotone function such that for any . Note that the Gaussian kernel is a radial SPD kernel.
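The positive semidefiniteness condition can be checked numerically for a given kernel and point set. The sketch below assumes an unnormalized Gaussian kernel on $\mathbb{R}^2$ with an arbitrary bandwidth and randomly drawn points; it is a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(size=(50, 2))        # 50 arbitrary points in R^2
sigma = 0.7                              # illustrative bandwidth

# Gram matrix of the (unnormalized) Gaussian kernel on the chosen points.
sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
gram = np.exp(-sq_dists / (2.0 * sigma ** 2))

# Positive semidefiniteness: all eigenvalues are nonnegative up to round-off.
eigenvalues = np.linalg.eigvalsh(gram)
print("smallest eigenvalue:", eigenvalues[0])

# Consequence of the reproducing property and Cauchy-Schwarz:
# |k(x, x')| <= sqrt(k(x, x) * k(x', x')).
diag = np.diag(gram)
cs_bound = np.sqrt(np.outer(diag, diag))
print("Cauchy-Schwarz bound holds:", bool(np.all(np.abs(gram) <= cs_bound + 1e-12)))
```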
If is a radial SPD kernel then for some . This holds because by the reproducing property and Cauchy-Schwarz, and .
We will make use of the following assumptions:
A0
The set is either a compact subset of , in which case is taken to be the Lebesgue measure, or , in which case is a known finite measure.
This assumption holds throughout the paper. It will not be explicitly stated in the statements of results below, but will be recalled in the proofs when needed.
A1
are sampled independently and identically distributed (iid) according to . are sampled iid according to . Furthermore are independent.
Given as in A1, we define for . This notation will be kept throughout the paper.
A2
The kernel is radial and SPD, with for all . Furthermore, is Lipschitz, that is, there exists such that holds for all , .
Recall from eqn. (3) that is the minimizer of over . To show the consistency of the overall approach of the following sections will be to show that is close to with high probability, and then show that (and therefore ) can be made arbitrarily small for optimal choice of . We start by stating an oracle inequality relating and .
Lemma 1.
Let . Let satisfy assumption A1. Let satisfy and be as in Equation (1). Let . With probability ,
[TABLE]
∎
The proof consists of showing the terms of concentrate around the terms of , and is placed in Section 7. This result allows us to focus on the term , which we proceed to do in the following sections.
4.1 Consistency of
We state the consistency of for radial SPD kernels:
Theorem 1.
Let . Let satisfy A1, where and . Let satisfy A2 and be as in Equation (4). If is such that but as , then
[TABLE]
as . ∎
The probability is the joint probability of . The sketch of the proof is as follows. To analyze the term from Lemma 1, we use the fact that is dense in [22] in the sup-norm sense and that is dense in in the norm sense [23]. That is, there exists a function arbitrarily close to and a function arbitrarily close to . The function has the form
[TABLE]
where . Note this is an abuse of notation since the functions and do not have the same centers nor necessarily the same number of components. By the triangle inequality we have for any in :
[TABLE]
By the above denseness results, the first two terms are small. To make the third term small we need two things: that is large enough so that there is an matching , and that there exist centers of that are close to the centers of , which will be true with a certain probability. In Section 7 we will prove all these approximations and show that the relevant probability is indeed large and approaches one.
4.2 Convergence rates of for SPD radial kernels
The rates for radial SPD kernels may be slow, since these kernels are "universal" in that they can approximate arbitrary functions in . To get better rates, we can make stronger assumptions on . Thus, let , that is, the space of densities expressible as -smoothed functions. Then we obtain the following convergence rates.
Theorem 2.
Let . Let , satisfy A2, satisfy A1 with , and . Let be as in Equation (4). If and , then with probability
[TABLE]
If and for C an arbitrary constant in , then with probability
[TABLE]
∎
The symbol indicates the first term is bounded by a positive constant (independent of and ) times the second term, and means they grow at the same rate. Note the condition is satisfied, for example, if is compact and is Gaussian. The first step in proving Theorem 2 is just a reformulation of the oracle inequality from Lemma 1:
Lemma 2.
Let . Let satisfy assumption A1. Let satisfy assumption A2 and be as in Equation (1). Then with probability
[TABLE]
where . ∎
Now, for the term in Lemma 2 we introduce the function as in eqn (5) and make the following decomposition, valid for any : The following lemma concerns the term , and is taken from [24]:
Lemma 3.
Let . For any there are m points and coefficients such that
[TABLE]
where for some absolute constant and where is the VC-dimension of the set . ∎
The VC dimension of a set is the maximum number of points that can be separated arbitrarily by functions of the form , . For radial kernels, (see [25], [26]).
Now let and . For the remaining term we will proceed as in the proof of Theorem 1. That is, we need that for all there is a point close to it, and then we can approximate with .
Lemma 4.
Let , let , and be as above. Let satisfy assumption A1. With probability
[TABLE]
where , for a constant independent of and . ∎
Putting Lemmas 2, 3, and 4 together and choosing for an appropriate (the exact form is shown in Section 7) we obtain the proof of Theorem 2.
4.3 Convergence rates for the box kernel
While the previous theorem considered radial SPD kernels, the oracle inequality applies more generally, and in this section we present rates for a nonradial kernel, the box kernel. We assume and that the kernel centers are predetermined according to a uniform grid. Precise details are given in the proof of Theorem 3. Thus the fbKDE centered at is where the weights are learned in the same way as before, the only change being the kernel centers. Let , where are the Lipschitz functions on , and let be the Lipschitz constant of . Also, let be the box kernel defined on , and for simplicity assume for a positive integer. The following theorem is proved in Section 7.
Theorem 3.
Let , , satisfy A1, and . Let . With probability at least
[TABLE]
∎
As with the previous results, the stochastic error is controlled via the oracle inequality. Whereas the preceding results leveraged known approximation properties of radial SPD kernels, in the case of the box kernel we give a novel argument for approximating Lipschitz functions with box kernels having a non-vanishing bandwidth.
5 Experimental Results
We now show the performance of the fbKDE as compared to the KDE and variable bandwidth KDE. The variable bandwidth KDE ([27]), which we refer to as the vKDE, has the form where each data point has an individual bandwidth . [27] proposes to use where is the standard KDE’s bandwidth parameter and is the geometric mean of .
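A variable bandwidth KDE of this kind can be sketched as follows. The exact bandwidth formula is not legible in this text, so the block assumes the common square-root law, in which each local bandwidth is the global bandwidth times the square root of the ratio of the geometric mean of the pilot density values to the pilot value at that point. The data, the bandwidth value, and the function names are illustrative.

```python
import numpy as np

def gauss(a, b, s):
    """Normalized 1-D Gaussian kernel; s is a scalar or one bandwidth per center b_j."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / s) ** 2) / (np.sqrt(2.0 * np.pi) * s)

def vkde(grid, sample, sigma):
    """Variable bandwidth KDE with per-point bandwidths (assumed square-root law)."""
    pilot = gauss(sample, sample, sigma).mean(axis=1)   # standard KDE at the data
    lam = np.exp(np.mean(np.log(pilot)))                # geometric mean of pilot values
    local_sigma = sigma * np.sqrt(lam / pilot)          # wider kernels in sparse regions
    density = gauss(grid, sample, local_sigma).mean(axis=1)
    return density, local_sigma

rng = np.random.default_rng(3)
# Hypothetical bimodal sample: a narrow and a wide Gaussian component.
sample = np.concatenate([rng.normal(-1.0, 0.3, size=200), rng.normal(2.0, 1.0, size=200)])
grid = np.linspace(-20.0, 20.0, 4001)
dx = grid[1] - grid[0]
sigma = 0.3                                             # illustrative global bandwidth
density, local_sigma = vkde(grid, sample, sigma)
print("range of local bandwidths:", local_sigma.min(), local_sigma.max())
```

Under this rule the geometric mean of the local bandwidths equals the global bandwidth, so the global parameter keeps its interpretation as an overall scale.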
When implementing the fbKDE, there are a few considerations. First, when computing , the first term of , we must compute the integral . For computational considerations we assume and is the Lebesgue measure. This deviates from our theory, which requires finite for . Thus, for the Gaussian kernel, which we use in our experiments, this leads to . To obtain the weights we have to solve a quadratic program. We used the ADMM algorithm described in [28], with the aid of the projection algorithm from [29].
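For reference, the closed form of the kernel integral under Lebesgue measure can be written out. This assumes the normalized Gaussian kernel $k_\sigma(x, z) = (2\pi\sigma^2)^{-d/2}\exp(-\|x - z\|^2/(2\sigma^2))$, and the symbols $z_i$, $z_j$ for the kernel centers are chosen here for illustration; the normalization used in the paper may differ. The identity is the standard fact that the convolution of two Gaussians with variance $\sigma^2$ is a Gaussian with variance $2\sigma^2$.

```latex
\int_{\mathbb{R}^d} k_\sigma(x, z_i)\, k_\sigma(x, z_j)\, dx
  \;=\; k_{\sqrt{2}\sigma}(z_i, z_j)
  \;=\; (4\pi\sigma^2)^{-d/2} \exp\!\left( -\frac{\|z_i - z_j\|^2}{4\sigma^2} \right)
```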
We examine a few benchmark real datasets as well as synthetic data from 1-dimensional densities. The synthetic densities are the triangular density as well as three Gaussian mixtures, a bimodal, a trimodal and a kurtotic, as shown in Figure 2. We computed the parameters , , and in two different ways. First we used rules of thumb. For , we used Silverman’s rule of thumb ([4]). For we used, based on the convergence rates, for and for . For we used, inspired by the Jaakkola heuristic [30], the median distance to the nearest neighbor. Second we used a -fold CV procedure over parameter combinations drawn randomly from , with for and otherwise (see [31]). for , and for , where the subscript indicates logarithmic spacing and where the range is chosen thus since the data is standardized and, for the range, informed by the convergence rates. Finally, we used to be of the original data size for training and the rest for testing. We compute the value as , where is the test set. Figure 2 shows the bimodal density, the fbKDE, KDE, and vKDE along with the values for the fbKDE. Table 1 shows the error as well as the error, where for some function , .
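The rule-of-thumb choices and a held-out score can be sketched for one-dimensional data. The block assumes the usual form of Silverman's rule, $0.9 \min(\text{std}, \text{IQR}/1.34)\, n^{-1/5}$, and a held-out score of the form "squared $L^2$ norm of the estimate minus twice its average over the test set", which is an assumption here since the formula is not legible in this text. It is computed for the standard Gaussian KDE only, and the 80/20 split is likewise a hypothetical choice.

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's rule of thumb for 1-D data."""
    q75, q25 = np.percentile(x, [75, 25])
    spread = min(np.std(x, ddof=1), (q75 - q25) / 1.34)
    return 0.9 * spread * len(x) ** (-0.2)

def median_nn_distance(x):
    """Median distance from each point to its nearest neighbor (1-D)."""
    gaps = np.diff(np.sort(x))
    left = np.concatenate(([np.inf], gaps))
    right = np.concatenate((gaps, [np.inf]))
    return np.median(np.minimum(left, right))

def gauss(a, b, s):
    """Normalized 1-D Gaussian kernel with bandwidth s for all pairs (a_i, b_j)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / s) ** 2) / (np.sqrt(2.0 * np.pi) * s)

def heldout_score_kde(train, test, sigma):
    """Squared L2 norm of the Gaussian KDE minus twice its mean over the test set."""
    sq_norm = gauss(train, train, np.sqrt(2.0) * sigma).mean()
    return sq_norm - 2.0 * gauss(test, train, sigma).mean()

rng = np.random.default_rng(4)
data = rng.normal(size=1000)
data = (data - data.mean()) / data.std()      # standardize, as in the experiments
train, test = data[:800], data[800:]          # hypothetical 80/20 split

sigma = silverman_bandwidth(train)
print("Silverman bandwidth:", sigma)
print("median nearest-neighbor distance:", median_nn_distance(train))
print("held-out score of the KDE:", heldout_score_kde(train, test, sigma))
```

Scores of this form are negative for a good estimate, since they approximate minus the squared $L^2$ norm of the density, which is consistent with the sign of most entries in Table 1.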
In Figure 2 the density has two Gaussian elements with different widths. It is difficult for KDE and even for the vKDE to approximate such a density. The fbKDE, however, is able to approximate components of different smoothness by making some weights negative. These weights subtract excess mass in regions of refinement. Note that by doing so the fbKDE may itself overshoot and become negative. A similar behavior is exhibited for other densities, in which smoothness varies across regions. In Table 1 we report the performance of the three estimators for both CV and rule of thumb. Note the fbKDE often performs better in terms of the , and when the bandwidth is chosen according to a rule of thumb. The fbKDE outperforms both the KDE and vKDE in about half of the cases.
Finally we show, for the bimodal density, a comparison of the performance as the sample size grows. We have chosen the parameters according to the rules of thumb discussed above. Table 2 presents the errors. Note that as the sample size grows the KDE and vKDE do not significantly improve, even though the bandwidth is being updated according to Silverman’s rule. The fbKDE leverages new observations and refines its approximation, and this effect is more obvious for the case. Indeed, the error for fbKDE is smaller at than at for KDE and vKDE at . Similar results hold for the other synthetic datasets. This highlights a notable property of the fbKDE, that it can handle densities with differing degrees of smoothness.
6 Conclusion
We have presented a new member of the family of kernel estimators, the fixed bandwidth kernel density estimator, with desirable statistical properties. In particular, we showed the fbKDE is consistent for fixed kernel parameter , and we provided convergence rates. The fbKDE is a good alternative to the KDE in cases where computing an optimal bandwidth is difficult and for densities that are hard to approximate with inappropriate bandwidths. In these cases and as is shown in the experimental section, the fbKDE can greatly improve on the KDE. The way in which fbKDE achieves a more refined approximation is by balancing properly placed positive and negative weights, sometimes outside of the original support, which is facilitated by the variables, and which is not possible with the standard KDE. A few problems of interest remain open. We have illustrated two possible rate of convergence results, but expect many other such results are possible, depending on the choice of kernel and smoothness assumptions. It also remains an open problem to extend our results to more general domains and to dependent data.
7 Proofs
7.1 Oracle Inequality
First recall the oracle inequality lemma:
Lemma 1
Let . Let satisfy assumption A1. Let satisfy assumption A2 and be as in Equation (1). Let . With probability ,
[TABLE]
Proof of Lemma 1.
Recalling the following definition from Section 2, we have
[TABLE]
and
[TABLE]
where we have used the simplified notation to represent . To simplify notation further, we let represent and use for , and the same goes for . We now look at the probability that is close to . We have
[TABLE]
Now let and note that . We have
[TABLE]
by the independence assumption of A1. Furthermore,
[TABLE]
We now bound the term inside the integral. Since this quantity only depends on , through , we abbreviate it as , where . First, note that for any . Also, by assumption there is a such that for all . Hence,
[TABLE]
where we have used Hoeffding’s inequality. So we obtain
[TABLE]
and
[TABLE]
Therefore, letting , for any
[TABLE]
So, with probability , , and . Recall that for all . Then with probability ,
[TABLE]
If we substitute we obtain the desired result. ∎
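For reference, the two-sided form of Hoeffding's inequality used in arguments of this type is stated below, together with the deviation level obtained by setting the bound equal to a confidence parameter. The symbols $W_j$, $C$, $\epsilon$, and $\delta$ are chosen here for illustration, and the constants in the paper's bound may differ.

```latex
% Assumption: W_1, ..., W_n are independent with 0 <= W_j <= C.
\mathbb{P}\left( \left| \frac{1}{n}\sum_{j=1}^{n} W_j
   - \mathbb{E}\left[ \frac{1}{n}\sum_{j=1}^{n} W_j \right] \right| \ge \epsilon \right)
  \;\le\; 2\exp\left( -\frac{2 n \epsilon^2}{C^2} \right),
\qquad
\epsilon = C\sqrt{\frac{\log(2/\delta)}{2n}}
  \;\Longrightarrow\; \text{right-hand side} = \delta .
```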
7.2 Consistency of
In the following we will make use of the fact that, for continuous positive definite radial kernels, the RKHS norm dominates the -norm which in turn dominates the norm. We state this as a lemma.
Lemma 5.
Let , , satisfy assumptions A0 and A2. Then for any we have
[TABLE]
Proof.
By A0 and A2 is bounded and continuous, and by Lemma 4.28 of [23], so is every element of . Hence, for and for either compact or all of the essential supremum equals the supremum, so we obtain
[TABLE]
where the penultimate inequality follows from the reproducing property and the last inequality is just Cauchy-Schwarz. ∎
Now to prove Theorem 1 we need a couple of intermediate lemmas. First, recall
[TABLE]
and
[TABLE]
Note that is dense in ([23]). Assume is dense in . Then, given and , there is an and of the form
[TABLE]
such that and . We use these functions to bound :
[TABLE]
and show the last two terms are small in the following lemma.
Lemma 6.
Let . Let satisfy assumption A2 and . Then
[TABLE]
Proof.
If is compact, then is dense in (see [22]). Therefore, for fixed , there is an such that
[TABLE]
and by Lemma 5
[TABLE]
If , [22] tells us is dense in , so it directly follows that, for any , there is an satisfying
[TABLE]
Similarly, since is dense in [23], for any fixed there is an such that
[TABLE]
hence, by Lemma 5
[TABLE]
Therefore:
[TABLE]
∎
Note that implies for some and where for all . To make the first term small, we first quantify the continuity of the kernel . Let and define
[TABLE]
where is the Lipschitz constant of . Then for every and in we have that implies .
Recall . With the above result in hand we now have to make sure that at least a subset of the centers of are close to the centers of with high probability. First, define and define . Then we obtain the following lemma:
Lemma 7.
Let and , and , as above. Then
[TABLE]
for all , where .
Proof of Lemma 7.
Let the event . Then
[TABLE]
where . Let and recall that for we have . Hence , and
[TABLE]
For this term to approach zero we need to be strictly positive. This follows from the assumption that . Since and only depend on and other constants, we get
[TABLE]
as . ∎
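The type of covering argument used here can be illustrated with a small simulation. The setup is an assumption made for illustration only: three hypothetical target centers in $[0, 1]$, samples drawn uniformly on $[0, 1]$, and a closeness threshold `eta`. The union bound shown is the generic one for this setup and is not claimed to match the constants of Lemma 7.

```python
import numpy as np

rng = np.random.default_rng(2)
centers = np.array([0.2, 0.5, 0.8])      # hypothetical target centers in [0, 1]
eta = 0.05                               # required closeness
trials = 2000

def failure_rate(n):
    """Fraction of trials in which some center has no sample within eta."""
    z = rng.uniform(0.0, 1.0, size=(trials, n))
    dist = np.abs(z[:, :, None] - centers[None, None, :])    # (trials, n, m)
    covered = (dist < eta).any(axis=1)                       # (trials, m)
    return 1.0 - covered.all(axis=1).mean()

p0 = 2.0 * eta    # P(|Z - c| < eta) for uniform Z and an interior center c
rates = {}
for n in (10, 50, 200):
    rates[n] = failure_rate(n)
    union_bound = len(centers) * (1.0 - p0) ** n
    print(f"n={n}: empirical failure rate {rates[n]:.4f}, union bound {union_bound:.4g}")
```

The failure probability decays geometrically in the sample size as long as each center has strictly positive probability mass in its neighborhood, which is the role played by the positivity condition in the proof.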
Lemma 8.
Let , satisfy A0, satisfy A1, and satisfy A2. Then, such that
[TABLE]
for sufficiently large .
Proof.
Let . With probability we have that for every there is an such that . Then, for defined as
[TABLE]
we have
[TABLE]
Note that if for two sequences we have for , then , granted such limits exist. Let and . Note that for large enough, say , and therefore . So we can see that for , the inequality
[TABLE]
holds with probability . That is, for , , hence since , we get . ∎
Proof of Theorem 1.
[TABLE]
By Lemma 1 the middle term approaches zero and by Lemma 8 the last term does, so
[TABLE]
where . ∎
7.3 Convergence Rates of
The proof of Lemma 3 is found in [24]. Now let and , where , , and , , for , are as in Lemma 3. Recall Lemma 4:
Lemma 4
Let , let and let and be as above. Let satisfy assumption A1, then with probability
[TABLE]
where , for a constant independent of and . ∎
Proof of Lemma 4.
Let , where is the Lipschitz constant of the . From Lemmas 7 and 8 we know that with probability the event that for all there is a data point such that holds. Recall . Now
[TABLE]
where is the volume of the -dimensional unit ball.
Now pick the coefficients as in Lemma 7. With probability we have:
[TABLE]
Fixing and setting , where , we obtain
[TABLE]
Finally, noting that , where is as in Lemma 3, yields the desired result. ∎
We now prove Theorem 2.
Proof of Theorem 2.
We will use the notation and results of Lemmas 2, 3 and 4. First, note that from Lemma 3 we have So putting the three Lemmas together we have that with probability
[TABLE]
or, taking into account only the dependence on and for the different ’s, we have that is of the order of
[TABLE]
Now, we want the number of centers in close to but no larger than the number of data points, so we set for some such that . Furthermore, we need to grow accordingly, so that , for a constant possibly dependent on . This yields, ignoring the terms for now:
[TABLE]
Setting the first two rates equal we obtain . Note that if we can match the third term by setting to obtain an overall rate of . Otherwise we can set to any small number to obtain a rate slightly slower than . Finally, for the terms, just let and note that if then dominates, and if then dominates. ∎
7.4 Box rates
First recall the definition of from Section 4.3: For any and let be such that , then
[TABLE]
And, also from Section 4.3: , , where are the Lipschitz functions on , is the Lebesgue measure and is the Lipschitz constant of . Also, is the box kernel defined on , and for simplicity we assume for a positive integer.
Theorem 3
Let , , satisfy A1 and . Let , then, with probability greater than or equal to
[TABLE]
∎
To prove this theorem we begin by reformulating Lemma 1:
Lemma 9.
Let . Let satisfy A1 and satisfy . If is as above, then with probability
[TABLE]
where . ∎
We now take care of the term .
Lemma 10.
Let . For any there is a function of the form , where and satisfying
[TABLE]
where . ∎
Proof of Lemma 10.
Let be a multi-index with positive elements and associated index related by the function :
[TABLE]
and its inverse
[TABLE]
Divide into hypercube regions of equal volume to form the partition , where . Now, let
[TABLE]
where . Any choice of works but for clarity we choose . Note that is close to :
[TABLE]
Hence
[TABLE]
Now we note that can also be expressed as a sum of fixed bandwidth kernels:
[TABLE]
where
[TABLE]
and is as follows. Let and
[TABLE]
for , where and are multi-indices. Note that the ’s sequentially capture the residual of the function as we travel along the regions.
To find note that since , we have . Also note for , hence for
[TABLE]
For larger we have . Note that when we have lost influence of , so
[TABLE]
and, similarly, for . The process continues such that, in general
[TABLE]
for . Adding these we obtain . Denote this quantity by . Then this process is repeated over every dimension, having chunks of multiples of , the first multiple is , the second , and so on. The final sum is then
[TABLE]
Therefore
[TABLE]
∎
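The idea of capturing residuals sequentially can be sketched in one dimension. The block below is an illustration under assumptions that may differ from the construction in the proof: the domain is $[0, 1)$, the boxes are one-sided unnormalized indicators of fixed width `sigma` equal to `q` grid cells, the centers sit at the left endpoints of the cells, and the target function is a hypothetical Lipschitz function.

```python
import numpy as np

M = 20                  # number of cells partitioning [0, 1)
q = 5                   # each box spans q cells, so sigma = q / M stays fixed as M grows
h = 1.0 / M
sigma = q * h

def target(x):
    """Hypothetical Lipschitz function with Lipschitz constant 2 * pi."""
    return np.sin(2.0 * np.pi * x) + 1.5

cell_values = target(np.arange(M) * h)    # value at the left endpoint of each cell

# beta[m] is the residual left in cell m after the earlier boxes that still cover it.
beta = np.zeros(M)
for m in range(M):
    beta[m] = cell_values[m] - beta[max(0, m - q + 1):m].sum()

def box_sum(x):
    """Weighted sum of boxes 1{z_j <= x < z_j + sigma} with centers z_j = j * h."""
    z = np.arange(M) * h
    inside = (x[:, None] >= z[None, :]) & (x[:, None] < z[None, :] + sigma)
    return inside.astype(float) @ beta

midpoints = (np.arange(M) + 0.5) * h
approx = box_sum(midpoints)
print("max deviation from the piecewise-constant values:",
      np.max(np.abs(approx - cell_values)))
print("max deviation from the target at cell midpoints:",
      np.max(np.abs(approx - target(midpoints))))
print("number of negative weights:", int((beta < 0).sum()))
```

Although each box is wider than a cell, the weighted sum reproduces the piecewise-constant values on every cell, so the approximation error is governed by the cell width and the Lipschitz constant rather than by the bandwidth. Negative weights are what make this possible.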
Proof of Theorem 3.
This proof is similar to the proof of Theorem 2. Combining Lemmas 9, 10, setting , ignoring the terms for now and considering only the dependence on and we obtain
[TABLE]
Setting these terms equal we obtain , with an overall rate of . Adding the log term we obtain
[TABLE]
∎
References
- [1] E. Fix and J. L. Hodges Jr, “Discriminatory analysis-nonparametric discrimination: consistency properties,” USAF School of Aviation Medicine, Report No. 4, Tech. Rep. 21-49-004, 1951.
- [2] M. Rosenblatt, “Remarks on some nonparametric estimates of a density function,” The Annals of Mathematical Statistics, vol. 27, no. 3, pp. 832–837, 1956.
- [3] E. Parzen, “On estimation of a probability density function and mode,” The Annals of Mathematical Statistics, vol. 33, no. 3, pp. 1065–1076, 1962.
- [4] B. W. Silverman, Density Estimation for Statistics and Data Analysis. London: Chapman and Hall, 1986.
- [5] D. W. Scott, Multivariate Density Estimation. New York: Wiley, 1992.
- [6] M. P. Wand and M. C. Jones, Kernel Smoothing. Chapman & Hall/CRC, 1994.
- [7] L. Devroye and G. Lugosi, Combinatorial Methods in Density Estimation. New York: Springer, 2001.
- [8] D. Wied and R. Weißbach, “Consistency of the kernel density estimator: a survey,” Statistical Papers, vol. 53, no. 1, pp. 1–21, 2012.
