On $L_2$-consistency of nearest neighbor matching
James Sharpnack

TL;DR
This paper proves that nearest neighbor matching (NNM) is $L_2$-consistent in finite dimensions without requiring smoothness or boundedness, aiding statistical inference with biased samples.
Contribution
It establishes the $L_2$-consistency of NNM under minimal assumptions, expanding understanding of its theoretical properties in biased sampling scenarios.
Findings
NNM is $L_2$-consistent without smoothness or boundedness assumptions
Discussion of applications and limitations of NNM
Comparison of NNM with inverse probability weighting
Abstract
Biased sampling and missing data complicate statistical problems ranging from causal inference to reinforcement learning. We often correct for biased sampling of summary statistics with matching methods and importance weighting. In this paper, we study nearest neighbor matching (NNM), which makes estimates of population quantities from biased samples by substituting unobserved variables with their nearest neighbors in the biased sample. We show that NNM is $L_2$-consistent in the absence of smoothness and boundedness assumptions in finite dimensions. We discuss applications of NNM, outline the barriers to generalizing this work to separable metric spaces, and compare this result to inverse probability weighting.
| n | 16 | 64 | 256 | 1024 | 4096 | 16384 |
|---|---|---|---|---|---|---|
| NNW Mean | 0.990 | 1.155 | 1.236 | 1.284 | 1.312 | 1.330 |
| NNW MSE | 0.149 | 0.045 | 0.016 | 0.006 | 0.002 | 0.001 |
| IPW Var | 1.178 | 0.485 | 0.097 | 0.048 | 0.032 | 0.032 |
| Example | | | | | |
|---|---|---|---|---|---|
| 1. Beta | | | | | |
| 2. Gaussian | | | | | |
| 3. Fat Cantor | | | | | |
Author affiliation: James Sharpnack, Amazon AWS. Work done prior to joining Amazon.
1 Introduction
Issues of representation in sampling have plagued data analysis ever since the first survey was taken. Biased sampling can be cast as a missing data problem, where data from the population of interest are partially missing, and data from the biased sample are non-missing. Let be the missing at random variable of interest, then we would like to estimate its population mean. Suppose that we have covariates and we observe one iid sample, , from the missing population and a sample of pairs, from the non-missing biased distribution (let be the corresponding sample of s). When we know the missing density, $\mu$, and non-missing density, $\nu$ (to remember this, think $\mu$issing and $\nu$on-missing), we can estimate population level statistics with inverse probability weighting (IPW) and trimmed variants [15, 17]. When $\mu, \nu$ are unknown, our setting, we typically use matching methods or we estimate the density ratio and plug it into IPW. These methods construct a weight for each element that is dependent on , and form an estimate . In this work, we study one of the simplest matching methods, nearest neighbor matching (NNM), which sets to be the proportion of elements in for which is its nearest neighbor within . We will show that it is consistent under minimal conditions.
Contributions. Our theory is broken down into three settings of increasing difficulty and realism: (1) Known , noiseless , the only source of randomness is in the non-missing sample ; (2) Unknown , noiseless ; (3) Unknown , noisy , this is the standard setting of NNM. One key lemma to establish (3) is related to the measure of Voronoi cells from our biased sample, and we provide a more complete characterization which may be of separate interest. Further, we discuss the conditions and contrast NNM to IPW, highlight the barriers to generalizing the theory beyond finite dimensional spaces, and discuss applications to missing data problems.
1.1 Main Theoretical Results
If the missingness is completely at random (MCAR), then there is no dependence between the missingness and the variables . In this case, nothing special needs to be done in order to consistently estimate the mean of the missing ; we can simply compute the empirical counterpart on the non-missing data. Instead, we will assume the more realistic situation that our data is missing at random (MAR), which means that the missingness is independent of conditional on the covariate .
We observe iid covariates from density , and from density . Throughout we will assume that the variables are continuous and -dimensional. We will discuss later the numerous mathematical challenges of generalizing beyond this setting. For the non-missing data, we observe from the common distribution of (due to the MAR assumption). Our goal is to estimate,
[TABLE]
NNM can be expressed using Voronoi cells. A Voronoi cell contains the points in such that is the closest member of ,
[TABLE]
where (we will ignore ties because they are measure 0). Let be the proportion of within . NNM estimates by where . Throughout, define the measures and .
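The weighting scheme described above can be sketched in a few lines. This is only an illustrative brute-force version, not the author's implementation: the function names (`nearest_index`, `nnm_weights`, `nnm_estimate`) and the toy data are assumptions introduced here, and Euclidean distance is assumed as the metric.

```python
import math

def nearest_index(point, sample):
    """Index of the element of `sample` closest to `point` (Euclidean distance)."""
    return min(range(len(sample)), key=lambda i: math.dist(point, sample[i]))

def nnm_weights(nonmissing_x, missing_x):
    """Weight of each non-missing point: the proportion of missing-sample
    covariates that fall in its Voronoi cell (i.e. have it as nearest neighbor)."""
    counts = [0] * len(nonmissing_x)
    for x in missing_x:
        counts[nearest_index(x, nonmissing_x)] += 1
    return [c / len(missing_x) for c in counts]

def nnm_estimate(nonmissing_x, nonmissing_y, missing_x):
    """NNM estimate: weighted sum of the observed responses."""
    weights = nnm_weights(nonmissing_x, missing_x)
    return sum(w * y for w, y in zip(weights, nonmissing_y))

# Hypothetical toy data: three non-missing pairs and four missing covariates.
nonmissing_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
nonmissing_y = [10.0, 20.0, 30.0]
missing_x = [(0.1, 0.1), (0.9, 0.1), (0.8, -0.2), (0.1, 0.7)]

weights = nnm_weights(nonmissing_x, missing_x)
estimate = nnm_estimate(nonmissing_x, nonmissing_y, missing_x)
print(weights, estimate)
```

The brute-force search costs one pass over the non-missing sample per missing point; in practice one would presumably replace `nearest_index` with an exact or approximate nearest neighbor index.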
Known , noiseless . Throughout we will let , the true regression function. We require that is integrable, and will place moment conditions on it later. In the noiseless setting almost surely (AS). If we know then we do not need to rely on a finite sample , instead we use the weight . In this special case, we will call the NNM estimator the 1NN measure of and denote it,
[TABLE]
where is the nearest neighbor of within . One should think of as a biased sampling analogue to the empirical measure (typically denoted ).
Recall that the Renyi divergence for is
[TABLE]
Furthermore, we can take to obtain the KL-divergence (). Throughout we will assume that are Hölder conjugates ().
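For reference, one common convention for the Renyi divergence is written out below. The displayed formula of the source did not survive extraction, so the order symbol $\alpha$ and the exact normalization are assumptions and may differ from the paper's; $\mu$ and $\nu$ denote the missing and non-missing densities.

```latex
D_\alpha(\mu \,\|\, \nu)
  = \frac{1}{\alpha - 1} \log \int \left( \frac{\mu(x)}{\nu(x)} \right)^{\alpha} \nu(x) \, dx,
  \qquad \alpha > 0, \ \alpha \neq 1,
\\
\lim_{\alpha \to 1} D_\alpha(\mu \,\|\, \nu)
  = \int \mu(x) \log \frac{\mu(x)}{\nu(x)} \, dx
  = \mathrm{KL}(\mu \,\|\, \nu),
\\
D_2(\mu \,\|\, \nu) = \log\bigl( 1 + \chi^2(\mu \,\|\, \nu) \bigr),
  \qquad \chi^2(\mu \,\|\, \nu) = \int \frac{\mu(x)^2}{\nu(x)} \, dx - 1 .
```

Under this convention, a finite order-2 divergence is the same as a finite Chi-square divergence.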
Assumption 1.
Let be a constant then assume that the Renyi divergence is finite,
[TABLE]
Assumption 2.
Let . The test function is measurable and has finite moment,
[TABLE]
Notice that a bounded is much less restrictive than assuming that the density ratio is bounded uniformly. Furthermore, we make no smoothness assumptions on .
Theorem 1.
Under Assumptions 1, 2, the 1NN measure is consistent for , namely,
[TABLE]
in norm as .
Unknown , noiseless . The main difference between this setting and the previous one is that the NNM weights require an estimate of . The following is a relatively simple corollary to Theorem 1, but we state it here because it highlights the additional assumptions in this setting.
Theorem 2.
Under Assumptions 1, 2, the NNM estimate is consistent for noiseless , namely,
[TABLE]
in norm as .
Unknown , noisy . In this problem, we assume that where are independent with mean 0 and variance bounded by . This result hinges on our characterization of the Voronoi cells, and requires that the Chi-square divergence is finite.
Theorem 3.
[TABLE]
in norm as .
In the following section, we will contrast and relate these results to prior work.
1.2 Comparison to prior work
Matching methods for causal inference and missing data are appealing due to their relative simplicity, interpretability, and computational efficiency [14]. Matching can be done with replacement, where multiple missing samples can match to the same non-missing samples, such as NNM, or without replacement, such as optimal matching [21]; it has been shown that in some circumstances matching without replacement is inconsistent [22]. Increasingly, NNM has surfaced in machine learning applications such as covariate shift correction in classification [16], model based conditional independence testing [23], and deep clustering [7]. NNM is amenable to massive data applications due to fast approximate NN indexing and retrieval (for example, [20, 18]), and software for NNM and extensions have been developed [1].
The statistical efficiency of NNM has been studied in the statistics and econometrics literature. In [2], it was shown that despite its popularity, in more than one dimension NNM (and other similar matching methods) has a bias term that converges at a rate of , markedly worse than the optimal rate under Lipschitz continuity assumptions and bounded propensity score (the conditional probability of missingness). For this reason, corrective measures have been studied, such as bias-correction [3], where an overparametrized linear model is used to construct an additive bias estimate. Other works responding to the negative results of [2] form an estimate of the propensity score and use NNM in this one-dimensional space [11, 5]; however, this requires an accurate estimate of the propensity score. To the best of our knowledge, it remained unknown whether NNM is even consistent without smoothness assumptions or a bounded density ratio (the results in this work).
As we will see, Theorem 1 relies on existing results on nearest neighbor regression theory. To illustrate this, suppose stronger assumptions hold ( is -Lipschitz and bounded); then . We know by classical studies of KNN [6] that the 1NN approaches the test point, and boundedness gives us dominated convergence. Since that work, much has been discovered about KNN that we can potentially build on. Of direct relevance to this work is [10], which studies the consistency of KNN regression in general metric spaces. Intermediate results from [12, 13] are relevant as well, particularly the density of Lipschitz functions in for separable metric spaces. We will argue in Section 4 that, while promising, these results are insufficient to prove our desired theorems.
We will see that to prove the final result, Theorem 3, we will require a characterization of the Voronoi cells. [8] provides an analogous result for unbiased sampling, and we extend this result to the biased sampling setting. They find that when and bound the limiting second moment. We extend this to find that , which implies that NNM is unbiased for IPW. As mentioned, by [2] we know that this bias converges at a suboptimal rate under smoothness and boundedness assumptions. It is unknown, and outside of the scope of this work, if without these assumptions the optimal rate remains or if this is information theoretically impossible.
2 Asymptotic measure of Voronoi cells
The main observation that we make in this section is that the -measure of a Voronoi cell of samples from approaches the density ratio. These results rely heavily on the finite dimensional setting, and they are used to prove NNM consistency in the unknown , noisy case (Theorem 3). In fact, that result only requires Lemma 2 (3), but we provide our full result here because it is enlightening. This result can be interpreted as saying that NNM is unbiased for importance sampling in the limit, since is the expected importance weight. To make this leap, we need to be specific regarding our Lebesgue points, which form a measure 1 set over which this limit holds. We follow this with our Voronoi cell result.
Lemma 1.
Let be any probability measure with a density over . There exists a set such that and the following properties hold:
(1) For any and , and
(2) for any and any there exists an depending on such that if for some then we have that .
Lemma 2.
Under Assumption 1, in expectation, the -measure of a Voronoi cell around conditional on converges to the density ratio in the limit, namely,
[TABLE]
for -almost all (where the Lebesgue points are those described in Lemma 1 for ). Furthermore, we have the following bound,
[TABLE]
Remark 1.
The proof, in the appendix, borrows some tricks from the corresponding result in [8], although we must adapt their proof to accommodate . There seems to be an issue with the validity of the proof of Theorem 2.1(i) in [8], particularly how the Lebesgue differentiation theorem applied to fixed points and can then translate to the similar result uniformly over (which is a draw from ). Our more complete study of the Lebesgue points in Lemma 1 resolves this potential oversight, completing and generalizing the proof.
Proof sketch of Lemma 2.
Recall that is the Voronoi cell around . As was done in [8] (in the case that ), we observe that for ,
[TABLE]
where . In the appendix, we use the Lebesgue differentiation theorem (LDT) and Lemma 1 to make precise the following string of approximations
[TABLE]
and it is straightforward to see that has a uniform distribution, which after some derivation gives us (2). In order to establish (3), we follow a similar procedure. ∎
We can see the necessity of the assumption that these have densities with respect to the Lebesgue measure due to our use of used in (4).
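The claim that the scaled measure of a Voronoi cell tracks the density ratio on average can be checked numerically in one dimension, where each cell is the interval between midpoints to the neighboring sample points. The sketch below is an illustration under assumed distributions that are not taken from the source: $\mu$ is Uniform(0, 1), $\nu$ is Beta(2, 1) with density $2x$, so the density ratio is $1/(2x)$. The helper names are hypothetical.

```python
import random

def voronoi_measures_1d(points, cdf):
    """mu-measure of each 1-D Voronoi cell, returned in the order of `points`.
    `cdf` is the distribution function of mu."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    measures = [0.0] * len(points)
    for rank, i in enumerate(order):
        # Cell boundaries are the midpoints to the neighboring sample points.
        if rank == 0:
            lower = float("-inf")
        else:
            lower = (points[order[rank - 1]] + points[i]) / 2
        if rank == len(order) - 1:
            upper = float("inf")
        else:
            upper = (points[i] + points[order[rank + 1]]) / 2
        measures[i] = cdf(upper) - cdf(lower)
    return measures

def uniform_cdf(x):
    """CDF of the assumed mu = Uniform(0, 1)."""
    return min(1.0, max(0.0, x))

rng = random.Random(0)
n = 2000
sample = [rng.betavariate(2, 1) for _ in range(n)]  # assumed nu = Beta(2, 1)
measures = voronoi_measures_1d(sample, uniform_cdf)

# Compare n * mu(cell) with the density ratio 1 / (2x), averaged over a bin.
in_bin = [i for i in range(n) if 0.4 <= sample[i] <= 0.6]
avg_scaled_measure = sum(n * measures[i] for i in in_bin) / len(in_bin)
avg_density_ratio = sum(1 / (2 * sample[i]) for i in in_bin) / len(in_bin)
print(avg_scaled_measure, avg_density_ratio)
```

Individual cells fluctuate considerably around the density ratio, which is why the comparison is made on averages over a bin rather than pointwise.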
3 Proving $L_2$-consistency of NNM
This section is primarily devoted to proving $L_2$-consistency in the known , noiseless case, Theorem 1. In order to prove Theorems 2, 3 we control the additional randomness due to and noisy . Both proofs are in the appendix, so the results are not restated here. For Theorem 3, we require Lemma 2. In short, the variance of the summand in , , is bounded by so we need to control the squared -measure of the Voronoi cells.
We will divide the proof of Theorem 1 into two thrusts: demonstrating asymptotic unbiasedness and diminishing variance. We will discuss in Section 4 how these results might be able to generalize to separable metric spaces.
Asymptotic unbiasedness of follows almost immediately from finite dimensional nearest neighbor theory [4] and Hölder’s inequality. We give a proof sketch here to highlight how it might easily generalize to metric spaces, in the event that new NN regression theory is developed.
Theorem 4.
Let be Hölder conjugates, suppose Assumption 1 and that . Then
[TABLE]
Proof.
By Hölder’s inequality,
[TABLE]
by Lemma 6 (see [4]) from classical NN theory, we have that
[TABLE]
which completes the proof (in fact it shows convergence). ∎
One can gain a better intuition by proving this using Lemma 2. Specifically, the expected 1NN measure is,
[TABLE]
We have pointwise convergence by (2),
[TABLE]
almost everywhere, and the RHS has expectation . What remains is to show dominated convergence (see the alternative proof of Theorem 4 in Appendix). We also demonstrate in the Appendix using instructive examples that for finite the bias is unavoidable. These are typically cases where the LDT has non-uniform convergence (see (4)).
Diminishing variance. We have established that the 1NN measure is asymptotically unbiased, but $L_2$-consistency remains to be shown. Our main tool for showing this consistency is the following variance bound, which holds without any assumptions other than those stated within. Lemma 3 demonstrates that as long as and are not too dissimilar, the variance of the 1NN measure is bounded by the discrepancy between the first and second nearest neighbor interpolants.
Lemma 3.
Let be Hölder conjugates then,
[TABLE]
The fact that are densities or even over is actually not required. If one were to replace with the Radon-Nikodym derivative then the result would still hold. We conclude this subsection by showing that the 1NN measure has diminishing variance.
Theorem 5 (1NN measure variance).
Under Assumptions 1 and 2, we have that
[TABLE]
Proof.
In Lemma 7 (in Appendix) we establish that under Assumption 2 we have that for ,
[TABLE]
This result uses lemmata from the study of nearest neighbors regressors in [4]. Under Assumption 1, we have that is bounded. Applying Lemma 3 we reach our conclusion. ∎
4 A closer look at the results and their assumptions
This section demonstrates some implications and potential generalizations of the above results. First, we discuss Assumptions 1, 2 and show that the 1NN measure is $L_2$-consistent in situations where IPW is not. Second, we discuss potential generalizations to separable metric spaces and the major places in which the finite dimensional assumption is required in this work.
4.1 Comparison to Inverse Probability Weighting (IPW)
Comparing consistency conditions. For this comparison, it is sufficient to consider the known , noiseless case. We will see that there are situations in which NNM achieves consistency where IPW is not guaranteed consistency. The IPW estimate can be expressed as where . The weak law of large numbers states that if , i.e. has finite second moment, then we have that . Hence, we can compare this condition,
[TABLE]
to Assumptions 1, 2. To provide a natural comparison, we will use Hölder’s inequality to obtain,
[TABLE]
as a stronger IPW condition, that is tight for some examples. Notice that this is a stronger condition than the Assumptions 1, 2, leaving us with the result that the 1NN measure is -consistent in situations where -consistency of IPW is not guaranteed.
Example where NNM is better than IPW. We construct one such example from the Student’s t-distribution. Let be -distributed with the degrees of freedom , be distributed, and . The density ratio . Then (IPW Condition) does not hold:
[TABLE]
since does not have finite fourth moment (for a constant ). However, we can select and to see that Assumptions 1, 2 hold since,
[TABLE]
and , both because has finite third moment. We can see that in simulation this bears out and NNM has lower mean squared error than IPW (Table 1).
Of course, in this example, one would use the trimmed variant of the IPW [17], where we replace the IPW with . This trimming introduces bias, but as long as we can obtain consistency by letting (perhaps extremely slowly). It is worthwhile to remember that NNM does not require knowledge or an estimate of , while IPW and its trimmed variant do. One can interpret these observations as follows: NNM implicitly trims the importance weight, trading off more bias for less variance.
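To make the comparison concrete, the IPW estimate and a trimmed variant can be sketched as follows. The exact trimming rule of the source was lost in extraction; capping the density ratio at a threshold (`trim`) is an assumption here, as are the function name and the toy densities ($\mu$ uniform on $[0, 1]$, $\nu$ with density $2x$).

```python
def ipw_estimate(xs, eta, mu_density, nu_density, trim=None):
    """Average of eta(x) * mu(x) / nu(x) over a sample xs drawn from nu.
    If `trim` is given, the density ratio is capped at that value
    (an assumed form of the trimmed variant)."""
    total = 0.0
    for x in xs:
        ratio = mu_density(x) / nu_density(x)
        if trim is not None:
            ratio = min(ratio, trim)
        total += eta(x) * ratio
    return total / len(xs)

# Hypothetical toy setting: mu uniform on [0, 1], nu with density 2x, eta(x) = x.
xs = [0.25, 0.5, 1.0]
plain = ipw_estimate(xs, lambda x: x, lambda x: 1.0, lambda x: 2.0 * x)
trimmed = ipw_estimate(xs, lambda x: x, lambda x: 1.0, lambda x: 2.0 * x, trim=1.0)
print(plain, trimmed)
```

In this toy case trimming lowers the estimate, which illustrates the bias that the cap introduces in exchange for bounded weights. Note that both versions need the two densities as inputs, whereas NNM does not.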
4.2 Generalizing to separable metric spaces
The restrictiveness of requiring to be continuous and finite dimensional is striking when we compare these results to what we know about KNN classification [13] and Proto-NN [12]. In this section we will highlight all of the places in which the finite dimensionality (FD) assumption is used in this paper and discuss approaches to generalizing to separable metric spaces.
Noiseless . For the proof of Theorem 1, the only real place that the FD assumption was used is (5). In fact, we can use a recent result from [12] to establish consistency of the 1NN measure for separable metric spaces but under significantly more restrictive Assumptions than 1, 2. In that work, they show (Theorem 3) that ProtoNN is pointwise -consistent for classification, and in the proof they show that when is bounded AS
[TABLE]
This is exactly (5) with but with an additional boundedness assumption. If then the density ratio is also bounded AS, this implies that is -consistent (but not necessarily -consistent). Of course, a bounded density ratio and bounded dramatically weaken the result, making it not applicable to estimating expectations, variances, and many other moments, as well as not applicable to distributions such as normals, gammas, betas, etc.
It is worth attempting to weaken these assumptions and establish consistency using directly the proof techniques in [13], but there are specific barriers. First, one of the main tools used is the density of Lipschitz functions in (where now is a Borel measure). However, we would require that Lipschitz functions are dense in , which has not been established to the best of our knowledge (although we have no counter-example). Furthermore, the boundedness of is used to establish dominated convergence, and it is unclear how to get around this. To the best of our knowledge, establishing (5) under only moment assumptions in separable metric spaces is an open problem. Such a result would also be able to be used to tackle Theorem 2—unknown , noiseless .
Finally, the proof of Theorem 4.3 in [10] indicates that (5) may be established for for bounded functions in metric spaces that satisfy the Besicovitch condition. Of course, the boundedness condition violates our assumptions, but the proof of the extension of Stone’s theorem (Theorem 3.4) contains an infinite dimensional analogue of Lemma 6. However, that result relies on a somewhat opaque condition (iii’) and it is unclear if it can be generalized to -convergence, which is needed for Theorem 2. In summary, there are promising approaches to generalizing the noiseless case to metric spaces, however, it is safely outside of the scope of this work.
Noisy . The proof of Theorem 3 required the use of Lemma 2 (3). This was required to establish the convergence, , and it is unclear how to do this without our characterizations of the measure of Voronoi cells. This condition is unavoidable, because the conditional variance of for known and constant . As mentioned these results were heavily reliant on the FD assumptions and continuous , since we appealed to the translation invariance of the Lebesgue measure. Furthermore, the only precedent that we have of characterizing Voronoi cells is [8] which is also in the FD setting. As mentioned, it may be that a weaker result than Lemma 2 would be sufficient.
5 Applications to missing data problems
5.1 Imputation in massive databases
We will consider statistics that are aggregates of non-linear elementwise operations (i.e. empirical moments). Most common aggregations on database tables, such as sum, mean, variance, covariance, and count along with grouping operations and filters can be expressed in this way. Specifically, let be a partially missing random variable and be a possibly non-linear integrable function then we will focus on estimating the following functional,
[TABLE]
which is the expectation of for the missing population. For example, suppose we would like to express the following query, select mean(log(Z)) where X < 1 and Z = missing, we could use the function (in this example, ). Of course, we are not able to make such a query because it is based on unobserved data. NNM is equivalent to redefining and performing single imputation on the new with the nearest neighbor in space. However, this can be done implicitly by precomputing the NNM weights based on , and then computing for any arbitrary (without the need to recompute new weights for new ). The NNM weights need to be updated only when the index is modified via insert, delete, etc. These aggregate computations can be implemented with search indexing with approximate nearest neighbor, a standard technology for indexing in distributed databases.
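The point that the weights are computed once and then reused for any aggregate can be sketched as below. The class `NNMIndex` and its methods are hypothetical names introduced for illustration, the brute-force search stands in for an approximate nearest neighbor index, and filters on the covariate are omitted for brevity.

```python
import math

def nn_index(x, xs):
    """Index of the point in `xs` closest to `x` (Euclidean distance)."""
    return min(range(len(xs)), key=lambda i: math.dist(x, xs[i]))

class NNMIndex:
    """Precomputes NNM weights from covariates; aggregates can then be
    evaluated for arbitrary functions g without recomputing the weights."""

    def __init__(self, observed_x, observed_z, missing_x):
        self.observed_z = list(observed_z)
        # Match every missing row to its nearest observed row (single imputation).
        self.matches = [nn_index(x, observed_x) for x in missing_x]
        counts = [0] * len(observed_x)
        for j in self.matches:
            counts[j] += 1
        self.weights = [c / len(missing_x) for c in counts]

    def aggregate(self, g):
        """Weighted sum of g over observed values, using the stored weights."""
        return sum(w * g(z) for w, z in zip(self.weights, self.observed_z))

    def imputed_mean(self, g):
        """Same quantity computed by explicit imputation, for comparison."""
        return sum(g(self.observed_z[j]) for j in self.matches) / len(self.matches)

# Hypothetical toy table with a one-dimensional covariate.
index = NNMIndex(
    observed_x=[(0.2,), (0.5,), (0.9,)],
    observed_z=[1.0, 2.0, 4.0],
    missing_x=[(0.1,), (0.3,), (0.6,), (0.95,)],
)
mean_log2 = index.aggregate(math.log2)
mean_raw = index.aggregate(lambda z: z)
print(index.weights, mean_log2, mean_raw)
```

The two methods agree because the weighted sum is just the imputed average with identical terms grouped together, so only `aggregate` needs to run at query time.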
5.2 Imputation of the trans-Atlantic slave trade
The trans-Atlantic slave trade (TAST), also known as the middle passage, refers to the slave ship voyages that brought African slaves to the Americas. The middle passage is reported to have forcibly migrated over 10 million Africans to the Americas over a roughly 3 century time span. The number of slaves that embarked from Africa is especially important since the number of slaves taken from Africa can impact other estimates that result from this. For example, when estimating the population of Africa in a given decade, demographers will use population growth models and more recent census data [19]. However, the population growth was stifled by the slave trade, and without accounting for it past populations will tend to be underestimated because the growth rate is overestimated.
The database that we use is the 2010 extended version of the Voyages database [9]. There is a significant amount of missingness throughout the database: of the voyages have a missing number of slaves at embarkation, which is the partially missing variable of interest. We apply NNM to compute the total number of slaves taken from Africa using the number of slaves at arrival and the year of the voyage as covariates. In Figure 1, we can see the non-missing data and the 1NN imputed data (missing s filled in with their matched values). The NNM estimate of the total number of slaves taken from Africa is ,,, while the MCAR assumption over-estimates this at ,,.
5.3 Assessing test loss under covariate shift
When the training and test datasets in supervised learning have different covariate distributions, then we have covariate shift [24]. Let , be the training data, the validation data, and the test data. By training a predictor on , we can consider this fixed and obtain the validation losses for each . The test error can be estimated using NNM where is missing on the test data and non-missing on . Going beyond this, [16] has used NNM to perform domain adaptation where is directly trained using a test error estimate with NNM. However, to demonstrate the validity of this approach we require uniform laws of large numbers, a future direction of research. Similarly, finite sample rates of convergence would be required to establish generalization error bounds. Overall, such results are a natural followup to this work.
Appendix A Explanation and examples
We will examine a few examples which put this theory to the test, and see numerically the convergence guaranteed in Theorem 1. Our variance bound in Lemma 3 is determined by the norm of and . It is instructive to go over the outline of the proof of Lemma 2, because the proof indicates which models will yield more slowly diminishing bias than others.
Example 1.
Let be Beta and be Beta. Then and , hence the density ratio is diverging as . This is an example where have the same compact support. An unbounded density ratio causes challenges for the 1NN measure because it means that near 0 there is a significant amount of mass in but few data from to evaluate . We assessed the measure by Monte Carlo sampling with 1M samples from .
Figure 2 (right) depicts the density ratio and the -measure of the Voronoi cells. Because the Voronoi cells are random, we have that the measure is only on the average approaching the density ratio, and there is significant spread around the density ratio for a given location . Let , and we can see that and , satisfying the assumptions. Despite having unbounded density ratio, converges to its limiting expectation () as we can see in Table 2. We can see that the spread of is greater for the larger density ratios, and furthermore, for finite samples this is biased downward for near 0.
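A Monte Carlo assessment of this kind can be sketched as follows. The Beta parameters and test function of Example 1 were lost in extraction, so the choices below are assumptions made only for illustration: $\mu$ is Uniform(0, 1) (that is, Beta(1, 1)), $\nu$ is Beta(2, 1), and the test function is the identity, whose mean under $\mu$ is 0.5. The number of Monte Carlo draws is also reduced from the 1M used in the text.

```python
import bisect
import random

def nearest_1d(sorted_xs, x):
    """Return the element of the sorted list `sorted_xs` closest to `x`."""
    pos = bisect.bisect_left(sorted_xs, x)
    if pos == 0:
        return sorted_xs[0]
    if pos == len(sorted_xs):
        return sorted_xs[-1]
    left, right = sorted_xs[pos - 1], sorted_xs[pos]
    return left if x - left <= right - x else right

def one_nn_measure(nu_sample, eta, mu_sampler, num_draws, rng):
    """Monte Carlo approximation of the 1NN measure: average eta at the
    nearest neighbor (within nu_sample) of points drawn from mu."""
    sorted_xs = sorted(nu_sample)
    total = 0.0
    for _ in range(num_draws):
        total += eta(nearest_1d(sorted_xs, mu_sampler(rng)))
    return total / num_draws

rng = random.Random(1)
estimates = {}
for n in (16, 256, 4096):
    nu_sample = [rng.betavariate(2, 1) for _ in range(n)]  # assumed nu
    estimates[n] = one_nn_measure(
        nu_sample, lambda x: x, lambda r: r.random(), 20000, rng
    )
print(estimates)  # values should approach the assumed target 0.5 as n grows
```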
Example 2.
Let be Gaussian, be Gaussian, and . The estimation of is natural as the second moment of the unobserved population. This is an example where both densities are fully supported over . The density ratio, , is not only unbounded but growing exponentially. We can see from Figure 2 that near the origin the spread of is low, but far from the origin there is a larger spread and downward bias (in the finite sample). Due to this bias, the convergence of this example to its expectation is somewhat slower with a relative error at 10K samples (Table 2).
Example 3.
In order to see the effect of non-uniform convergence of the LDT we will use a pedagogical construction, the fat Cantor set (the Smith-Volterra-Cantor set). This set is constructed by the following algorithm: start with ; for each remove the middle of the remaining intervals, thereby splitting each interval into two parts. In simulation, we only perform 5 iterations due to our fine grid. The remaining set has measure of but does not contain any open intervals (it is entirely boundary and has no interior). Let be uniform and be uniform. We can make and so . This example has bounded over a compact domain, and bounded .
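The construction can be sketched with exact rational arithmetic. The fraction removed at each step did not survive extraction, so the code below assumes the standard Smith-Volterra-Cantor rule: at step $k$, an open interval of length $4^{-k}$ is removed from the middle of every remaining interval. The function name is hypothetical.

```python
from fractions import Fraction

def fat_cantor_intervals(iterations):
    """Intervals remaining after the given number of removal steps,
    starting from [0, 1] (standard Smith-Volterra-Cantor rule assumed)."""
    intervals = [(Fraction(0), Fraction(1))]
    for k in range(1, iterations + 1):
        gap = Fraction(1, 4 ** k)  # length removed from each interval at step k
        next_intervals = []
        for a, b in intervals:
            mid = (a + b) / 2
            next_intervals.append((a, mid - gap / 2))
            next_intervals.append((mid + gap / 2, b))
        intervals = next_intervals
    return intervals

intervals = fat_cantor_intervals(5)
measure = sum(b - a for a, b in intervals)
print(len(intervals), measure)
```

Under this rule the total length after $k$ steps is $1/2 + 2^{-(k+1)}$, which decreases to $1/2$ in the limit; with 5 steps the remaining 32 intervals still cover slightly more than half of the unit interval.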
The fractal nature of this example causes non-uniform convergence of the LDT because we know that the measure of a small enough interval around approaches either 0 (if ) or (if ). However, the Fat Cantor set looks from afar as if it does have low and high density regions, and this is manifested in the fact that for within small intervals that were removed, is non-zero. In the subfigure to the right of Figure 3, we can see the density ratio is 0 in small intervals but, because these are surrounded by elements within , the Voronoi cells have large measure, . Due to the fractal nature of the fat Cantor set, for any sample size , this effect will always be manifested at some location at a small enough scale.
Regardless of this non-uniform convergence of the LDT, we observe that converges to its limit, because these regions where the LDT has not yet converged are increasingly small. We see in Table 2 that with 10K samples, we achieve a relative error of .
Appendix B Lemmata
Lemma 4 ([4] Lemma 9.1).
Suppose that are drawn iid from a measure with a density in . Let be a Borel measurable function such that . Then
[TABLE]
where is a universal constant depending on dimension .
Lemma 5 (Stone’s Lemma, [25]; [4] Lemma 10.7).
Suppose that are drawn iid from (a measure over the Borel -field on ), and let denote the NN of within . Let denote a probability weight vector such that . Let be a Borel measurable function such that . Then
[TABLE]
where is the minimum number of cones of angle that cover .
Lemma 6 ([4] Lemma 10.2).
Suppose that are drawn iid from (a measure over the Borel -field on ), and let denote the NN of within . Let , and be a Borel measurable function such that . Suppose that the following conditions hold:
(i) There is a such that for every Borel measurable , for all ,
[TABLE]
(ii) There is a constant such that for all ,
[TABLE]
(iii) For all ,
[TABLE]
Then
[TABLE]
Lemma 7.
Suppose that are drawn iid from (a measure over the Borel -field on ), and let denote the NN of within . Let , and be a Borel measurable function such that , then
[TABLE]
Proof.
Let , then letting we see that condition (i) in Lemma 6 holds by Lemma 5. (ii) holds trivially by selecting . (iii) holds by Lemma 2.2 in [4] which states that for , almost surely (for ). ∎
Appendix C Proofs of main results
Proof of Lemma 1.
Let be the set of all such that for some , , and call the set of all such balls, . Let be
[TABLE]
Since it is the union of open sets, is open, and by the Lindelöf Covering Theorem, there is a countable subset of , , such that the interiors of the balls cover . Thus, by countable subadditivity of measures,
[TABLE]
We have that is -porous which means that there is an such that every element there is an such that where for any , there exists a with
[TABLE]
To see this let be on the segment between and in the above construction and . By the Lebesgue differentiation theorem, porous sets have Lebesgue measure 0 [26]. Hence, since -porous sets are countable unions of porous sets, by countable subadditivity, and . Let , and we have that .
We will show (2) by supposing its contradiction, that for some and , for every there exists a such that and . This implies that there exists a sequence of points , such that and . Define then we have that as . By the Bolzano-Weierstrass theorem there exists an accumulation point of , with (by continuity of ). The interior of is contained in for all . By absolute continuity with respect to Lebesgue measure, by the fact that . This contradicts the fact that . ∎
Proof of Lemma 2.
Throughout, let be some constant and be a Lebesgue point as in Lemma 1 (for ) within . Let and notice that
[TABLE]
where . By integration by parts,
[TABLE]
By the Lebesgue differentiation theorem,
[TABLE]
Notice that if , the sets, converges regularly to , in the sense that
[TABLE]
and by the doubling property of ,
[TABLE]
where is a constant based on dimension, . Hence,
[TABLE]
as again by the LDT. Because we have that,
[TABLE]
For let be such that for any with ,
[TABLE]
Let guaranteed in Lemma 1 (ii) based on .
[TABLE]
as .
Thus, if we denote ,
[TABLE]
Because follows a uniform distribution then
[TABLE]
for . Similarly,
[TABLE]
Hence, by setting arbitrarily small,
[TABLE]
In order to establish (3), we will follow a similar procedure. Let independently.
[TABLE]
where . Define
[TABLE]
then . As before, by integration by parts,
[TABLE]
By (6), for any we can select a such that for any ,
[TABLE]
Let be selected as before,
[TABLE]
where . The elements in the maximum are independent uniform random variables, and so the maximum has a distribution for uniform .
[TABLE]
Also, as before
[TABLE]
Finally, by setting arbitrarily small
[TABLE]
∎
Alternative proof of Theorem 4.
Consider
[TABLE]
We have pointwise convergence by (2),
[TABLE]
almost everywhere, and the RHS has expectation . We can establish dominated convergence by
[TABLE]
where are Hölder conjugates. By assumption the first term on the RHS is bounded, what remains is to bound the second term. This can be established using theory developed primarily in [25]. A direct application of Lemma 4 to (7) concludes our proof. ∎
Proof of Lemma 3.
We will appeal to the Efron-Stein inequality, which states the following: Let be an iid copy of and , then for any function
[TABLE]
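For reference, the standard form of the Efron-Stein inequality is written out below. The notation is an assumption, since the displayed statement of the source did not survive extraction: $X = (X_1, \dots, X_n)$ has independent coordinates, $X_i'$ is an independent copy of $X_i$, and $X^{(i)}$ denotes $X$ with its $i$-th coordinate replaced by $X_i'$.

```latex
\operatorname{Var}\bigl( f(X) \bigr)
  \le \frac{1}{2} \sum_{i=1}^{n}
      \mathbb{E}\Bigl[ \bigl( f(X) - f(X^{(i)}) \bigr)^{2} \Bigr],
\qquad
X^{(i)} = (X_1, \dots, X_{i-1}, X_i', X_{i+1}, \dots, X_n).
```

When the coordinates are iid and $f$ is symmetric in its arguments, every term of the sum is equal, so the bound reduces to $n/2$ times a single term.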
Let and denote as the 1NN measure formed from the data, . Due to exchangeability,
[TABLE]
Let and denote the 1NN within and the 1NN measure formed from the reduced data . We have that
[TABLE]
In order for to differ from it must be that and . Thus,
[TABLE]
and so,
[TABLE]
Let be the density ratio. Considering this term,
[TABLE]
∎
Proof of Theorem 2.
The random vector is multinomial conditional on . The MSE
[TABLE]
The second term converges to 0 by Theorem 1.
[TABLE]
The conditional variance is
[TABLE]
Hence, under Assumptions 1, 2,
[TABLE]
by Theorem 4. (Notice that Theorem 4 only requires the moment bound of the test function, which is satisfied for by Assumption 2.) ∎
Proof of Theorem 3.
Define ( in the noiseless setting). Let be all of the covariates,
[TABLE]
The last term converges to 0 by Theorem 2. The inner term is dominated because
[TABLE]
because . Consider
[TABLE]
Because is binomial for fixed we have that,
[TABLE]
Since we have that if . We have by Lemma 2 and dominated convergence,
[TABLE]
Hence,
[TABLE]
∎
References
- [1] A. Abadie, J. L. Herr, G. Imbens, and D. M. Drukker. Nnmatch: Stata module to compute nearest-neighbor bias-corrected estimators, 2004.
- [2] A. Abadie and G. W. Imbens. Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1):235–267, 2006.
- [3] A. Abadie and G. W. Imbens. Bias-corrected matching estimators for average treatment effects. Journal of Business & Economic Statistics, 29(1):1–11, 2011.
- [4] G. Biau and L. Devroye. Lectures on the nearest neighbor method. Springer, 2015.
- [5] M. Busso, J. DiNardo, and J. McCrary. New evidence on the finite sample properties of propensity score reweighting and matching estimators. Review of Economics and Statistics, 96(5):885–897, 2014.
- [6] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
- [7] Z. Dang, C. Deng, X. Yang, K. Wei, and H. Huang. Nearest neighbor matching for deep clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13693–13702, 2021.
- [8] L. Devroye, L. Györfi, G. Lugosi, and H. Walk. On the measure of Voronoi cells. Journal of Applied Probability, 54(2):394–408, 2017.
