On $L_2$-consistency of nearest neighbor matching

James Sharpnack

arXiv:1902.02408·math.ST·June 2, 2022·IEEE Trans. Inf. Theory

TL;DR

This paper proves that nearest neighbor matching (NNM) is $L_2$-consistent in finite dimensions without requiring smoothness or boundedness, aiding statistical inference with biased samples.

Contribution

It establishes the $L_2$-consistency of NNM under minimal assumptions, expanding understanding of its theoretical properties in biased sampling scenarios.

Findings

1. NNM is $L_2$-consistent without smoothness or boundedness assumptions

2. Discussion of applications and limitations of NNM

3. Comparison of NNM with inverse probability weighting

Abstract

Biased sampling and missing data complicate statistical problems ranging from causal inference to reinforcement learning. We often correct for biased sampling of summary statistics with matching methods and importance weighting. In this paper, we study nearest neighbor matching (NNM), which makes estimates of population quantities from biased samples by substituting unobserved variables with their nearest neighbors in the biased sample. We show that NNM is $L_{2}$-consistent in the absence of smoothness and boundedness assumptions in finite dimensions. We discuss applications of NNM, outline the barriers to generalizing this work to separable metric spaces, and compare this result to inverse probability weighting.

Tables

Table 1: A comparison of NNM and IPW for the t-distribution example, where the true value is $\mathbb{E}_{\mu}|X|\approx 1.356$. IPW is unbiased, so the variance equals the mean square error (MSE).

$n$	16	64	256	1024	4096	16384
NNW Mean	0.990	1.155	1.236	1.284	1.312	1.330
NNW MSE	0.149	0.045	0.016	0.006	0.002	0.001
IPW Var	1.178	0.485	0.097	0.048	0.032	0.032

Table 2: Monte Carlo samples of $\mathbb{Q}_{1}(\eta)$ with $n$ samples from $\nu$ for the three examples. For comparison purposes, a sample from $\mu$ is provided, the expectation of which is the limit of $\mathbb{Q}_{1}(\eta)$.

Example	$n=10^{2}$	$n=10^{3}$	$n=10^{4}$	$n=10^{5}$	$\sim\mu$
1. Beta	1.472	1.483	1.492	1.493	1.5
2. Gaussian	1.774	1.991	2.063	2.083	2.1
3. Fat Cantor	1.587	1.842	1.970	1.997	2

Equations

$$G:=\mathbb{E}[Y\mid{\rm missing}]=\int y\cdot f_{Y|X}(y|x)\,\mu(x)\,{\rm d}x.$$

$$S_{j}:=\left\{x\in\mathbb{R}^{p}:\|X_{j}-x\|=\min_{k\in[n]}\|X_{k}-x\|\right\},$$

$$\mathbb{Q}_{1}(\eta):=\sum_{j=1}^{n}M(S_{j})\cdot\eta(X_{j})=\int\eta(X_{(1)}(x))\,\mu(x)\,{\rm d}x,$$

$$D_{q_{0}}(\mu||\nu)=\frac{1}{q_{0}-1}\ln\int\left(\frac{\mu(x)}{\nu(x)}\right)^{q_{0}}\nu(x)\,{\rm d}x.$$

$$D_{q_{0}}(\mu||\nu)<\infty.$$

$$\int|\eta(x)|^{2q_{1}}\nu(x)\,{\rm d}x<\infty.$$

$$\mathbb{Q}_{1}(\eta)\rightarrow\int\eta(x)\cdot\mu(x)\,{\rm d}x,$$

$$\hat{G}=\sum_{j=1}^{n}\hat{M}(S_{j})\cdot\eta(X_{j})\rightarrow\int\eta(x)\cdot\mu(x)\,{\rm d}x,$$

$$\hat{G}=\sum_{j=1}^{n}\hat{M}(S_{j})\cdot Y_{j}\rightarrow G,$$

$$\lim_{n\rightarrow\infty}n\,\mathbb{E}[M(S_{1})\mid X_{1}=x]=\frac{\mu(x)}{\nu(x)},$$

$$\limsup_{n\rightarrow\infty}n^{2}\,\mathbb{E}[M^{2}(S_{1})\mid X_{1}=x]\leq 2\left(\frac{\mu(x)}{\nu(x)}\right)^{2}.$$

$$\begin{aligned}\mathbb{E}[M(S_{1})\mid X_{1}=x]&=\mathbb{P}\{X_{0}\in S_{1}\mid X_{1}=x\}\\&=\mathbb{P}\left\{\cap_{i=2}^{n}\{X_{i}\notin B(X_{0},\|X_{0}-x\|)\}\right\}=\mathbb{E}[(1-Z(x))^{n-1}],\end{aligned}$$

$$\begin{aligned}N(B(X_{0},\|X_{0}-x\|))&\approx\nu(x)\lambda(B(X_{0},\|X_{0}-x\|))\\&=\nu(x)\lambda(B(x,\|X_{0}-x\|))\approx\frac{\nu(x)}{\mu(x)}M(B(x,\|X_{0}-x\|)),\end{aligned}$$

$$\lim_{n\rightarrow\infty}\mathbb{E}[\mathbb{Q}_{1}(\eta)]=\int\eta(x)\mu(x)\,{\rm d}x.$$

$$\begin{aligned}\mathbb{E}[\mathbb{Q}_{1}(\eta)]-\int\eta(x)\mu(x)\,{\rm d}x&=\mathbb{E}\left[\int(\eta(X_{(1)}(x))-\eta(x))\mu(x)\,{\rm d}x\right]\\&\leq\left(\int|\eta(X_{(1)}(x))-\eta(x)|^{q_{1}}\nu(x)\,{\rm d}x\right)^{1/q_{1}}\cdot\left(\int\left(\frac{\mu(x)}{\nu(x)}\right)^{q_{0}}\nu(x)\,{\rm d}x\right)^{1/q_{0}}\end{aligned}$$

$$\int|\eta(X_{(1)}(x))-\eta(x)|^{q_{1}}\nu(x)\,{\rm d}x\rightarrow 0.$$

$$\mathbb{E}[\mathbb{Q}_{1}(\eta)]=\mathbb{E}\left[\eta(X_{1})\cdot n\,\mathbb{E}[M(S_{1})\mid X_{1}]\right].$$

$$\eta(X_{1})\cdot n\,\mathbb{E}[M(S_{1})\mid X_{1}]\rightarrow\frac{\mu(X_{1})}{\nu(X_{1})}\eta(X_{1}),$$

$$\mathbb{V}(\mathbb{Q}_{1}(\eta))\leq 2\left(\int\left(\frac{\mu(x)}{\nu(x)}\right)^{q_{0}}\nu(x)\,{\rm d}x\right)^{\frac{1}{q_{0}}}\left(\int|\eta(X_{(1)}(x))-\eta(X_{(2)}(x))|^{2q_{1}}\nu(x)\,{\rm d}x\right)^{\frac{1}{q_{1}}}.$$

$$\mathbb{V}(\mathbb{Q}_{1}(\eta))\rightarrow 0,\quad\textrm{as }n\rightarrow\infty.$$

$$\mathbb{E}|\eta(X_{(1)}(X))-\eta(X_{(2)}(X))|^{2q_{1}}\rightarrow 0.$$

$$\int\left(\eta(x)\cdot\frac{\mu(x)}{\nu(x)}\right)^{2}\nu(x)\,{\rm d}x<\infty,$$

$$D_{2q_{0}}(\mu||\nu)<\infty,\quad\int|\eta(x)|^{2q_{1}}\nu(x)\,{\rm d}x<\infty,$$

$$\int\left(\eta(x)\cdot\frac{\mu(x)}{\nu(x)}\right)^{2}\nu(x)\,{\rm d}x\geq C_{k}\int x^{2}\cdot(1+x^{2})\,\nu(x)\,{\rm d}x=\infty,$$

$$\int\left(\frac{\mu(x)}{\nu(x)}\right)^{q_{0}}\nu(x)\,{\rm d}x\leq C_{k}^{\prime}\int(1+x^{2})^{3/2}\,\nu(x)\,{\rm d}x<\infty,$$

$$\int|\eta(X_{(1)}(x))-\eta(x)|\,\nu(x)\,{\rm d}x\rightarrow 0.$$

$$G:=\mathbb{E}[g(X,Z)\mid{\rm missing}],$$

$$\mathbb{E}|g(X_{(1)}(X))|^{q}\leq\gamma_{p}\,\mathbb{E}|g(X)|^{q},$$

$$\mathbb{E}\sum_{k=1}^{n}v_{k}|g(X_{(k)}(X))|\leq 2\gamma_{p}\,\mathbb{E}|g(X)|,$$

Taxonomy

Topics: Statistical Methods and Inference · Bayesian Modeling and Causal Inference · Bayesian Methods and Mixture Models

Full text

On $L_{2}$ -consistency of nearest neighbor matching

James Sharpnack

Amazon AWS

Work done prior to joining Amazon.

1 Introduction

Issues of representation in sampling have plagued data analysis ever since the first survey was taken. Biased sampling can be cast as a missing data problem, where data from the population of interest are partially missing, and data from the biased sample are non-missing. Let $Y\in\mathbb{R}$ be the variable of interest, which is missing at random; we would like to estimate its population mean. Suppose that we have covariates $X\in\mathbb{R}^{p}$ and we observe one iid sample, $\mathcal{X}_{M}$, from the missing population and a sample of $(X,Y)$ pairs from the non-missing biased distribution (let $\mathcal{X}_{N}$ be the corresponding sample of $X$ values). When we know the missing density, $X\sim\mu$, and the non-missing density, $X\sim\nu$ (as a mnemonic, think $\mu$issing and $\nu$on-missing), we can estimate population level statistics with inverse probability weighting (IPW) and trimmed variants [15, 17]. When $\mu,\nu$ are unknown, which is our setting, we typically use matching methods or we estimate the density ratio and plug it into IPW. These methods construct a weight $W_{j}$ for each element $X_{j}\in\{X_{1},\ldots,X_{n}\}=\mathcal{X}_{N}$ that depends on $\mathcal{X}_{M},\mathcal{X}_{N}$, and form an estimate $\hat{G}=\sum_{j=1}^{n}W_{j}Y_{j}$. In this work, we study one of the simplest matching methods, nearest neighbor matching (NNM), which sets $W_{j}$ to be the proportion of elements in $\mathcal{X}_{M}$ for which $X_{j}$ is the nearest neighbor within $\mathcal{X}_{N}$. We will show that it is consistent under minimal conditions.

Contributions. Our theory is broken down into three settings of increasing difficulty and realism: (1) known $\mu$, noiseless $Y$, where the only source of randomness is the non-missing sample $\mathcal{X}_{N}$; (2) unknown $\mu$, noiseless $Y$; (3) unknown $\mu$, noisy $Y$, which is the standard setting of NNM. One key lemma used to establish (3) concerns the $\mu$-measure of Voronoi cells from our biased sample, and we provide a more complete characterization which may be of separate interest. Further, we discuss the conditions and contrast NNM with IPW, highlight the barriers to generalizing the theory beyond finite dimensional spaces, and discuss applications to missing data problems.

1.1 Main Theoretical Results

If the missingness is completely at random (MCAR), then there is no dependence between the missingness and the variables $X,Y$. In this case, nothing special needs to be done in order to consistently estimate the mean of the missing $Y$; we can just compute its empirical counterpart on the non-missing data. Instead, we will assume we are in the more realistic situation that our data are missing at random (MAR), which means that the missingness is independent of $Y$ conditional on the covariate $X$.

We observe iid covariates $\mathcal{X}_{M}\subset\mathbb{R}^{p}$ from density $\mu$ , and $\mathcal{X}_{N}=\{X_{1},\ldots,X_{n}\}\subset\mathbb{R}^{p}$ from density $\nu$ . Throughout we will assume that the $X$ variables are continuous and $p$ -dimensional. We will discuss later the numerous mathematical challenges of generalizing beyond this setting. For the non-missing data, we observe $Y_{1},\ldots,Y_{n}$ from the common distribution of $Y_{j}\sim f_{Y|X}(.|X_{j})$ (due to the MAR assumption). Our goal is to estimate,

$$G:=\mathbb{E}[Y\mid{\rm missing}]=\int y\cdot f_{Y|X}(y|x)\,\mu(x)\,{\rm d}x.$$

NNM can be expressed using Voronoi cells. A Voronoi cell contains the points in $\mathbb{R}^{p}$ such that $X_{j}$ is the closest member of $\mathcal{X}_{N}$ ,

$$S_{j}:=\left\{x\in\mathbb{R}^{p}:\|X_{j}-x\|=\min_{k\in[n]}\|X_{k}-x\|\right\},$$

where $[n]:=\{1,\ldots,n\}$ (we will ignore ties because they have measure zero). Let $\hat{M}(S)=|\mathcal{X}_{M}\cap S|/|\mathcal{X}_{M}|$ be the proportion of $\mathcal{X}_{M}$ within $S$, and write $m=|\mathcal{X}_{M}|$ for the size of the missing sample. NNM estimates $G$ by $\hat{G}=\sum_{j=1}^{n}W_{j}Y_{j}$ where $W_{j}=\hat{M}(S_{j})$. Throughout, define the measures $M(S)=\int_{S}\mu(x){\rm d}x$ and $N(S)=\int_{S}\nu(x){\rm d}x$.
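The estimator just defined can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not the implementation used in the paper: the function names `nnm_weights` and `nnm_estimate` are hypothetical, distances are Euclidean, and all pairwise distances are materialized, which is only reasonable for small samples.

```python
import numpy as np

def nnm_weights(x_missing, x_nonmissing):
    """W_j: share of missing-sample points whose nearest neighbor in x_nonmissing is row j.

    x_missing has shape (m, p) and x_nonmissing has shape (n, p).
    Ties (a measure-zero event for continuous X) go to the lowest index.
    """
    x_missing = np.asarray(x_missing, dtype=float)
    x_nonmissing = np.asarray(x_nonmissing, dtype=float)
    # Pairwise squared Euclidean distances, shape (m, n).
    diff = x_missing[:, None, :] - x_nonmissing[None, :, :]
    sq_dist = np.einsum("mnp,mnp->mn", diff, diff)
    # Index of the Voronoi cell S_j that each missing point falls into.
    nearest = sq_dist.argmin(axis=1)
    counts = np.bincount(nearest, minlength=x_nonmissing.shape[0])
    return counts / x_missing.shape[0]

def nnm_estimate(x_missing, x_nonmissing, y_nonmissing):
    """G-hat = sum_j W_j * Y_j."""
    return float(nnm_weights(x_missing, x_nonmissing) @ np.asarray(y_nonmissing, dtype=float))

# Tiny hand-checkable example with p = 1 (values are made up for illustration).
x_n = np.array([[0.0], [10.0]])        # non-missing covariates
y_n = np.array([1.0, 4.0])             # observed responses
x_m = np.array([[1.0], [2.0], [9.0]])  # missing-population covariates
weights = nnm_weights(x_m, x_n)
g_hat = nnm_estimate(x_m, x_n, y_n)
print(weights, g_hat)
```

In the toy example two of the three missing points are closest to the first non-missing point, so the weights are $(2/3, 1/3)$ and the estimate is $(2/3)\cdot 1+(1/3)\cdot 4=2$.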

Known $\mu$ , noiseless $Y$ . Throughout we will let $\eta(X)=\mathbb{E}[Y|X]$ , the true regression function. We require that $\eta$ is integrable, and will place moment conditions on it later. In the noiseless setting $Y=\eta(X)$ almost surely (AS). If we know $\mu$ then we do not need to rely on a finite sample $\mathcal{X}_{M}$ , instead we use the weight $W_{j}=M(S_{j})$ . In this special case, we will call the NNM estimator the 1NN measure of $\eta$ and denote it,

$$\mathbb{Q}_{1}(\eta):=\sum_{j=1}^{n}M(S_{j})\cdot\eta(X_{j})=\int\eta(X_{(1)}(x))\,\mu(x)\,{\rm d}x,$$

where $X_{(1)}(x)$ is the nearest neighbor of $x$ within $\mathcal{X}_{N}$ . One should think of $\mathbb{Q}_{1}$ as a biased sampling analogue to the empirical measure (typically denoted $\mathbb{P}_{n}$ ).
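For $p=1$ the 1NN measure can be computed exactly, because each Voronoi cell is an interval whose endpoints are midpoints between consecutive sample points, so $M(S_{j})$ is a difference of values of the CDF of $\mu$. The sketch below assumes an illustrative pair not taken from the paper: $\nu$ uniform on $[0,1]$, $\mu$ the Beta(2,1) density $2x$, and $\eta(x)=x$, so that $\int\eta\mu=2/3$. The names `one_nn_measure_1d` and `beta21_cdf` are hypothetical.

```python
import numpy as np

def one_nn_measure_1d(x_sample, eta, mu_cdf):
    """Q_1(eta) = sum_j M(S_j) * eta(X_j) for a 1-D sample, with M given through the CDF of mu."""
    xs = np.sort(np.asarray(x_sample, dtype=float))
    # Voronoi cells on the line: boundaries are midpoints between neighbors.
    mids = (xs[:-1] + xs[1:]) / 2.0
    edges = np.concatenate(([-np.inf], mids, [np.inf]))
    cell_mass = np.diff(mu_cdf(edges))  # M(S_j) for each sorted X_j
    return float(np.sum(cell_mass * eta(xs)))

def beta21_cdf(t):
    # CDF of Beta(2, 1), whose density is 2t on [0, 1] (assumed example for mu).
    return np.clip(t, 0.0, 1.0) ** 2

rng = np.random.default_rng(1)
q1_values = {}
for n in (100, 1000, 10000):
    sample = rng.uniform(size=n)  # X_1, ..., X_n drawn from nu = Uniform(0, 1)
    q1_values[n] = one_nn_measure_1d(sample, lambda x: x, beta21_cdf)
    print(n, q1_values[n])  # should approach 2/3 as n grows
```

In higher dimensions the cell masses $M(S_{j})$ generally have no closed form, which is one reason the sample-based weights $\hat{M}(S_{j})$ are the practical choice.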

Recall that the Renyi divergence for $q_{0}>1$ is

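$$D_{q_{0}}(\mu||\nu)=\frac{1}{q_{0}-1}\ln\int\left(\frac{\mu(x)}{\nu(x)}\right)^{q_{0}}\nu(x)\,{\rm d}x.$$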

Furthermore, we can take $q_{0}\rightarrow 1$ to obtain the KL-divergence ( $D_{1}$ ). Throughout we will assume that $q_{0},q_{1}\geq 1$ are Hölder conjugates ( $q_{0}^{-1}+q_{1}^{-1}=1$ ).
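To get a feel for the Renyi divergence, it can be approximated by numerical integration in one dimension. The sketch below uses a plain midpoint rule and an assumed example (not from the paper): $\mu$ is the Beta(2,1) density $2x$ and $\nu$ is uniform on $[0,1]$, for which $\int(\mu/\nu)^{q_{0}}\nu=2^{q_{0}}/(q_{0}+1)$ in closed form. The function name `renyi_divergence` is hypothetical.

```python
import math

def renyi_divergence(mu_pdf, nu_pdf, q0, lo, hi, steps=100_000):
    """Midpoint-rule approximation of D_{q0}(mu || nu) over [lo, hi], for q0 > 1."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        nu = nu_pdf(x)
        if nu > 0.0:  # integrate only where nu puts mass
            total += (mu_pdf(x) / nu) ** q0 * nu * h
    return math.log(total) / (q0 - 1.0)

def mu_pdf(x):
    return 2.0 * x  # Beta(2, 1) density on [0, 1]

def nu_pdf(x):
    return 1.0      # Uniform(0, 1) density

d2 = renyi_divergence(mu_pdf, nu_pdf, 2.0, 0.0, 1.0)
d3 = renyi_divergence(mu_pdf, nu_pdf, 3.0, 0.0, 1.0)
print(d2, math.log(4.0 / 3.0))        # numerical value vs closed form for q0 = 2
print(d3, math.log(8.0 / 4.0) / 2.0)  # numerical value vs closed form for q0 = 3
```

Here the density ratio is bounded, so every order is finite; the interesting cases for the assumptions that follow are those where the ratio is unbounded but the integral still converges.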

Assumption 1.

Let $q_{0}\geq 1$ be a constant then assume that the Renyi divergence is finite,

$$D_{q_{0}}(\mu||\nu)<\infty.$$

Assumption 2.

Let $q_{1}\geq 1$ . The test function $\eta$ is measurable and has finite $2q_{1}$ moment,

$$\int|\eta(x)|^{2q_{1}}\nu(x)\,{\rm d}x<\infty.$$

Notice that a bounded $D_{q_{0}}(\mu||\nu)$ is much less restrictive than assuming that the density ratio is bounded uniformly. Furthermore, we make no smoothness assumptions on $\eta$.

Theorem 1.

Under Assumptions 1, 2, the 1NN measure is consistent for $\mu$ , namely,

$$\mathbb{Q}_{1}(\eta)\rightarrow\int\eta(x)\cdot\mu(x)\,{\rm d}x,$$

in $L^{2}$ norm as $n\rightarrow\infty$ .

Unknown $\mu$ , noiseless $Y$ . The main difference between this setting and the previous one is that the NNM weights require an estimate $\hat{M}(S_{j})$ of $M(S_{j})$ . The following is a relatively simple corollary to Theorem 1, but we state it here because it highlights the additional assumptions in this setting.

Theorem 2.

Under Assumptions 1, 2, the NNM estimate is consistent for noiseless $Y$ , namely,

$$\hat{G}=\sum_{j=1}^{n}\hat{M}(S_{j})\cdot\eta(X_{j})\rightarrow\int\eta(x)\cdot\mu(x)\,{\rm d}x,$$

in $L^{2}$ norm as $n,m\rightarrow\infty$ .

Unknown $\mu$, noisy $Y$. In this problem, we assume that $Y_{j}=\eta(X_{j})+\epsilon_{j}$ where $\{\epsilon_{j}\}_{j=1}^{n}$ are independent with mean zero and variance bounded by $V<\infty$. This result hinges on our characterization of the Voronoi cells, and requires that the Chi-square divergence is finite.

Theorem 3.

Under Assumptions 1, 2 for $q_{0}=q_{1}=2$ ,

$$\hat{G}=\sum_{j=1}^{n}\hat{M}(S_{j})\cdot Y_{j}\rightarrow G,$$

in $L^{2}$ norm as $n,m\rightarrow\infty$ .

In the following section, we will contrast and relate these results to prior work.

1.2 Comparison to prior work

Matching methods for causal inference and missing data are appealing due to their relative simplicity, interpretability, and computational efficiency [14]. Matching can be done with replacement, where multiple missing samples can match to the same non-missing samples, such as NNM, or without replacement, such as optimal matching [21]; it has been shown that in some circumstances matching without replacement is inconsistent [22]. Increasingly, NNM has surfaced in machine learning applications such as covariate shift correction in classification [16], model based conditional independence testing [23], and deep clustering [7]. NNM is amenable to massive data applications due to fast approximate NN indexing and retrieval (for example, [20, 18]), and software for NNM and extensions have been developed [1].

The statistical efficiency of NNM has been studied in the statistics and econometrics literature. In [2], it was shown that, despite its popularity, in more than one dimension NNM (and other similar matching methods) has a bias term that converges at a rate of $n^{-1/p}$, markedly worse than the optimal $n^{-1/2}$ rate under Lipschitz continuity assumptions and bounded propensity score (the conditional probability of missingness). For this reason, corrective measures have been studied, such as bias-correction [3], where an overparametrized linear model is used to construct an additive bias estimate. Other works responding to the negative results of [2] form an estimate of the propensity score and use NNM in this one-dimensional space [11, 5]; however, this requires an accurate estimate of the propensity score. To the best of our knowledge, it remained unknown whether NNM is even consistent without smoothness assumptions or a bounded density ratio (the results in this work).

As we will see, Theorem 1 relies on existing results from nearest neighbor regression theory. To illustrate this under stronger assumptions, suppose that $\eta$ is $L$-Lipschitz and bounded, so that $|\eta(X_{(1)}(x))-\eta(x)|\leq L\|X_{(1)}(x)-x\|$. We know by classical studies of KNN [6] that the 1NN approaches the test point, and boundedness gives us dominated convergence. Since that work, much has been discovered about KNN that we can potentially build on. Of direct relevance to this work is [10], which studies the $L_{2}$ consistency of KNN regression in general metric spaces. Intermediate results from [12, 13] are relevant as well, particularly the density of Lipschitz functions in $L_{1}(\nu)$ for separable metric spaces. We will argue in Section 4 that, while promising, these results are insufficient to prove our desired theorems.

We will see that to prove the final result, Theorem 3, we will require a characterization of the Voronoi cells. [8] provides an analogous result for unbiased sampling, and we extend this result to the biased sampling setting. They find that $\mathbb{E}[M(S_{j})|X_{j}=x]\rightarrow 1$ when $\mu=\nu$ and bound the limiting second moment. We extend this to find that $\mathbb{E}[M(S_{j})|X_{j}=x]\rightarrow\mu(x)/\nu(x)$ , which implies that NNM is unbiased for IPW. As mentioned, by [2] we know that this bias converges at a suboptimal rate under smoothness and boundedness assumptions. It is unknown, and outside of the scope of this work, if without these assumptions the optimal rate remains $n^{-1/2}$ or if this is information theoretically impossible.

2 Asymptotic measure of Voronoi cells

The main observation that we make in this section is that the $\mu$-measure of a Voronoi cell of samples from $\nu$ (scaled by $n$, in expectation) approaches the density ratio. These results rely heavily on the finite dimensional setting, and they are used to prove NNM consistency in the unknown $\mu$, noisy $Y$ case (Theorem 3). In fact, that result only requires Lemma 2 (3), but we provide our full result here because it is enlightening. This result can be interpreted as saying that NNM is unbiased for importance sampling in the limit, since $M(S_{j})$ is the expected importance weight. To make this leap, we need to be specific regarding our Lebesgue points, which form a set of $\nu$-measure 1 over which this limit holds. We follow this with our Voronoi cell result.

Lemma 1.

For any probability measure $\Pi$ with a density over $\mathbb{R}^{p}$, there exists a set $\mathcal{X}$ such that $\Pi(\mathcal{X})=1$ and the following properties hold:

(1) For any $x\in\mathcal{X}$ and $x^{\prime}\neq x$, $\Pi(B(x^{\prime},\|x^{\prime}-x\|))>0$; and

(2) for any $x\in\mathcal{X}$ and any $\delta>0$ there exists a $\zeta>0$ depending on $x$ such that if $\Pi(B(x^{\prime},\|x-x^{\prime}\|))\leq\zeta$ for some $x^{\prime}\neq x$ then we have that $\|x-x^{\prime}\|<\delta$.

Lemma 2.

Under Assumption 1, in expectation, the $M$ -measure of a Voronoi cell around $X_{1}$ conditional on $X_{1}$ converges to the density ratio in the limit, namely,

$$\lim_{n\rightarrow\infty}n\,\mathbb{E}[M(S_{1})\mid X_{1}=x]=\frac{\mu(x)}{\nu(x)},\tag{2}$$

for $\nu$ -almost all $x$ (where the Lebesgue points are those described in Lemma 1 for $\Pi=N$ ). Furthermore, we have the following bound,

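$$\limsup_{n\rightarrow\infty}n^{2}\,\mathbb{E}[M^{2}(S_{1})\mid X_{1}=x]\leq 2\left(\frac{\mu(x)}{\nu(x)}\right)^{2}.\tag{3}$$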
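Both statements of Lemma 2 can be checked by simulation in a simple one-dimensional case. The sketch below is an illustration under assumptions not taken from the paper: $\nu$ is uniform on $[0,1]$, $\mu$ is the Beta(2,1) density $2x$, and we condition on $X_{1}=x=0.5$, where $\mu(x)/\nu(x)=1$. The Voronoi cell of $x$ is the interval between the midpoints to its nearest neighbors on each side, and its $\mu$-mass is a difference of squares. The function name `scaled_cell_mass` is hypothetical.

```python
import numpy as np

def scaled_cell_mass(x, n, reps, rng):
    """Draws of n * M(S_1) given X_1 = x, for nu = Uniform(0, 1) and mu = Beta(2, 1)."""
    others = rng.uniform(size=(reps, n - 1))  # X_2, ..., X_n drawn from nu
    # Nearest sample point on each side of x; +/- infinity if there is none,
    # in which case the cell is cut off by the support [0, 1] of mu.
    left = np.where(others < x, others, -np.inf).max(axis=1)
    right = np.where(others > x, others, np.inf).min(axis=1)
    lo = np.clip((x + left) / 2.0, 0.0, 1.0)
    hi = np.clip((x + right) / 2.0, 0.0, 1.0)
    # CDF of Beta(2, 1) is t^2, so M([lo, hi]) = hi^2 - lo^2.
    return n * (hi**2 - lo**2)

rng = np.random.default_rng(2)
draws = scaled_cell_mass(x=0.5, n=1000, reps=2000, rng=rng)
first_moment = float(draws.mean())          # compare with mu(x)/nu(x) = 1
second_moment = float((draws**2).mean())    # compare with the bound 2 * (mu(x)/nu(x))^2 = 2
print(first_moment, second_moment)
```

The first moment should sit near the density ratio and the second moment below the stated bound, up to Monte Carlo error and finite-$n$ effects.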

Remark 1.

The proof, in the appendix, borrows some tricks from the corresponding result in [8], although we must adapt their proof to accommodate $\mu\neq\nu$ . There seems to be an issue with the validity of the proof of Theorem 2.1(i) in [8], particularly how the Lebesgue differentiation theorem applied to fixed points $v$ and $x$ can then translate to the similar result uniformly over $X_{0}$ (which is a draw from $\mu$ ). Our more complete study of the Lebesgue points in Lemma 1 resolves this potential oversight, completing and generalizing the proof.

Proof sketch of Lemma 2.

Recall that $S_{1}$ is the Voronoi cell around $X_{1}$ . As was done in [8] (in the case that $\mu=\nu$ ), we observe that for $X_{0}\sim\mu$ ,

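$$\begin{aligned}\mathbb{E}[M(S_{1})\mid X_{1}=x]&=\mathbb{P}\{X_{0}\in S_{1}\mid X_{1}=x\}\\&=\mathbb{P}\left\{\cap_{i=2}^{n}\{X_{i}\notin B(X_{0},\|X_{0}-x\|)\}\right\}=\mathbb{E}[(1-Z(x))^{n-1}],\end{aligned}$$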

where $Z(x)=N(B(X_{0},\|X_{0}-x\|))$ . In the appendix, we use the Lebesgue differentiation theorem (LDT) and Lemma 1 to make precise the following string of approximations

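$$\begin{aligned}N(B(X_{0},\|X_{0}-x\|))&\approx\nu(x)\lambda(B(X_{0},\|X_{0}-x\|))\\&=\nu(x)\lambda(B(x,\|X_{0}-x\|))\approx\frac{\nu(x)}{\mu(x)}M(B(x,\|X_{0}-x\|)),\end{aligned}\tag{4}$$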

and it is straightforward to see that $Z_{0}(x)=M(B(x,\|X_{0}-x\|))$ has a uniform distribution, which after some derivation gives us (2). In order to establish (3), we follow a similar procedure. ∎
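The "some derivation" in the sketch can be illustrated heuristically. The following is not the appendix proof: it assumes $\mu(x)>0$, treats the approximation $Z(x)\approx c\,U$ as exact, where $c=\nu(x)/\mu(x)$ and $U=Z_{0}(x)$ is uniform on $[0,1]$, and replaces $(1-z)^{n-1}$ by $e^{-(n-1)z}$, which is accurate for the small values of $z$ that dominate the expectation.

```latex
\begin{align*}
n\,\mathbb{E}[M(S_{1})\mid X_{1}=x]
  &= n\,\mathbb{E}\big[(1-Z(x))^{n-1}\big]
   \approx n\,\mathbb{E}\big[e^{-(n-1)\,c\,U}\big]
   = n\int_{0}^{1} e^{-(n-1)\,c\,u}\,{\rm d}u \\
  &= \frac{n}{(n-1)\,c}\Big(1-e^{-(n-1)\,c}\Big)
   \;\longrightarrow\; \frac{1}{c}=\frac{\mu(x)}{\nu(x)}
   \qquad (n\to\infty).
\end{align*}
```

The limit matches the density ratio in the first statement of Lemma 2; making the two approximations rigorous is exactly where the Lebesgue differentiation theorem and Lemma 1 enter.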

We can see the necessity of the assumption that these measures have densities with respect to the Lebesgue measure from our use of the identity $\lambda(B(X_{0},\|X_{0}-x\|))=\lambda(B(x,\|X_{0}-x\|))$ in (4).

3 Proving $L_{2}$ -consistency of NNM

This section is primarily devoted to proving $L_{2}$ -consistency in the known $\mu$ , noiseless $Y$ case, Theorem 1. In order to prove Theorems 2, 3 we control the additional randomness due to $\mathcal{X}_{M}$ and noisy $Y$ . Both proofs are in the appendix, so the results are not restated here. For Theorem 3, we require Lemma 2. In short, the variance of the summand in $\hat{G}$ , $M(S_{j})Y_{j}$ , is bounded by $VM^{2}(S_{j})$ so we need to control the squared $\mu$ -measure of the Voronoi cells.

We will divide the proof of Theorem 1 into two thrusts: demonstrating asymptotic unbiasedness and diminishing variance. We will discuss in Section 4 how these results might be able to generalize to separable metric spaces.

Asymptotic unbiasedness of $\mathbb{Q}_{1}(\eta)$ follows almost immediately from finite dimensional nearest neighbor theory [4] and Hölder’s inequality. We give a proof sketch here to highlight how it might easily generalize to metric spaces, in the event that new NN regression theory is developed.

Theorem 4.

Let $q_{0},q_{1}$ be Hölder conjugates, suppose Assumption 1 and that $\mathbb{E}|\eta(X_{1})|^{q_{1}}<\infty$ . Then

$$\lim_{n\rightarrow\infty}\mathbb{E}[\mathbb{Q}_{1}(\eta)]=\int\eta(x)\mu(x)\,{\rm d}x.$$

Proof.

By Hölder’s inequality,

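$$\begin{aligned}\mathbb{E}[\mathbb{Q}_{1}(\eta)]-\int\eta(x)\mu(x)\,{\rm d}x&=\mathbb{E}\left[\int(\eta(X_{(1)}(x))-\eta(x))\mu(x)\,{\rm d}x\right]\\&\leq\left(\int|\eta(X_{(1)}(x))-\eta(x)|^{q_{1}}\nu(x)\,{\rm d}x\right)^{1/q_{1}}\cdot\left(\int\left(\frac{\mu(x)}{\nu(x)}\right)^{q_{0}}\nu(x)\,{\rm d}x\right)^{1/q_{0}}\end{aligned}$$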

by Lemma 6 (see [4]) from classical NN theory, we have that

$$\int|\eta(X_{(1)}(x))-\eta(x)|^{q_{1}}\nu(x)\,{\rm d}x\rightarrow 0,\tag{5}$$

which completes the proof (in fact it shows $L_{1}$ convergence). ∎

One can gain a better intuition by proving this using Lemma 2. Specifically, the expected 1NN measure is,

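$$\mathbb{E}[\mathbb{Q}_{1}(\eta)]=\mathbb{E}\left[\eta(X_{1})\cdot n\,\mathbb{E}[M(S_{1})\mid X_{1}]\right].$$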

We have pointwise convergence by (2),

$$\eta(X_{1})\cdot n\,\mathbb{E}[M(S_{1})\mid X_{1}]\rightarrow\frac{\mu(X_{1})}{\nu(X_{1})}\eta(X_{1}),$$

almost everywhere, and the RHS has expectation $\int\eta\mu$ . What remains is to show dominated convergence (see the alternative proof of Theorem 4 in Appendix). We also demonstrate in the Appendix using instructive examples that for finite $n$ the bias is unavoidable. These are typically cases where the LDT has non-uniform convergence (see (4)).

Diminishing variance. We have established that the 1NN measure is asymptotically unbiased, but $L_{2}$-consistency remains to be shown. Our main tool for showing this consistency is the following variance bound, which holds without any assumptions other than those stated within. Lemma 3 demonstrates that as long as $\mu$ and $\nu$ are not too dissimilar, the variance of the 1NN measure is bounded by the discrepancy between the first and second nearest neighbor interpolants.

Lemma 3.

Let $q_{0},q_{1}$ be Hölder conjugates then,

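$$\mathbb{V}(\mathbb{Q}_{1}(\eta))\leq 2\left(\int\left(\frac{\mu(x)}{\nu(x)}\right)^{q_{0}}\nu(x)\,{\rm d}x\right)^{\frac{1}{q_{0}}}\left(\int|\eta(X_{(1)}(x))-\eta(X_{(2)}(x))|^{2q_{1}}\nu(x)\,{\rm d}x\right)^{\frac{1}{q_{1}}}.$$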

The fact that $\mu,\nu$ are densities or even over $\mathbb{R}^{p}$ is actually not required. If one were to replace $\mu/\nu$ with the Radon-Nikodym derivative then the result would still hold. We conclude this subsection by showing that the 1NN measure has diminishing variance.

Theorem 5 (1NN measure variance).

Under Assumptions 1 and 2, we have that

$$\mathbb{V}(\mathbb{Q}_{1}(\eta))\rightarrow 0,\quad\textrm{as }n\rightarrow\infty.$$

Proof.

In Lemma 7 (in Appendix) we establish that under Assumption 2 we have that for $X\sim\nu$ ,

$$\mathbb{E}|\eta(X_{(1)}(X))-\eta(X_{(2)}(X))|^{2q_{1}}\rightarrow 0.$$

This result uses lemmata from the study of nearest neighbors regressors in [4]. Under Assumption 1, we have that $\mathbb{E}(\mu(X)/\nu(X))^{q_{0}}$ is bounded. Applying Lemma 3 we reach our conclusion. ∎
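The two thrusts of the argument (vanishing bias and vanishing variance) can be watched numerically. The sketch below is an illustration under assumptions not taken from the paper: $p=1$, $\nu$ uniform on $[0,1]$, $\mu$ the Beta(2,1) density $2x$, and $\eta(x)=x$ with target $\int\eta\mu=2/3$. It computes $\mathbb{Q}_{1}(\eta)$ exactly for each simulated sample and reports the empirical bias and variance; the helper names are hypothetical.

```python
import numpy as np

def one_nn_measure_1d(x_sample, eta, mu_cdf):
    """Exact Q_1(eta) for a 1-D sample: Voronoi cells are intervals between midpoints."""
    xs = np.sort(np.asarray(x_sample, dtype=float))
    edges = np.concatenate(([-np.inf], (xs[:-1] + xs[1:]) / 2.0, [np.inf]))
    return float(np.sum(np.diff(mu_cdf(edges)) * eta(xs)))

def beta21_cdf(t):
    # CDF of the assumed mu = Beta(2, 1).
    return np.clip(t, 0.0, 1.0) ** 2

def bias_and_variance(n, reps, rng, target=2.0 / 3.0):
    """Empirical bias and variance of Q_1(eta) over `reps` independent samples of size n."""
    vals = np.array([
        one_nn_measure_1d(rng.uniform(size=n), lambda x: x, beta21_cdf)
        for _ in range(reps)
    ])
    return float(vals.mean() - target), float(vals.var())

rng = np.random.default_rng(3)
results = {n: bias_and_variance(n, 300, rng) for n in (20, 80, 320)}
# Mean squared error decomposes as bias^2 + variance.
mse = {n: b * b + v for n, (b, v) in results.items()}
for n in results:
    print(n, results[n], mse[n])
```

This smooth, bounded example is far easier than what the theorems cover, so it only illustrates the decomposition; it says nothing about rates in general.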

4 A closer look at the results and their assumptions

This section demonstrates some implications and potential generalizations of the above results. First, we discuss Assumptions 1, 2 and show that the 1NN measure is $L_{2}$-consistent in situations where IPW is not guaranteed to be. Second, we discuss potential generalizations to separable metric spaces and the major places in which the finite dimensional assumption is required in this work.

4.1 Comparison to Inverse Probability Weighting (IPW)

Comparing consistency conditions. For this comparison, it is sufficient to consider the known $\mu$ , noiseless $Y=\eta(X)$ case. We will see that there are situations in which NNM achieves consistency where IPW is not guaranteed consistency. The IPW estimate can be expressed as $\mathbb{P}_{n}(\tilde{\eta})$ where $\tilde{\eta}(x)=\eta(x)\cdot\mu(x)/\nu(x)$ . The $L_{2}$ weak law of large numbers states that if $\mathbb{V}(\tilde{\eta}(X))<\infty,X\sim\nu$ , i.e. has finite second moment, then we have that $\mathbb{P}_{n}(\tilde{\eta})\rightarrow\int\eta\mu$ . Hence, we can compare this condition,

$$\int\left(\eta(x)\cdot\frac{\mu(x)}{\nu(x)}\right)^{2}\nu(x)\,{\rm d}x<\infty,\tag{IPW Condition}$$

to Assumptions 1, 2. To provide a natural comparison, we use Hölder’s inequality to obtain,

$$D_{2q_{0}}(\mu||\nu)<\infty,\quad\int|\eta(x)|^{2q_{1}}\nu(x)\,{\rm d}x<\infty,$$

as a stronger IPW condition, that is tight for some examples. Notice that this is a stronger condition than the Assumptions 1, 2, leaving us with the result that the 1NN measure is $L_{2}$ -consistent in situations where $L_{2}$ -consistency of IPW is not guaranteed.

Example where NNM is better than IPW. We construct one such example from the Student’s t-distribution. Let $\nu$ be $t_{k}$-distributed with degrees of freedom $k\in(3,4)$, $\mu$ be $t_{k-1}$-distributed, and $\eta(x)=|x|$. The density ratio is $\mu(x)/\nu(x)=C_{k}(1+x^{2}/(k-1))^{-k/2}/(1+x^{2}/k)^{-(k+1)/2}$. Then (IPW Condition) does not hold:

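$$\int\left(\eta(x)\cdot\frac{\mu(x)}{\nu(x)}\right)^{2}\nu(x)\,{\rm d}x\geq C_{k}\int x^{2}\cdot(1+x^{2})\,\nu(x)\,{\rm d}x=\infty,$$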

since $\nu$ does not have finite fourth moment (for a constant $C_{k}$ ). However, we can select $q_{0}=3$ and $q_{1}=3/2$ to see that Assumptions 1, 2 hold since,

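$$\int\left(\frac{\mu(x)}{\nu(x)}\right)^{q_{0}}\nu(x)\,{\rm d}x\leq C_{k}^{\prime}\int(1+x^{2})^{3/2}\,\nu(x)\,{\rm d}x<\infty,$$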

and $\int\eta^{2q_{1}}\nu=\int|x|^{3}\nu(x){\rm d}x<\infty$ , both because $\nu$ has finite third moment. We can see that in simulation this bears out and NNM has lower mean squared error than IPW (Table 1).
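A simulation in the spirit of this example can be sketched as follows. It is not the code behind Table 1: the value $k=3.5$, the sample sizes, the number of repetitions, and the use of a finite sample from $\mu$ (rather than known $\mu$) are all assumptions made for illustration, and the function names are hypothetical. The true value uses the closed form $\mathbb{E}|X|=2\sqrt{d}\,\Gamma((d+1)/2)/(\sqrt{\pi}\,\Gamma(d/2)(d-1))$ for a $t_{d}$ variable with $d>1$.

```python
import math
import numpy as np

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1.0 + x**2 / df) ** (-(df + 1) / 2)

def t_abs_mean(df):
    """E|X| for Student's t with df > 1."""
    return (2 * math.sqrt(df) * math.gamma((df + 1) / 2)
            / (math.sqrt(math.pi) * math.gamma(df / 2) * (df - 1)))

def nnm_estimate_1d(x_missing, x_nonmissing, y_nonmissing):
    """1-D NNM: average the Y of the nearest non-missing neighbor of each missing point."""
    order = np.argsort(x_nonmissing)
    xs, ys = x_nonmissing[order], y_nonmissing[order]
    idx = np.clip(np.searchsorted(xs, x_missing), 1, len(xs) - 1)
    left_closer = (x_missing - xs[idx - 1]) <= (xs[idx] - x_missing)
    nearest = np.where(left_closer, idx - 1, idx)
    return float(ys[nearest].mean())

def compare(n, m, k, reps, rng):
    """Empirical MSE of NNM and IPW for nu = t_k, mu = t_{k-1}, eta(x) = |x|."""
    truth = t_abs_mean(k - 1)
    nnm, ipw = [], []
    for _ in range(reps):
        x_n = rng.standard_t(k, size=n)      # non-missing sample from nu
        x_m = rng.standard_t(k - 1, size=m)  # missing-population sample from mu
        y_n = np.abs(x_n)                    # noiseless Y = eta(X)
        nnm.append(nnm_estimate_1d(x_m, x_n, y_n))
        ipw.append(float(np.mean(y_n * t_pdf(x_n, k - 1) / t_pdf(x_n, k))))
    nnm, ipw = np.array(nnm), np.array(ipw)
    return truth, float(np.mean((nnm - truth) ** 2)), float(np.mean((ipw - truth) ** 2))

rng = np.random.default_rng(4)
truth, nnm_mse, ipw_mse = compare(n=256, m=256, k=3.5, reps=200, rng=rng)
print(truth, nnm_mse, ipw_mse)
```

Because the IPW summand has infinite variance in this setting, its empirical MSE is unstable across seeds, so a single run should not be read as a ranking of the two methods.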

Of course, in this example, one would use the trimmed variant of IPW [17], where we replace the IPW weight with $W_{i}=(\mu(X_{i})/\nu(X_{i}))\cdot\mathbf{1}\{\mu(X_{i})/\nu(X_{i})<b_{n}\}$. This trimming introduces bias, but as long as $\int\eta^{2}\nu<\infty$ we can obtain $L_{2}$ consistency by letting $b_{n}\rightarrow\infty$ (perhaps extremely slowly). It is worthwhile to remember that NNM does not require knowledge or an estimate of $\mu/\nu$, while IPW and its trimmed variant do. One can interpret these observations as follows: NNM implicitly trims the importance weight, trading off more bias for less variance.
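The trimming rule $W_{i}=(\mu(X_{i})/\nu(X_{i}))\cdot\mathbf{1}\{\mu(X_{i})/\nu(X_{i})<b_{n}\}$ is a one-liner once the density ratio is available. In the sketch below the function name `trimmed_ipw` and the toy numbers are hypothetical, and the threshold is passed in as a plain argument because the text does not prescribe a schedule for $b_{n}$.

```python
import numpy as np

def trimmed_ipw(eta_vals, ratio, b_n):
    """Average of eta(X_i) * W_i, where W_i keeps the density ratio only when it is below b_n."""
    eta_vals = np.asarray(eta_vals, dtype=float)
    ratio = np.asarray(ratio, dtype=float)
    w = np.where(ratio < b_n, ratio, 0.0)  # weights at or above the threshold are zeroed
    return float(np.mean(eta_vals * w))

# Toy values: the third observation has a large density ratio and is trimmed away.
eta_vals = np.array([1.0, 2.0, 3.0])
ratio = np.array([0.5, 2.0, 10.0])
trimmed = trimmed_ipw(eta_vals, ratio, b_n=5.0)
untrimmed = trimmed_ipw(eta_vals, ratio, b_n=np.inf)
print(trimmed, untrimmed)
```

With the threshold at infinity the function reduces to plain IPW, which makes the bias introduced by a finite threshold easy to see on small examples.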

4.2 Generalizing to separable metric spaces

The restrictiveness of requiring $X$ to be continuous and finite dimensional is striking when we compare these results to what we know about KNN classification [13] and Proto-NN [12]. In this section we will highlight all of the places in which the finite dimensionality (FD) assumption is used in this paper and discuss approaches to generalizing to separable metric spaces.

Noiseless $Y$. For the proof of Theorem 1, the only real place that the FD assumption was used is (5). In fact, we can use a recent result from [12] to establish consistency of the 1NN measure for separable metric spaces, but under significantly more restrictive assumptions than Assumptions 1, 2. In that work, they show (Theorem 3) that ProtoNN is pointwise $L_{1}$-consistent for classification, and in the proof they show that when $\eta$ is bounded AS

$$\int|\eta(X_{(1)}(x))-\eta(x)|\,\nu(x)\,{\rm d}x\rightarrow 0.$$

This is exactly (5) with $q_{1}=1$ but with an additional boundedness assumption. If then the density ratio $\mu(x)/\nu(x)$ is also bounded AS, this implies that $\mathbb{Q}_{1}(\eta)$ is $L_{1}$ -consistent (but not necessarily $L_{2}$ -consistent). Of course, a bounded density ratio and bounded $\eta$ dramatically weaken the result, making it not applicable to estimating expectations, variances, and many other moments, as well as not applicable to distributions such as normals, gammas, betas, etc.

It is worth attempting to weaken these assumptions and establish $L_{2}$ consistency using directly the proof techniques in [13], but there are specific barriers. First, one of the main tools used is the density of Lipschitz functions in $L_{1}(\nu)$ (where now $\nu$ is a Borel measure). However, we would require that Lipschitz functions are dense in $L_{p}(\nu)$ , which has not been established to the best of our knowledge (although we have no counter-example). Furthermore, the boundedness of $\eta$ is used to establish dominated convergence, and it is unclear how to get around this. To the best of our knowledge, establishing (5) under only moment assumptions in separable metric spaces is an open problem. Such a result would also be able to be used to tackle Theorem 2—unknown $\mu$ , noiseless $Y$ .

Finally, the proof of Theorem 4.3 in [10] indicates that (5) may be established for $q_{1}=2$ for bounded functions in metric spaces that satisfy the Besicovitch condition. Of course, the boundedness condition violates our assumptions, but the proof of the extension of Stone’s theorem (Theorem 3.4) contains an infinite dimensional analogue of Lemma 6. However, that result relies on a somewhat opaque condition (iii’) and it is unclear if it can be generalized to $L_{4}$ -convergence, which is needed for Theorem 2. In summary, there are promising approaches to generalizing the noiseless case to metric spaces, however, it is safely outside of the scope of this work.

Noisy $Y$. The proof of Theorem 3 required the use of Lemma 2 (3). This was required to establish the convergence $\mathbb{E}[\sum_{j}M^{2}(S_{j})]\rightarrow 0$, and it is unclear how to do this without our characterization of the $\mu$-measure of Voronoi cells. This condition is unavoidable, because the conditional variance is $\mathbb{V}(\hat{G}|\mathcal{X}_{N})=V\sum_{j}M^{2}(S_{j})$ for known $\mu$ and constant $\mathbb{V}(\epsilon_{i})=V$. As mentioned, these results rely heavily on the FD assumption and continuous $X$, since we appealed to the translation invariance of the Lebesgue measure. Furthermore, the only precedent that we have for characterizing Voronoi cells is [8], which is also in the FD setting. As mentioned, it may be that a weaker result than Lemma 2 would be sufficient.

5 Applications to missing data problems

5.1 Imputation in massive databases

We will consider statistics that are aggregates of non-linear elementwise operations (i.e. empirical moments). Most common aggregations on database tables, such as sum, mean, variance, covariance, and count along with grouping operations and filters can be expressed in this way. Specifically, let $Z\in\mathbb{R}^{d}$ be a partially missing random variable and $g:\mathbb{R}^{p}\times\mathbb{R}^{d}\to\mathbb{R}$ be a possibly non-linear integrable function then we will focus on estimating the following functional,

$$G:=\mathbb{E}[g(X,Z)\mid{\rm missing}],$$

which is the expectation of $g(X,Z)$ for the missing population. For example, suppose we would like to express the following query, select mean(log(Z)) where X < 1 and Z = missing, we could use the function $g(x,z)=1\{x<1\}\log z$ (in this example, $p=d=1$ ). Of course, we are not able to make such a query because it is based on unobserved data. NNM is equivalent to redefining $Y\leftarrow g(X,Z)$ and performing single imputation on the new $Y$ with the nearest neighbor in $X$ space. However, this can be done implicitly by precomputing the NNM weights based on $X$ , and then computing $\hat{G}$ for any arbitrary $g$ (without the need to recompute new weights for new $g$ ). The NNM weights need to be updated only when the index is modified via insert, delete, etc. These aggregate computations can be implemented with search indexing with approximate nearest neighbor, a standard technology for indexing in distributed databases.
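The point about precomputing weights can be illustrated with a toy table. In the sketch below the data, the function name `nnm_weights_1d`, and the brute-force neighbor search are assumptions for illustration (a production system would use an approximate nearest neighbor index, as the text notes). The weights are computed once from $X$ and then reused for two different functions $g$: the query function $g(x,z)=1\{x<1\}\log z$ from the text and the indicator $1\{x<1\}$. Dividing the first aggregate by the second to get a filtered mean is our reading of the SQL-style query, not something stated in the paper.

```python
import numpy as np

def nnm_weights_1d(x_missing, x_observed):
    """Share of missing rows matched to each observed row, for a scalar covariate."""
    x_missing = np.asarray(x_missing, dtype=float)
    x_observed = np.asarray(x_observed, dtype=float)
    dist = np.abs(x_missing[:, None] - x_observed[None, :])
    nearest = dist.argmin(axis=1)
    return np.bincount(nearest, minlength=len(x_observed)) / len(x_missing)

# Hypothetical toy table: Z is observed on two rows and missing on four.
x_observed = np.array([0.5, 1.5])
z_observed = np.array([np.e, np.e**2])
x_missing = np.array([0.4, 0.6, 1.4, 1.6])

# Weights depend only on X, so they are computed once and reused for every g.
weights = nnm_weights_1d(x_missing, x_observed)

in_filter = (x_observed < 1).astype(float)    # g_2(x, z) = 1{x < 1}
g_log = in_filter * np.log(z_observed)        # g_1(x, z) = 1{x < 1} * log z
sum_part = float(weights @ g_log)             # estimate of E[g_1 | missing]
count_part = float(weights @ in_filter)       # estimate of E[g_2 | missing]
filtered_mean = sum_part / count_part         # mean of log Z among missing rows with X < 1
print(weights, sum_part, count_part, filtered_mean)
```

Only the two dot products change from query to query, which is what makes the approach attractive when many aggregates are requested against the same index.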

5.2 Imputation of the trans-Atlantic slave trade

The trans-Atlantic slave trade (TAST), also known as the middle passage, refers to the slave ship voyages that brought African slaves to the Americas. The middle passage is reported to have forcibly migrated over 10 million Africans to the Americas over a time span of roughly three centuries. The number of slaves that embarked from Africa is especially important because other estimates depend on it. For example, when estimating the population of Africa in a given decade, demographers will use population growth models and more recent census data [19]. However, population growth was stifled by the slave trade, and without accounting for it, past populations will tend to be underestimated because the growth rate is overestimated.

The database that we use is the 2010 extended version of the Voyages database [9]. There is a significant amount of missingness throughout the database: $76.5\%$ of the voyages are missing the number of slaves at embarkation, which is the partially missing variable of interest. We apply NNM to compute the total number of slaves taken from Africa using the number of slaves at arrival and the year of the voyage as covariates. In Figure 1, we can see the non-missing data and the 1NN imputed data (each missing $Y$ filled in with its matched value). The NNM estimate of the total number of slaves taken from Africa is $10{,}644{,}376$ , while the MCAR assumption over-estimates this, giving $11{,}569{,}160$ .

5.3 Assessing test loss under covariate shift

When the training and test datasets in supervised learning have different covariate distributions, then we have covariate shift [24]. Let $Y\in\mathbb{R}^{d}$ , $\Omega_{0}$ be the training data, $\Omega_{1}$ the validation data, and $\Omega_{2}$ the test data. By training a predictor $\hat{h}:\mathbb{R}^{p}\to\mathbb{R}^{d}$ on $\Omega_{0}$ , we can consider this fixed and obtain the validation losses $L_{i}=\ell(\hat{h}(X_{i}),Y_{i})$ for each $i\in\Omega_{1}$ . The test error can be estimated using NNM where $L$ is missing on the test data $\Omega_{2}$ and non-missing on $\Omega_{1}$ . Going beyond this, [16] has used NNM to perform domain adaptation where $\hat{h}$ is directly trained using a test error estimate with NNM. However, to demonstrate the validity of this approach we require uniform laws of large numbers, a future direction of research. Similarly, finite sample rates of convergence would be required to establish generalization error bounds. Overall, such results are a natural followup to this work.
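The test-loss estimate described above can be sketched as follows. Everything concrete here is a hypothetical choice for illustration: the synthetic one-dimensional data, the cubic polynomial standing in for $\hat{h}$ , the squared-error loss, and the sorted-search matching. Each test covariate in $\Omega_{2}$ is matched to its nearest validation covariate in $\Omega_{1}$ , and the matched validation losses are averaged.

```python
import numpy as np

rng = np.random.default_rng(1)

def target(x):
    return np.sin(3 * x)

# Hypothetical split: training and validation share a covariate law, test is shifted.
x_train = rng.normal(0.0, 1.0, size=400)
x_val = rng.normal(0.0, 1.0, size=2000)
x_test = rng.normal(0.5, 0.7, size=1000)  # labels treated as unavailable
y_train = target(x_train) + 0.1 * rng.normal(size=400)
y_val = target(x_val) + 0.1 * rng.normal(size=2000)

# Fixed predictor h-hat: cubic least-squares fit on the training data only.
coef = np.polyfit(x_train, y_train, deg=3)

def h_hat(x):
    return np.polyval(coef, x)

# Validation losses L_i, which play the role of the observed Y.
val_loss = (h_hat(x_val) - y_val) ** 2

# 1NN matching in one dimension via a sorted search.
order = np.argsort(x_val)
xs = x_val[order]
pos = np.clip(np.searchsorted(xs, x_test), 1, len(xs) - 1)
left_closer = (x_test - xs[pos - 1]) <= (xs[pos] - x_test)
nearest = order[np.where(left_closer, pos - 1, pos)]

nnm_test_loss = val_loss[nearest].mean()  # NNM estimate of the test loss
naive_val_loss = val_loss.mean()          # ignores the covariate shift
```

Comparing `nnm_test_loss` with `naive_val_loss` shows how much of the loss estimate is driven by the shift in covariates rather than by the predictor itself.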

Appendix A Explanation and examples

We will examine a few examples which put this theory to the test, and see numerically the convergence guaranteed in Theorem 1. Our variance bound in Lemma 3 is determined by the $L^{q_{1}}(\nu)$ norm of $\eta$ and $D_{2q_{0}}(\mu||\nu)$ . It is instructive to go over the outline of the proof of Lemma 2, because the proof indicates which models will yield more slowly diminishing bias than others.

Example 1.

Let $\nu$ be Beta $(1.25,1)$ and $\mu$ be Beta $(.75,1)$ . Then $\mu(x)\propto x^{-0.25}$ and $\nu(x)\propto x^{0.25}$ , hence the density ratio $\mu(x)/\nu(x)\propto x^{-0.5}$ is diverging as $x\rightarrow 0$ . This is an example where $\mu,\nu$ have the same compact support. An unbounded density ratio causes challenges for the 1NN measure because it means that near $0$ there is a significant amount of mass in $\mu$ but few data from $\nu$ to evaluate $\eta$ . We assessed the measure $M(S_{j})$ by Monte Carlo sampling with 1M samples from $\mu$ .

Figure 2 (right) depicts the density ratio and the $M$ -measure of the Voronoi cells. Because the Voronoi cells are random, we have that the measure is only on the average approaching the density ratio, and there is significant spread around the density ratio for a given location $X_{i}=x$ . Let $\eta(x)=x^{-0.25}$ , and we can see that $D_{2}(\mu||\nu)<\infty$ and $\int\eta(X_{1})^{4}<\infty$ , satisfying the assumptions. Despite having unbounded density ratio, $\mathbb{Q}_{1}(\eta)$ converges to its limiting expectation ( $1.5$ ) as we can see in Table 2. We can see that the spread of $M(S_{1})|X_{1}=x$ is greater for the larger density ratios, and furthermore, for finite samples this is biased downward for $x$ near 0.
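Example 1 can be reproduced approximately in a few lines. The sketch below is an illustration rather than the code behind Table 2: the sample sizes `n` and `m`, the seed, and the sorted-search matching are assumptions made here, and $M(S_{j})$ is approximated by the fraction of $\mu$ draws whose nearest neighbor is $X_{j}$ .

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4096, 200_000  # assumed sizes: n draws from nu, m Monte Carlo draws from mu

x_nu = np.sort(rng.beta(1.25, 1.0, size=n))  # observed sample from nu, sorted
x_mu = rng.beta(0.75, 1.0, size=m)           # draws from mu used to approximate M

# Nearest observed point for every mu draw (one-dimensional sorted search).
pos = np.clip(np.searchsorted(x_nu, x_mu), 1, n - 1)
left_closer = (x_mu - x_nu[pos - 1]) <= (x_nu[pos] - x_mu)
nearest = np.where(left_closer, pos - 1, pos)

# Approximate M-measure of each Voronoi cell.
m_hat = np.bincount(nearest, minlength=n) / m

# 1NN measure applied to eta(x) = x^(-0.25); the limit is 1.5.
eta = x_nu ** -0.25
q1 = float(np.sum(m_hat * eta))

# n * m_hat against the density ratio mu/nu = 0.6 * x^(-0.5) gives the scatter in Figure 2.
density_ratio = 0.6 * x_nu ** -0.5
```

The value of `q1` should land close to, and typically slightly below, the limit $1.5$ , reflecting the downward finite-sample bias near $0$ described above.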

Example 2.

Let $\mu$ be Gaussian $(0,\sigma^{2}=2.1)$ , $\nu$ be Gaussian $(0,1)$ , and $\eta(x)=x^{2}$ . The estimation of $\int x^{2}{\rm d}M$ is natural as the second moment of the unobserved population. This is an example where both densities are fully supported over $\mathbb{R}$ . The density ratio, $\mu(x)/\nu(x)\propto\exp(0.262x^{2})$ , is not only unbounded but growing exponentially. We can see from Figure 2 that near the origin the spread of $M(S_{j})$ is low, but far from the origin there is a larger spread and downward bias (in the finite sample). Due to this bias, the convergence of this example to its expectation is somewhat slower with a $0.8\%$ relative error at 10K samples (Table 2).
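As a quick check of the exponent, under the stated variances the density ratio is

$$\frac{\mu(x)}{\nu(x)}=\frac{(2\pi\cdot 2.1)^{-1/2}\exp\big(-x^{2}/(2\cdot 2.1)\big)}{(2\pi)^{-1/2}\exp\big(-x^{2}/2\big)}=2.1^{-1/2}\exp\Big(\Big(\frac{1}{2}-\frac{1}{4.2}\Big)x^{2}\Big)\approx 0.690\,\exp(0.262\,x^{2}),$$

and the limiting value of the estimate is the second moment under $\mu$ , namely $\int x^{2}\,{\rm d}\mu=\sigma^{2}=2.1$ .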

Example 3.

In order to see the effect of non-uniform convergence of the LDT we will use a pedagogical construction, the fat Cantor set (the Smith-Volterra-Cantor set). This set is constructed by the following algorithm: start with $\mathcal{C}=[0,1]$ ; for each $l=1,2,\ldots$ remove the middle $1/4^{l}$ of the remaining intervals, thereby splitting each interval into two parts. In simulation, we only perform 5 iterations due to our fine grid. The remaining set $\mathcal{C}$ has $\lambda$ measure of $1/2$ but does not contain any open intervals (it is entirely boundary and has no interior). Let $\nu$ be uniform $(0,1)$ and $\mu$ be uniform $(\mathcal{C})$ . We can make $\eta(x)=2\cdot 1\{x\in\mathcal{C}\}$ and so $\int\eta{\rm d}M=2$ . This example has bounded $\mu,\nu$ over a compact domain, and bounded $\eta$ .
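The construction can be sketched with exact rational arithmetic. This assumes the standard Smith-Volterra-Cantor reading of the algorithm, in which step $l$ removes an open interval of length $1/4^{l}$ from the centre of each remaining interval; the function names are hypothetical.

```python
from fractions import Fraction

def fat_cantor_intervals(iterations):
    """Closed intervals left after the given number of removal steps."""
    intervals = [(Fraction(0), Fraction(1))]
    for level in range(1, iterations + 1):
        gap = Fraction(1, 4 ** level)  # length removed from each interval at this step
        next_intervals = []
        for a, b in intervals:
            mid = (a + b) / 2
            next_intervals.append((a, mid - gap / 2))
            next_intervals.append((mid + gap / 2, b))
        intervals = next_intervals
    return intervals

def eta(x, intervals):
    """eta(x) = 2 * 1{x in C} for the finite-iteration approximation of C."""
    return 2 if any(a <= x <= b for a, b in intervals) else 0

intervals = fat_cantor_intervals(5)        # 5 iterations, as in the simulation
measure = sum(b - a for a, b in intervals)  # Lebesgue measure of the approximation
```

After 5 iterations there are 32 intervals with total length $33/64$ , slightly above the limiting value $1/2$ , since the removed length through step $L$ is $\sum_{l=1}^{L}2^{l-1}/4^{l}=\frac{1}{2}(1-2^{-L})$ .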

The fractal nature of this example causes non-uniform convergence of the LDT because we know that the $M$ measure of a small enough interval around $x$ approaches either $0$ (if $x\notin\mathcal{C}$ ) or $1$ (if $x\in\mathcal{C}$ ). However, the fat Cantor set looks from afar as if it does have low and high density regions, and this is manifested in the fact that for $x$ within small intervals that were removed, $M(S_{1})|X_{1}=x$ is non-zero. In the subfigure to the right of Figure 3, we can see the density ratio is $0$ in small intervals but, because these are surrounded by elements within $\mathcal{C}$ , the Voronoi cells have large measure, $M(S_{i})$ . Due to the fractal nature of the fat Cantor set, for any sample size $n$ , this effect will always be manifested at some location at a small enough scale.

Regardless of this non-uniform convergence of the LDT, we observe that $\mathbb{Q}_{1}(\eta)$ converges to its limit, because these regions where the LDT has not yet converged are increasingly small. We see in Table 2 that with 10K samples, we achieve a relative error of $0.15\%$ .

Appendix B Lemmata

Lemma 4 ([4] Lemma 9.1).

Suppose that $X,X_{1},\ldots,X_{n}$ are drawn iid from a measure with a density in $\mathbb{R}^{p}$ . Let $g:\mathbb{R}^{p}\rightarrow\mathbb{R}$ be a Borel measurable function such that $\mathbb{E}|g(X)|^{q}<\infty$ . Then

[TABLE]

where $\gamma_{p}$ is a universal constant depending on dimension $p$ .

Lemma 5 (Stone’s Lemma, [25]; [4] Lemma 10.7).

Suppose that $X,X_{1},\ldots,X_{n}$ are drawn iid from $N$ (a measure over the Borel $\sigma$ -field on $\mathbb{R}^{p}$ ), and let $X_{(k)}(X)$ denote the $k$ NN of $X$ within $X_{1},\ldots,X_{n}$ . Let $v_{1},\ldots,v_{n}$ denote a probability weight vector such that $v_{1}\geq\ldots\geq v_{n}$ . Let $g:\mathbb{R}^{p}\rightarrow\mathbb{R}$ be a Borel measurable function such that $\mathbb{E}|g(X)|<\infty$ . Then

[TABLE]

where $\gamma_{p}$ is the minimum number of cones of angle $\pi/12$ that cover $\mathbb{R}^{p}$ .

Lemma 6 ([4] Lemma 10.2).

Suppose that $X,X_{1},\ldots,X_{n}$ are drawn iid from $N$ (a measure over the Borel $\sigma$ -field on $\mathbb{R}^{p}$ ), and let $X_{(k)}(X)$ denote the $k$ NN of $X$ within $X_{1},\ldots,X_{n}$ . Let $q\geq 1$ , and $g:\mathbb{R}^{p}\rightarrow\mathbb{R}$ be a Borel measurable function such that $\mathbb{E}|g(X)|^{q}<\infty$ . Suppose that the following conditions hold:

(i)

There is a $C$ such that for every Borel measurable $g$ , for all $n\geq 1$ ,

[TABLE]

(ii)

There is a constant $D\geq 1$ such that for all $n\geq 1$ ,

[TABLE]

(iii)

For all $a>0$ ,

[TABLE]

Then

[TABLE]

Lemma 7.

Suppose that $X,X_{1},\ldots,X_{n}$ are drawn iid from $N$ (a measure over the Borel $\sigma$ -field on $\mathbb{R}^{p}$ ), and let $X_{(k)}(X)$ denote the $k$ NN of $X$ within $X_{1},\ldots,X_{n}$ . Let $q\geq 1$ , and $g:\mathbb{R}^{p}\rightarrow\mathbb{R}$ be a Borel measurable function such that $\mathbb{E}|g(X)|^{q}<\infty$ , then

[TABLE]

Proof.

Let $W_{ni}(X)=\frac{1}{2}(1\{X_{(1)}(X)=X_{i}\}-1\{X_{(2)}(X)=X_{i}\})$ , then letting $v_{1}=1/2,v_{2}=1/2$ we see that condition (i) in Lemma 6 holds by Lemma 5. (ii) holds trivially by selecting $D=1$ . (iii) holds by Lemma 2.2 in [4] which states that for $x\in{\rm supp}(\mu)$ , $\|X_{(k)}-x\|\rightarrow 0$ almost surely (for $k/n\rightarrow 0$ ). ∎

Appendix C Proofs of main results

Proof of Lemma 1.

Let $\mathcal{A}$ be the set of all $x$ such that for some $x^{\prime}\neq x$ , $\Pi(B(x^{\prime},\|x^{\prime}-x\|))=0$ , and call the set of all such balls, $\mathcal{F}$ . Let $\mathcal{Z}$ be

[TABLE]

Since it is the union of open sets, $\mathcal{Z}$ is open, and by the Lindelöf Covering Theorem, there is a countable subset of $\mathcal{F}$ , $\mathcal{G}$ , such that the interiors of the balls cover $\mathcal{Z}$ . Thus, by countable subadditivity of measures,

[TABLE]

We have that $\mathcal{A}\backslash\mathcal{Z}$ is $\sigma$ -porous, which means that there is an $\alpha\in(0,1)$ such that for every element $x\in\mathcal{A}\backslash\mathcal{Z}$ there is an $r_{0}>0$ with $1/r_{0}\in\mathbb{Z}$ such that for any $r<r_{0}$ , there exists a $y\in\mathbb{R}^{p}$ with

[TABLE]

To see this, let $y$ be on the segment between $x^{\prime}$ and $x$ in the above construction and $\alpha\leq 1/2$ . By the Lebesgue differentiation theorem, porous sets have Lebesgue measure $0$ [26]. Hence, $\Pi(\mathcal{A}\backslash\mathcal{Z})=0$ since $\sigma$ -porous sets are countable unions of porous sets, by countable subadditivity, and $\Pi\ll\lambda$ . Let $\mathcal{X}^{C}=\mathcal{A}$ , and we have that $\Pi(\mathcal{X})\geq 1-\Pi(\mathcal{A}\backslash\mathcal{Z})-\Pi(\mathcal{Z})=1$ .

We will show (2) by supposing the contrary, that for some $x\in\mathcal{X}$ and $\delta>0$ , for every $\gamma>0$ there exists an $x^{\prime}\neq x$ such that $\|x-x^{\prime}\|\geq\delta$ and $\Pi(B(x^{\prime},\|x^{\prime}-x\|))\leq\gamma$ . This implies that there exists a sequence of points $\{z_{l}\}_{l=1}^{\infty}$ , such that $\|z_{l}-x\|=\delta$ and $\Pi(B(z_{l},\delta))\leq 1/l^{2}$ . Define $A_{m}=\cup_{l=m}^{\infty}B(z_{l},\delta)$ ; then we have that $\Pi(A_{m})\rightarrow 0$ as $m\rightarrow\infty$ . By the Bolzano-Weierstrass theorem there exists an accumulation point of $z_{l}$ , $z^{\prime}$ with $\|z^{\prime}-x\|=\delta$ (by continuity of $\|.\|$ ). The interior of $B(z^{\prime},\delta)$ is contained in $A_{m}$ for all $m$ . By absolute continuity with respect to Lebesgue measure, $\Pi({\rm int}(B(z^{\prime},\delta)))=\Pi(B(z^{\prime},\delta))=\Pi(B(z^{\prime},\|z^{\prime}-x\|))>0$ by the fact that $x\in\mathcal{X}$ . This contradicts the fact that $\Pi(A_{m})\rightarrow 0$ . ∎

Proof of Lemma 2.

Throughout, let $C$ be some constant and $x$ be a Lebesgue point as in Lemma 1 (for $N$ ) within ${\rm supp}(M)$ . Let $X_{0}\sim M$ and notice that

[TABLE]

where $Z(x)=N(B(X_{0},\|X_{0}-x\|))$ . By integration by parts,

[TABLE]

By the Lebesgue differentiation theorem,

[TABLE]

Notice that if $\|x_{0}-x\|\rightarrow 0$ , the sets, $B(x_{0},\|x_{0}-x\|)$ converges regularly to $x$ , in the sense that

[TABLE]

and by the doubling property of $\lambda$ ,

[TABLE]

where $C_{p}$ is a constant based on dimension, $p$ . Hence,

[TABLE]

as $\delta\rightarrow 0$ again by the LDT. Because $\lambda(B(x,\|x_{0}-x\|))=\lambda(B(x_{0},\|x_{0}-x\|))$ we have that,

[TABLE]

For $\gamma>0$ let $\delta$ be such that for any $x_{0}$ with $\|x_{0}-x\|\leq\delta$ ,

[TABLE]

Let $\eta\in(0,1)$ be the constant guaranteed by Lemma 1 (ii) based on $x,\delta$ .

[TABLE]

as $n\rightarrow\infty$ .

Thus, if we denote $Z_{0}(x)=M(B(x,\|X_{0}-x\|))$ ,

[TABLE]

Because $Z_{0}(x)$ follows a uniform $(0,1)$ distribution then

[TABLE]

for $n\rightarrow\infty$ . Similarly,

[TABLE]

Hence, by setting $\gamma$ arbitrarily small,

[TABLE]

In order to establish (3), we will follow a similar procedure. Let $X_{0},X_{0}^{\prime}\sim M$ independently.

[TABLE]

where $Z_{2}(x)=N(B(X_{0},\|X_{0}-x\|)\cup B(X_{0}^{\prime},\|X_{0}^{\prime}-x\|))$ . Define

[TABLE]

then $Z_{2}(x)\geq\tilde{Z}_{2}(x)$ . As before, by integration by parts,

[TABLE]

By (6), for any $\gamma>0$ we can select a $\delta$ such that for any $x_{0},x_{0}^{\prime}\in B(x,\delta)$ ,

[TABLE]

Let $\eta$ be selected as before,

[TABLE]

where $Z_{3}(x)=\max\{M(B(x,\|X_{0}-x\|)),M(B(x,\|X_{0}^{\prime}-x\|))\}$ . The elements in the maximum are independent uniform $(0,1)$ random variables, and so the maximum has a $\sqrt{U}$ distribution for uniform $U$ .

[TABLE]

Also, as before

[TABLE]

Finally, by setting $\gamma$ arbitrarily small

[TABLE]

∎

Alternative proof of Theorem 4.

Consider

[TABLE]

We have pointwise convergence by (2),

[TABLE]

almost everywhere, and the RHS has expectation $\int\eta\mu$ . We can establish dominated convergence by

[TABLE]

where $q_{0},q_{1}$ are Hölder conjugates. By assumption the first term on the RHS is bounded; what remains is to bound the second term. This can be established using theory developed primarily in [25]. A direct application of Lemma 4 to (7) concludes our proof. ∎

Proof of Lemma 3.

We will appeal to the Efron-Stein inequality, which states the following: Let $\mathbf{X}^{\prime}=(X^{\prime}_{1},\ldots,X^{\prime}_{n})$ be an iid copy of $\mathbf{X}=(X_{1},\ldots,X_{n})$ and $\mathbf{X}^{(i)}=(X_{1},\ldots,X_{i-1},X_{i}^{\prime},X_{i+1},\ldots,X_{n})$ , then for any function $F(\mathbf{X})$

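$$\mathbb{V}(F(\mathbf{X}))\leq\frac{1}{2}\sum_{i=1}^{n}\mathbb{E}\left[\left(F(\mathbf{X})-F(\mathbf{X}^{(i)})\right)^{2}\right].$$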

Let $F(\mathbf{X})=\mathbb{Q}_{1}(\eta)$ and denote $\mathbb{Q}^{(i)}$ as the 1NN measure formed from the data, $\mathbf{X}^{(i)}$ . Due to exchangeability,

[TABLE]

Let $X^{-}_{(1)}(x)$ and $\mathbb{Q}^{-}$ denote the 1NN within and the 1NN measure formed from the reduced data $X_{2},\ldots,X_{n}$ . We have that

[TABLE]

In order for $\eta(X^{-}_{(1)}(x))$ to differ from $\eta(X_{(1)}(x))$ it must be that $X_{(1)}(x)=X_{1}$ and $X^{-}_{(1)}(x)=X_{(2)}(x)$ . Thus,

[TABLE]

and so,

[TABLE]

Let $f=\mu/\nu$ be the density ratio. Considering this term,

[TABLE]

∎

Proof of Theorem 2.

The random vector $(m\hat{M}(S_{j}))_{j=1}^{n}$ is multinomial $(m,(M(S_{j}))_{j=1}^{n})$ conditional on $\mathcal{X}_{N}$ . The MSE

[TABLE]

The second term converges to $0$ by Theorem 1.

[TABLE]

The conditional variance is

[TABLE]

Hence, under Assumptions 1, 2,

[TABLE]

by Theorem 4. (Notice that Theorem 4 only requires the $q_{1}$ moment bound of the test function, which is satisfied for $\eta^{2}$ by Assumption 2.) ∎

Proof of Theorem 3.

Define $\tilde{G}=\sum_{j}\hat{M}(S_{j})\cdot\eta(X_{j})$ ( $\hat{G}$ in the noiseless setting). Let $\mathcal{X}=\mathcal{X}_{N}\cup\mathcal{X}_{M}$ be all of the covariates,

[TABLE]

The last term converges to $0$ by Theorem 2. The inner term is dominated because

[TABLE]

because $\sum_{j}\hat{M}^{2}(S_{j})\leq\sum_{j}\hat{M}(S_{j})=1$ . Consider

[TABLE]

Because $m\hat{M}(S)$ is binomial $(m,M(S))$ for fixed $S$ we have that,

[TABLE]

Since $n\mathbb{E}[M(S_{1})]=1$ we have that $\frac{n}{m}\mathbb{E}[M(S_{1})]\rightarrow 0$ if $m\rightarrow\infty$ . We have by Lemma 2 and dominated convergence,

[TABLE]

Hence,

[TABLE]

∎

Bibliography

[1] A. Abadie, J. L. Herr, G. Imbens, and D. M. Drukker. Nnmatch: Stata module to compute nearest-neighbor bias-corrected estimators, 2004.
[2] A. Abadie and G. W. Imbens. Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1):235–267, 2006.
[3] A. Abadie and G. W. Imbens. Bias-corrected matching estimators for average treatment effects. Journal of Business & Economic Statistics, 29(1):1–11, 2011.
[4] G. Biau and L. Devroye. Lectures on the nearest neighbor method. Springer, 2015.
[5] M. Busso, J. DiNardo, and J. McCrary. New evidence on the finite sample properties of propensity score reweighting and matching estimators. Review of Economics and Statistics, 96(5):885–897, 2014.
[6] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
[7] Z. Dang, C. Deng, X. Yang, K. Wei, and H. Huang. Nearest neighbor matching for deep clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13693–13702, 2021.
[8] L. Devroye, L. Györfi, G. Lugosi, and H. Walk. On the measure of Voronoi cells. Journal of Applied Probability, 54(2):394–408, 2017.
