Local nearest neighbour classification with applications to semi-supervised learning
Timothy I. Cannings, Thomas B. Berrett, Richard J. Samworth

TL;DR
This paper introduces a new asymptotic analysis of local-$k$-nearest neighbour classifiers, revealing conditions for optimal excess risk rates and proposing a semi-supervised variant that adapts to feature density estimates.
Contribution
It derives an asymptotic expansion for the excess risk of local-$k$-NN, proposes a semi-supervised classifier with improved convergence rates, and establishes minimax optimality of the method.
Findings
Achieves an $O(n^{-4/(d+4)})$ excess risk rate under certain conditions.
Semi-supervised local-$k$-NN attains minimax optimal rates.
Simulation confirms theoretical advantages.
| $d$ | Bayes risk | $n$ | $k$nn risk | O$k$nn risk | SS$k$nn risk | O RR | SS RR |
|---|---|---|---|---|---|---|---|
| Setting 1 | | | | | | | |
| 1 | 22.67 | 50 | | | | | |
| | | 200 | | | | | |
| | | 1000 | | | | | |
| 2 | 13.30 | 50 | | | | | |
| | | 200 | | | | | |
| | | 1000 | | | | | |
| 5 | 3.53 | 50 | | | | | |
| | | 200 | | | | | |
| | | 1000 | | | | | |
| Setting 2 | | | | | | | |
| 1 | 31.16 | 50 | | | | | |
| | | 200 | | | | | |
| | | 1000 | | | | | |
| 2 | 31.15 | 50 | | | | | |
| | | 200 | | | | | |
| | | 1000 | | | | | |
| 5 | 20.10 | 50 | | | | | |
| | | 200 | | | | | |
| | | 1000 | | | | | |
| Setting 3 | | | | | | | |
| 1 | 37.44 | 50 | | | | | |
| | | 200 | | | | | |
| | | 1000 | | | | | |
| 2 | 37.45 | 50 | | | | | |
| | | 200 | | | | | |
| | | 1000 | | | | | |
| 5 | 23.23 | 50 | | | | | |
| | | 200 | | | | | |
| | | 1000 | | | | | |
Timothy I. Cannings, University of Edinburgh (http://www.maths.ed.ac.uk/%7Etcannings)
Thomas B. Berrett, University of Cambridge (www.statslab.cam.ac.uk/%7Etbb26)
Richard J. Samworth, University of Cambridge (www.statslab.cam.ac.uk/%7Erjs57)
School of Mathematics, James Clerk Maxwell Building, Peter Guthrie Tait Road, Edinburgh EH9 3FD
Statistical Laboratory, Centre for Mathematical Sciences, Wilberforce Road, Cambridge CB3 0WB
Abstract
We derive a new asymptotic expansion for the global excess risk of a local-$k$-nearest neighbour classifier, where the choice of $k$ may depend upon the test point. This expansion elucidates conditions under which the dominant contribution to the excess risk comes from the decision boundary of the optimal Bayes classifier, but we also show that if these conditions are not satisfied, then the dominant contribution may arise from the tails of the marginal distribution of the features. Moreover, we prove that, provided the $d$-dimensional marginal distribution of the features has a finite $\rho$th moment for some $\rho>4$ (as well as other regularity conditions), a local choice of $k$ can yield a rate of convergence of the excess risk of $O(n^{-4/(d+4)})$, where $n$ is the sample size, whereas for the standard $k$-nearest neighbour classifier, our theory would require $d\geq 5$ and $\rho>4d/(d-4)$ finite moments to achieve this rate. These results motivate a new $k$-nearest neighbour classifier for semi-supervised learning problems, where the unlabelled data are used to obtain an estimate of the marginal feature density, and fewer neighbours are used for classification when this density estimate is small. Our worst-case rates are complemented by a minimax lower bound, which reveals that the local, semi-supervised $k$-nearest neighbour classifier attains the minimax optimal rate over our classes for the excess risk, up to a subpolynomial factor in $n$. These theoretical improvements over the standard $k$-nearest neighbour classifier are also illustrated through a simulation study.
MSC subject classification: 62G20
Keywords: classification problems, nearest neighbours, nonparametric classification, semi-supervised learning
arXiv:1704.00642
Research supported by an Engineering and Physical Sciences Research Council (EPSRC) programme grant.
Research supported by an EPSRC Fellowship and programme grant, as well as a grant from the Leverhulme Trust.
1 Introduction
Supervised classification problems represent some of the most frequently-occurring statistical challenges in a wide variety of fields, including fraud detection, medical diagnoses and targeted advertising, to name just a few. The area has received an enormous amount of attention within both the statistics and machine learning communities; for an excellent survey with pointers to much of the relevant literature, see Boucheron et al. (2005).
The $k$-nearest neighbour classifier, which assigns the test point according to a majority vote over the classes of its $k$ nearest points in the training set, was introduced in the seminal work of Fix and Hodges (1951) (later republished as Fix and Hodges (1989)), and is arguably the simplest and most intuitive nonparametric classifier. Cover and Hart (1967) provided mild conditions under which the asymptotic risk of the $1$-nearest neighbour classifier is bounded above by twice the risk of the optimal Bayes classifier. Stone (1977) proved that if $k=k_n$ is chosen such that $k\rightarrow\infty$ and $k/n\rightarrow 0$ as $n\rightarrow\infty$, then the $k$-nearest neighbour classifier is universally consistent, in the sense that under any data generating mechanism, its risk converges to the Bayes risk. Further recent contributions, some of which treat the $k$-nearest neighbour classifier as a special case of a plug-in classifier, include Kulkarni and Posner (1995), Audibert and Tsybakov (2007), Hall et al. (2008), Biau et al. (2010), Samworth (2012), Chaudhuri and Dasgupta (2014) and Celisse and Mary-Huard (2018). Nearest neighbour methods have also been extensively used in other statistical problems, including density estimation (Loftsgaarden and Quesenberry, 1965; Mack and Rosenblatt, 1979; Mack, 1983), nonparametric clustering (Heckel and Bölcskei, 2015), entropy and other functional estimation (Kozachenko and Leonenko, 1987; Berrett et al., 2019; Berrett and Samworth, 2019a) and testing problems (Schilling, 1986; Berrett and Samworth, 2019b); see also the recent book Biau and Devroye (2015).
Despite these works, the behaviour of the $k$-nearest neighbour classifier in the tails of a distribution remains poorly understood. Indeed, writing $(X,Y)$ for a generic data pair, where the $d$-dimensional feature vector $X$ has marginal density $\bar{f}$ and $Y$ denotes a binary class label, most of the results in the papers mentioned in the previous paragraph pertain either to situations where $\bar{f}$ is compactly supported and bounded away from zero on its support, or where the excess risk over that of the Bayes classifier is computed only over a compact subset of $\mathbb{R}^d$. As such, many questions remain regarding the effect of tail behaviour on the excess risk.
In this paper, we consider classes of distributions that allow the feature vectors to have unbounded support. Our first goal is to provide a new asymptotic expansion for the global excess risk of a $k$-nearest neighbour classifier, whose error term can be bounded uniformly over our classes (Theorem 1). This expansion elucidates conditions under which the dominant contribution to the excess risk comes from the decision boundary of the Bayes classifier, but we also show that if these conditions are not satisfied, then the dominant contribution may arise from the tails of the marginal distribution of the features. The threshold for these two different regimes is governed by a parameter $\rho$ that controls the number of finite moments of the marginal feature distribution: if $d\geq 5$ and $\rho>4d/(d-4)$, then we obtain a rate of $O(n^{-4/(d+4)})$ uniformly over our classes, while if $d\leq 4$, or $d\geq 5$ and $\rho\leq 4d/(d-4)$, then our rate is slower, namely $O(n^{-\rho/(2\rho+d)+\epsilon})$, for every $\epsilon>0$.
The proof of Theorem 1 also reveals a local bias-variance trade-off that motivates a modification of the standard -nearest neighbour classifier in semi-supervised learning settings, where, as well as the labelled training data, we have access to another, independent, sample of unlabelled observations. Such semi-supervised problems occur in a wide range of applications, especially where it is expensive or time-consuming to obtain the labels associated with observations; in fact, it is often the case that unlabelled observations may vastly outnumber labelled ones. For an overview of semi-supervised learning applications and techniques, see Chapelle et al. (2006).
Our second contribution is to allow the choice of $k$ in $k$-nearest neighbour classification to depend on an estimate of $\bar{f}$ at the test point in semi-supervised settings. Such a local choice of $k$ is analogous to the use of local bandwidths in the context of kernel density estimation, as studied by, e.g., Breiman et al. (1977), Abramson (1982) and Giné and Sang (2010). However, for density estimation, it is more common to choose a family of bandwidths $h(X_i)$ rather than $h(x)$, to ensure that the resulting estimate is itself a density. Moreover, theory there suggests that one should then choose $h(X_i)\propto\bar{f}(X_i)^{-1/2}$ in order to cancel the leading term in the asymptotic bias expansion (Abramson, 1982). By contrast, we find that when choosing $k$, by using fewer neighbours in low density regions, we are able to achieve a better balance in the local bias-variance trade-off for estimating our main quantity of interest, namely the regression function. In particular, we initially study an oracle choice of $k$ that depends on $\bar{f}$, and show that the excess risk of the resulting classifier, computed over the whole of $\mathbb{R}^d$, is $O(n^{-4/(d+4)})$, again uniformly over our classes, for every $d$ and provided only that $\rho>4$. Moreover, in the more challenging case where $\rho\leq 4$, we obtain a rate of $O(n^{-\rho/(\rho+d)+\epsilon})$, for every $\epsilon>0$, which still reflects an improvement through the locally-adaptive choice of $k$. Assuming further that $\bar{f}$ has Hölder smoothness $\gamma$, we show that if $m$ additional, unlabelled observations are used to estimate $\bar{f}$ by $\hat{f}_m$, and if $m$ is sufficiently large relative to $n$, then our semi-supervised $k$-nearest-neighbour classifier mimics the asymptotic performance of the oracle.
Finally, we consider corresponding minimax lower bounds. We show in particular that the rates of convergence achieved by our semi-supervised, local-$k$-nearest neighbour classifier are optimal up to subpolynomial factors in $n$. Interestingly, our arguments also reveal that these rates cannot be improved with the additional knowledge of $\bar{f}$.
As mentioned previously, studies of global excess risk rates of convergence in nonparametric classification for unbounded feature vector distributions are comparatively rare. Hall and Kang (2005) studied the tail error properties of a classifier based on kernel density estimates of the class conditional densities for univariate data. As an illustrative example, they showed that if, for large , one class has density , while the other has density , for some and , then the excess risk from the right tail is of larger order than that in the body of the distribution.
Perhaps most closely related to this work, Gadat et al. (2016) recently obtained upper bounds on the supremum excess risk of the $k$-nearest neighbour classifier, over classes where $\eta$ is Lipschitz, the well-known margin assumption of Mammen and Tsybakov (1999) is satisfied with parameter , and assuming the tail condition that is satisfied for some function and sufficiently small . Gadat et al. (2016) obtained a minimax lower bound over these classes, as well as providing an upper bound for the rate of the standard $k$-nearest neighbour classifier. Since these rates do not match, they further introduced regions of the form $\bigl\{\bar{f}^{-1}\bigl((a_{j+1},a_{j}]\bigr):j\in\mathbb{N}\bigr\}$ with , and proved that when we choose and specialise to the case where is the identity function, the resulting sliced $k$-nearest neighbour classifier attains the minimax optimal rate of up to a polylogarithmic factor in $n$. Neither our smoothness and tail assumptions, nor our conclusions are directly comparable with the work of Gadat et al. (2016). In particular, we make a stronger smoothness assumption on $\eta$ in a neighbourhood of the Bayes decision boundary, implying that the margin assumption holds with parameter ; see Lemma A.12 in Appendix A. This enables us to show that our semi-supervised classifier attains faster rates than are achievable under just a Lipschitz condition, and that these rates are minimax optimal up to subpolynomial factors in $n$, over all possible values of our tail parameter $\rho$; moreover, we are also able to provide the leading constants in the asymptotic expansion of the excess risk in some cases.
The remainder of this paper is organised as follows. After introducing our setting in Section 2, we present in Section 3 our main results for the standard $k$-nearest neighbour classifier. This leads on, in Section 4, to our study of the semi-supervised setting, where we derive asymptotic results for the excess risk of our local-$k$-nearest neighbour classifier. Our minimax lower bound is presented in Section 5. The main arguments of the proofs of our theoretical results are given in Section 6, while in the appendices, we prove several claims made in the main text, bound various remainder terms, illustrate the finite-sample benefits of the semi-supervised classifier over the standard $k$-nearest neighbour classifier in a simulation study and provide an introduction to the ideas of differential geometry that underpin much of our analysis.
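Finally we fix here some notation used throughout the paper. Let $\|\cdot\|$ denote the Euclidean norm and, for $r>0$ and $x\in\mathbb{R}^d$, let $B_r(x)$ and $\bar{B}_r(x)$ denote respectively the open and closed Euclidean balls of radius $r$ centred at $x$. Let $a_d:=\pi^{d/2}/\Gamma(1+d/2)$ denote the $d$-dimensional Lebesgue measure of $B_1(0)$. For a real-valued function $g$ defined on $A\subseteq\mathbb{R}^d$ that is twice differentiable at $x$, write $\dot{g}(x)=\bigl(g_1(x),\ldots,g_d(x)\bigr)^T$ and $\ddot{g}(x)=\bigl(g_{jk}(x)\bigr)$ for its gradient vector and Hessian matrix at $x$, and let $\|g\|_\infty:=\sup_{x\in A}|g(x)|$. We write $\|\cdot\|_{\mathrm{op}}$ for the operator norm of a matrix.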
2 Statistical setting
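Let $(X,Y),(X_1,Y_1),\ldots,(X_{n+m},Y_{n+m})$ be independent and identically distributed random pairs taking values in $\mathbb{R}^d\times\{0,1\}$. Let $\pi_r:=\mathbb{P}(Y=r)$, for $r=0,1$, and $X\,|\,\{Y=r\}\sim P_r$, for $r=0,1$, where $P_r$ is a probability measure on $\mathbb{R}^d$. Let $\eta(x):=\mathbb{P}(Y=1\,|\,X=x)$ denote the regression function and $P_X:=\pi_0P_0+\pi_1P_1$ denote the marginal distribution of $X$. We observe labelled training data, $\mathcal{T}_n:=\{(X_1,Y_1),\ldots,(X_n,Y_n)\}$, and unlabelled training data, $\mathcal{T}_m':=\{X_{n+1},\ldots,X_{n+m}\}$, and are presented with the task of assigning the test point $X$ to either class 0 or 1.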
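A classifier is a Borel measurable function $C:\mathbb{R}^d\rightarrow\{0,1\}$, with the interpretation that $C$ assigns $x\in\mathbb{R}^d$ to the class $C(x)$. Given a Borel measurable set $\mathcal{R}\subseteq\mathbb{R}^d$, the misclassification rate, or risk, over $\mathcal{R}$ is
$$R_{\mathcal{R}}(C):=\mathbb{P}\bigl[\{C(X)\neq Y\}\cap\{X\in\mathcal{R}\}\bigr].$$
When $\mathcal{R}=\mathbb{R}^d$, we drop the subscript for convenience. The Bayes classifier
$$C^{\mathrm{Bayes}}(x):=\begin{cases}1&\text{if }\eta(x)\geq 1/2;\\0&\text{otherwise},\end{cases}$$
minimises the risk over any region $\mathcal{R}$ (Devroye et al., 1996, p. 20). The performance of a classifier $C$ is therefore measured via its excess risk, $R_{\mathcal{R}}(C)-R_{\mathcal{R}}(C^{\mathrm{Bayes}})$.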
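As a concrete illustration of these definitions, the following Python sketch evaluates the Bayes classifier and Monte Carlo estimates of the risk in a toy one-dimensional model. The model (equal prior probabilities, with $N(0,1)$ and $N(\mu,1)$ class-conditional distributions) and all function names are assumptions made for this example; they are not one of the paper's simulation settings.

```python
import math
import random


def eta(x, mu=1.0):
    """Regression function P(Y = 1 | X = x) in the assumed toy model."""
    # log f1(x) - log f0(x) for N(mu, 1) against N(0, 1), with equal priors.
    log_ratio = mu * x - mu * mu / 2.0
    return 1.0 / (1.0 + math.exp(-log_ratio))


def bayes_classifier(x, mu=1.0):
    """Assign class 1 exactly when eta(x) >= 1/2."""
    return 1 if eta(x, mu) >= 0.5 else 0


def monte_carlo_risk(classifier, n_draws=100_000, mu=1.0, seed=0):
    """Estimate the misclassification rate P(C(X) != Y) by simulation."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_draws):
        y = 1 if rng.random() < 0.5 else 0
        x = rng.gauss(mu if y == 1 else 0.0, 1.0)
        errors += classifier(x) != y
    return errors / n_draws


def exact_bayes_risk(mu=1.0):
    """Closed form Phi(-mu / 2) for the Bayes risk in the toy model."""
    return 0.5 * math.erfc(mu / (2.0 * math.sqrt(2.0)))


if __name__ == "__main__":
    bayes = monte_carlo_risk(bayes_classifier)
    naive = monte_carlo_risk(lambda x: 1 if x >= 0.0 else 0)
    print("Bayes risk (simulated, exact):", bayes, exact_bayes_risk())
    print("Excess risk of the naive rule:", naive - bayes)
```

The excess risk of any other rule, such as the naive threshold at zero used above, is then the difference between its estimated risk and that of the Bayes classifier.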
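We can now formally define the local-$k$-nearest neighbour classifier, which allows the number of neighbours considered to vary depending on the location of the test point. Suppose $k_{\mathrm{L}}:\mathbb{R}^d\rightarrow\{1,\ldots,n\}$ is measurable. Given the test point $x\in\mathbb{R}^d$, let $(X_{(1)},Y_{(1)}),\ldots,(X_{(n)},Y_{(n)})$ be a reordering of the training data such that $\|X_{(1)}-x\|\leq\cdots\leq\|X_{(n)}-x\|$. We will later assume that $P_X$ is absolutely continuous with respect to $d$-dimensional Lebesgue measure, which ensures that ties occur with probability zero; where helpful for clarity, we also write $X_{(i)}(x)$ for the $i$th nearest neighbour of $x$. Let $\hat{S}_n(x):=k_{\mathrm{L}}(x)^{-1}\sum_{i=1}^{k_{\mathrm{L}}(x)}\mathbb{1}_{\{Y_{(i)}=1\}}$. Then the local-$k$-nearest neighbour ($k_{\mathrm{L}}$nn) classifier is defined to be
$$\hat{C}_n^{k_{\mathrm{L}}\mathrm{nn}}(x):=\begin{cases}1&\text{if }\hat{S}_n(x)\geq 1/2;\\0&\text{otherwise}.\end{cases}$$
Given $k\in\{1,\ldots,n\}$, let $k_0$ denote the constant function $k_0(x):=k$ for all $x\in\mathbb{R}^d$. Using $k_{\mathrm{L}}=k_0$, the definition above reduces to the standard $k$-nearest neighbour classifier ($k$nn), and we will write $\hat{C}_n^{k\mathrm{nn}}$ in place of $\hat{C}_n^{k_0\mathrm{nn}}$. For $\beta\in(0,1/2)$, let
$$K_\beta:=\bigl\{\lceil(n-1)^{\beta}\rceil,\lceil(n-1)^{\beta}\rceil+1,\ldots,\lfloor(n-1)^{1-\beta}\rfloor\bigr\}$$
denote a range of values of $k$ that will be of interest to us. Note that $K_{\beta_1}\supseteq K_{\beta_2}$, for $\beta_1<\beta_2$. Moreover, when $\beta$ is small, the restriction that $k\in K_\beta$ is only a slightly stronger requirement than the consistency conditions of Stone (1977), namely that $k\rightarrow\infty$, $k/n\rightarrow 0$ as $n\rightarrow\infty$.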
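The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `local_knn_classify`, the argument `k_fn` (playing the role of the measurable function that selects the number of neighbours at the test point) and the convention that a tied vote goes to class 1 are assumptions made here.

```python
import math


def local_knn_classify(train_x, train_y, x, k_fn):
    """Majority vote over the k_fn(x) nearest labelled points to x.

    train_x : list of feature tuples, train_y : list of 0/1 labels,
    k_fn    : function mapping a test point to a number of neighbours.
    """
    k = k_fn(x)
    # Indices of the training points, ordered by Euclidean distance to x.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    # Proportion of class-1 labels among the k nearest neighbours.
    s_hat = sum(train_y[i] for i in order[:k]) / k
    return 1 if s_hat >= 0.5 else 0


if __name__ == "__main__":
    xs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
    ys = [0, 0, 0, 1, 1, 1]
    # A constant k_fn recovers the standard k-nearest neighbour classifier.
    print(local_knn_classify(xs, ys, (0.2, 0.2), lambda x: 3))
    # A location-dependent k_fn gives a local classifier.
    print(local_knn_classify(xs, ys, (5.2, 5.2), lambda x: 1 if x[0] > 3 else 3))
```

Passing a constant function for `k_fn` gives the standard classifier, so the same routine covers both the global and local cases.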
3 Global risk of the $k$-nearest neighbour classifier
In this section we provide an asymptotic expansion for the global risk of the standard (non-local) $k$-nearest neighbour classifier. We first define the classes of data generating mechanisms over which our results will hold. Let denote the class of decreasing functions such that as , for every . Let denote the class of strictly increasing functions with as , for every . Recall from Section 2 that, to any distribution on , we associate conditional distributions , a regression function , marginal probabilities and a marginal distribution . Now, for , and , let denote the class of distributions on such that the probability measures and are absolutely continuous with respect to Lebesgue measure, with Radon–Nikodym derivatives and , respectively. Moreover, we assume that there exist versions of and for which the following conditions hold:
(A.1)
The marginal density of , namely , is continuous -almost everywhere and the set of continuity points of is open.
Thus , where we define . Let and, for , let . In our assumptions below, we will place further assumptions on , which ensure not only that this set is non-empty, but in fact that it is a -dimensional, orientable manifold.
(A.2)
The set is non-empty and . The function is twice continuously differentiable on , and
[TABLE]
for all . Furthermore, writing $p_r(x):=P_X\bigl(B_r(x)\bigr)$, we have for all and that
[TABLE]
(A.3)
We have that is twice differentiable on with . Moreover, , and given ,
[TABLE]
Finally, the function is continuous on , and
[TABLE]
for all .
(A.4)
We have .
Example 1.
Consider the distribution on for which and . In Appendix B, we show that with for any , , and provided that $M_0\geq\max\bigl\{2,\frac{\Gamma(3+d/2)}{8\pi^{d/2}}\bigr\}$, $\epsilon_0\leq\min\bigl(\frac{1}{10},2^{-d},\frac{2^{1/2}}{M_0}\bigr)$ and satisfies for all .
Asking for to have a Lebesgue density allows us to define the tail of the distribution as the region where is smaller than some threshold. Condition (A.1) ensures that for all sufficiently small, the set is a -dimensional manifold, and P_{X}(\mathcal{R}^{c})\leq\mathbb{P}\bigl{\{}\bar{f}(X)\leq\delta\bigr{\}}, where the latter quantity can be bounded using (A.4). The first part of (A.2) asks for a certain level of smoothness for in a neighbourhood of , and controls the behaviour of its first and second derivatives there relative to the original density. In particular, the greater degree of regularity asked of these derivatives in the tails of the marginal density in (1) allows us still to control the error of a Taylor approximation even in this region. The condition (1) is satisfied by all Gaussian and multivariate- densities, for example, for appropriate choices of and . The last part of (A.2) concerns the behaviour of the marginal feature distribution away from and is often referred to as the strong minimal mass assumption (e.g. Gadat et al., 2016). It requires that the mass of the marginal feature distribution is not concentrated in the neighbourhood of a point and is a rather weaker condition than we ask for on ; in particular, we do not insist that derivatives of exist in this region.
The condition in (A.3) asks for the class conditional densities, when weighted by their respective prior probabilities, to cross at an angle; in particular, this ensures that is a -dimensional, orientable manifold (cf. Section G.3). Moreover, the bounds on the first and second derivatives of in a neighbourhood of ensure that we can estimate sufficiently well. The last part of (A.3) asks that does not approach the critical value of too fast on the complement of . Assumption (A.4) is a simple moment condition that, together with (A.2), ensures that the constants and in (2) below are finite where needed.
Let denote the -dimensional volume form on (cf. Section G.3). Now let
[TABLE]
where
[TABLE]
We are now in a position to present our asymptotic expansion for the global excess risk of the standard -nearest neighbour classifier.
Theorem 1.
Fix and such that .
(i) Suppose that and . Then for each ,
[TABLE]
as , uniformly for .
(ii) Suppose that either , or, and . Then for each and each we have
[TABLE]
as , uniformly for .
Theorem 1 reveals an interesting dichotomy: when $d\geq 5$ and $\rho>4d/(d-4)$, the dominant contribution to the excess risk arises from the difficulty of classifying points close to the Bayes decision boundary $\mathcal{S}$. In such settings, the excess risk of the standard $k$-nearest neighbour classifier converges to zero at rate $O(n^{-4/(d+4)})$ when $k$ is chosen proportional to $n^{4/(d+4)}$. On the other hand, part (ii) shows that when either $d\leq 4$, or $d\geq 5$ and $\rho\leq 4d/(d-4)$, the dominant contribution to the excess risk when $k$ is large may come from the challenge of classifying points in the tails of the distribution. Indeed, Example 2 below provides one simple setting where this dominant contribution does come from the tails of the distribution.
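To see how a choice of $k$ of this type arises, the following worked calculation balances a variance-type term against a squared-bias-type term. It assumes that the leading terms of the expansion in part (i) take the form $B_1/k+B_2(k/n)^{4/d}$ with positive constants $B_1$ and $B_2$, as in the expansion of Samworth (2012); the function $g$ and the minimiser $k^*$ are notation introduced here for illustration, and $k$ is treated as a continuous variable.

```latex
\begin{align*}
g(k) &:= \frac{B_1}{k} + B_2\Bigl(\frac{k}{n}\Bigr)^{4/d}, \\
g'(k) &= -\frac{B_1}{k^2} + \frac{4B_2}{d}\,\frac{k^{4/d-1}}{n^{4/d}} = 0
  \quad\Longleftrightarrow\quad
  k^{(d+4)/d} = \frac{dB_1}{4B_2}\,n^{4/d}, \\
k^* &= \Bigl(\frac{dB_1}{4B_2}\Bigr)^{d/(d+4)} n^{4/(d+4)}, \\
g(k^*) &= \frac{d+4}{4}\,\frac{B_1}{k^*}
  = \frac{d+4}{4}\,B_1\Bigl(\frac{4B_2}{dB_1}\Bigr)^{d/(d+4)} n^{-4/(d+4)}.
\end{align*}
```

Under this assumption, any $k$ proportional to $n^{4/(d+4)}$ gives the same polynomial rate, and only the leading constant depends on the constant of proportionality.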
Example 2.
Suppose that the joint density of at is given by , where is a positive, twice continuously differentiable density with for . Suppose also that . Then the corresponding joint distribution belongs to provided is such that is sufficiently large, and is a sufficiently large constant ( and can be chosen arbitrarily). We prove in Appendix C that for every and ,
[TABLE]
as . Thus the rate of convergence in this example is at best , up to subpolynomial factors, whereas a rate of is achievable over any compact set.
The proof of Theorem 1, and indeed the proofs of Theorems 2 and 3 that follow in Section 4 below, depend crucially on Theorem 6.7 in Section 6. This result provides an asymptotic expansion for the excess risk of a general (local or global) $k$-nearest neighbour classifier over a region , where , defined in (7) below, shrinks to zero at a rate slow enough to ensure that concentrates around uniformly over . The intuition regarding the behaviour of the excess risk, then, is that when and is not close to , with high probability the nearest neighbours of are on the same side of as ; i.e. $\mathrm{sgn}\bigl(\eta(X_{(i)})-1/2\bigr)=\mathrm{sgn}\bigl(\eta(x)-1/2\bigr)$ for . The probability of classifying differently from the Bayes classifier can therefore be shown to be for every , using Hoeffding's inequality. Thus, the challenging regions for classification consist of neighbourhoods of , where is close to , together with , where we no longer enjoy the same nearest neighbour concentration properties. For the first of these regions, we exploit our smoothness assumptions to derive asymptotic expansions for the bias and variance of , uniformly over appropriate neighbourhoods of , and using a normal approximation, we can deduce an asymptotic expansion for the excess risk, uniformly over our classes of distributions and an appropriate set of nearest neighbour classifiers. For we are unable to bound the probability of classifying differently from the Bayes classifier with anything other than a trivial bound, but we can control using (A.4).
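The role of Hoeffding's inequality in this intuition can be checked numerically. The sketch below makes the simplifying assumption that each of the $k$ neighbour labels is an independent Bernoulli variable with the same success probability $p>1/2$ (in the paper the success probabilities vary with the neighbour), and compares the exact probability that the vote falls below one half with the bound $\exp\{-2k(p-1/2)^2\}$. The function names are hypothetical.

```python
import math


def exact_misvote_prob(k, p):
    """P(Bin(k, p) < k / 2): fewer than half of the k votes are for class 1."""
    return sum(
        math.comb(k, j) * p**j * (1.0 - p) ** (k - j)
        for j in range(k + 1)
        if 2 * j < k
    )


def hoeffding_bound(k, p):
    """Hoeffding upper bound on the same probability, valid for p > 1/2."""
    return math.exp(-2.0 * k * (p - 0.5) ** 2)


if __name__ == "__main__":
    for k in (10, 50, 200):
        print(k, exact_misvote_prob(k, 0.7), hoeffding_bound(k, 0.7))
```

Both quantities decay exponentially in $k$ for fixed $p$, which is why points that are well separated from the decision boundary contribute negligibly to the excess risk once $k$ grows polynomially in $n$.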
Finally in this section, we mention that Samworth (2012) obtained a similar expansion to that in Theorem 1(i) for a fixed distribution satisfying certain smoothness conditions. However, there the risk was computed only over a compact set, so the analysis failed to elucidate the important effects of tail behaviour on the excess risk. Another key difference is that here we define classes , and show that the remainder terms in our asymptotic expansion hold uniformly over these classes; the introduction of these classes further facilitates the study of corresponding minimax lower bounds in Section 5 below.
4 Local-$k$-nearest neighbour classifiers
In this section we explore the consequences of a local choice of $k$, compared with the global choice in Theorem 1. Initially, we consider an oracle choice, where $k$ is allowed to depend on the marginal feature density $\bar{f}$ (Section 4.1), but we then relax this to semi-supervised settings, where $\bar{f}$ can be estimated from unlabelled training data (Section 4.2).
4.1 Oracle classifier
Suppose for now that the marginal density is known. For and , let
[TABLE]
where the subscript O refers to the fact that this is an oracle choice of the function $k_{\mathrm{L}}$, since it depends on $\bar{f}$. This choice aims to balance the local bias and variance of the nearest neighbour vote at the test point.
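A density-dependent choice of this kind can be sketched as follows. The specific form used here, a number of neighbours proportional to $\{(n-1)\bar{f}(x)\}^{4/(d+4)}$ and clipped to lie between $(n-1)^{\beta}$ and $(n-1)^{1-\beta}$, is an assumption made for illustration, as are the function name `oracle_k` and the default values of `beta` and `B`; the displayed definition in the paper should be consulted for the exact form.

```python
import math


def oracle_k(density_at_x, n, d, beta=0.25, B=1.0):
    """Illustrative oracle number of neighbours at a test point.

    density_at_x : value of the (known) marginal feature density at x,
    n            : labelled sample size, d : feature dimension,
    beta, B      : tuning constants (hypothetical defaults).
    """
    lower = math.ceil((n - 1) ** beta)
    upper = math.floor((n - 1) ** (1.0 - beta))
    # Fewer neighbours where the density is small, more where it is large.
    raw = math.floor(B * (density_at_x * (n - 1)) ** (4.0 / (d + 4)))
    return max(lower, min(raw, upper))


if __name__ == "__main__":
    for f in (1e-6, 0.01, 0.1, 0.4):
        print(f, oracle_k(f, n=1001, d=2))
```

The clipping keeps the number of neighbours in a range where it still grows with $n$ but remains a vanishing fraction of the sample, whatever the value of the density.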
Theorem 2.
Fix and such that . For each ,
(i) if then for ,
[TABLE]
uniformly for as , where
[TABLE]
(ii) if and , then for every
[TABLE]
uniformly for , as .
Comparing Theorem 2(i) and Theorem 1(i), we see that, unlike for the global $k$-nearest neighbour classifier, we can guarantee an $O(n^{-4/(d+4)})$ rate of convergence for the excess risk of the oracle classifier, both in low dimensions ($d\leq 4$), and under a weaker condition on $\rho$ when $d\geq 5$. In particular, the condition on $\rho$ no longer depends on the dimension of the covariates. The guarantees in Theorem 2(ii) are also stronger than those provided by Theorem 1(ii) for any global choice of $k$. Examining the proof of Theorem 2, we find that the key difference with the proof of Theorem 1 is that we can now choose the region $\mathcal{R}_n$ (cf. the discussion of the proof of Theorem 1 in Section 3) to be larger.
4.2 The semi-supervised nearest neighbour classifier
Now consider the more realistic setting where the marginal density of is unknown, but where we have access to an estimate based on the unlabelled training set . Of course, many different techniques are available, but for simplicity, we focus here on a kernel method. Let be a bounded kernel with , , , and let . We further assume that , where is a polynomial and is a function of bounded variation. Now define a kernel density estimator of , given by
[TABLE]
Motivated by the oracle local choice of in (5), for and , let
[TABLE]
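The plug-in idea of this subsection, estimating the marginal density from the unlabelled sample and substituting the estimate into a density-dependent choice of the number of neighbours, can be sketched as below. The Gaussian kernel, the fixed bandwidth `h`, the clipped power-law form of the number of neighbours and all function names are assumptions made for this illustration rather than the paper's exact construction.

```python
import math


def gaussian_kde(unlabelled, x, h):
    """Kernel density estimate at x from unlabelled points (Gaussian kernel)."""
    d = len(x)
    norm = (2.0 * math.pi) ** (d / 2.0) * h**d
    total = sum(
        math.exp(-math.dist(z, x) ** 2 / (2.0 * h * h)) for z in unlabelled
    )
    return total / (len(unlabelled) * norm)


def semi_supervised_k(unlabelled, x, n, beta=0.25, B=1.0, h=0.5):
    """Number of neighbours at x, using the density estimate in place of
    the unknown marginal density (illustrative form only)."""
    d = len(x)
    f_hat = gaussian_kde(unlabelled, x, h)
    lower = math.ceil((n - 1) ** beta)
    upper = math.floor((n - 1) ** (1.0 - beta))
    raw = math.floor(B * (f_hat * (n - 1)) ** (4.0 / (d + 4)))
    return max(lower, min(raw, upper))


if __name__ == "__main__":
    # Unlabelled points concentrated on [-1, 1] in one dimension.
    unlabelled = [(0.1 * i,) for i in range(-10, 11)]
    print(semi_supervised_k(unlabelled, (0.0,), n=1001))  # high-density region
    print(semi_supervised_k(unlabelled, (8.0,), n=1001))  # tail region
```

In the tail, where the density estimate is close to zero, the rule falls back to the smallest admissible number of neighbours, which mirrors the motivation given in the text for using fewer neighbours in low-density regions.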
Our main result in this setting will require an additional smoothness condition on the marginal feature density in order to ensure that estimates it well. For , and , let denote the class of distributions on whose marginal distribution is absolutely continuous with respect to Lebesgue measure with Radon–Nikodym derivative satisfying and
[TABLE]
If , then we define to consist of distributions on whose marginal distribution is again absolutely continuous with Radon–Nikodym derivative satisfying , but we now ask that be differentiable, and that
[TABLE]
In Appendix B, we show that the distribution considered in Example 1 belongs to with provided that .
Theorem 3.
Fix , , and such that . Let , let and , and let for some .
(i) If and ,
[TABLE]
uniformly for , and , where was defined in Theorem 2(i).
(ii) if and , then for every ,
[TABLE]
uniformly for , and .
Examination of the proof of Theorem 3 reveals that the key property of our kernel estimator of is that there exists such that
[TABLE]
This observation would allow similar results to Theorem 3 to be proved for other versions of the semi-supervised nearest neighbour classifier, with alternative estimators of in the definition of , subject potentially to suitable modifications of the class . It is therefore not our intention to argue that the kernel density approach is superior to other methods of estimating the marginal density .
5 Minimax lower bounds
Our main minimax lower bound is the following:
Theorem 4.
Fix , , with increasing for sufficiently small , and . There exist , and , depending only on , such that for , , and with for all , writing , we can find such that for all and all , we have
[TABLE]
where is the unique solution to and the infimum is taken over all measurable functions . In particular, for every , there exists such that
[TABLE]
Remark 5.5.
The proof of this result also reveals that the lower bound holds if the classifier is allowed to depend on some unlabelled data or even the true marginal density .
Example 5.6.
Consider the case where , so . Then for , we have , so for ,
[TABLE]
Thus, if , then we can take in Theorem 4 to obtain a minimax lower bound of order ; on the other hand, if , then we can take to obtain a minimax lower bound of order , for every . Combining this result with Theorem 3, we see that for every , our semi-supervised local--nearest neighbour classifier attains the minimax optimal rate over the class up to polylogarithmic factors when and up to subpolynomial factors when .
6 Proofs
The proofs of Theorems 1, 2 and 3 rely on the general asymptotic expansion presented in Theorem 6.7 below. We begin with some further notation. Define the matrices and . Write
[TABLE]
and
[TABLE]
Here we have used the fact that the ordered labels are independent given , satisfying . Since takes values in it is clear that for all . Further, write for the unconditional expectation of . Recall also that $p_r(x)=P_X\bigl(B_r(x)\bigr)$.
6.1 A general asymptotic expansion
Let
[TABLE]
Further, for , let
[TABLE]
Recall that , and note that by Proposition G.17 in Appendix G, for , we can write
[TABLE]
Let
[TABLE]
and recall the definition of the function in (3).
Theorem 6.7.
Fix and such that . For sufficiently large, let $\mathcal{R}_n\subseteq\bigl\{x\in\mathbb{R}^d:\bar{f}(x)\geq\delta_n(x)\bigr\}$ be a $d$-dimensional manifold. Write for the topological boundary of , let , and let . For and define the class of functions
[TABLE]
Then for each and each with , we have
[TABLE]
as , where with
[TABLE]
and where $\limsup_{n\rightarrow\infty}\sup_{P\in\mathcal{P}_{d,\theta}}\sup_{k_{\mathrm{L}}\in K_{\beta,\tau}}|W_{n,2}|/P_X\bigl((\partial\mathcal{R}_n)^{\epsilon_n}\cap\mathcal{S}^{\epsilon_n}\bigr)\leq 1$.
Proof of Theorem 6.7.
First observe that
[TABLE]
The proof is presented in seven steps. We will see that the dominant contribution to the integral in (6.8) arises from a small neighbourhood about the Bayes decision boundary, i.e. the region . On , the nn classifier agrees with the Bayes classifier with high probability (asymptotically). More precisely, we show in Step 4 that
[TABLE]
for each , as . In Steps 1, 2 and 3, we derive the key asymptotic properties of the bias, conditional (on ) bias and variance of respectively. In Step 5 we show that the integral over can be decomposed into an integral over and one perpendicular to . Step 6 is dedicated to combining the results of Steps 1 - 5; we derive the leading order terms in the asymptotic expansion of the integral in (6.8). Finally, we bound the remaining error terms to conclude the proof in Step 7, which is presented in Appendix E. To ease notation, where it is clear from the context, we write in place of .
Step 1: Let , and for and , write . We show that
[TABLE]
uniformly for , , and . Write
[TABLE]
where we show in Step 7 that
[TABLE]
uniformly for , , and .
The density of at is given by
[TABLE]
where and denotes the probability that a random variable equals . Now let
[TABLE]
We show in Step 7 that
[TABLE]
for each , as . It follows from (11) and (13), together with the upper bound on in (A.3) that
[TABLE]
uniformly for , , , and . Similarly, using the upper bound on in (A.3),
[TABLE]
uniformly for , , , and . Hence, summing over , we see that
[TABLE]
uniformly for , , , and , where denotes the probability that a random variable is less than . Let be large enough that
[TABLE]
for . That this is possible follows from the fact that, for ,
[TABLE]
By a Taylor expansion of and assumption (A.2), for all , , and ,
[TABLE]
Hence, for , , and ,
[TABLE]
Now, for , , and ,
[TABLE]
where
[TABLE]
It follows from (6.8) that there exists such that, for all , , and ,
[TABLE]
Similarly, for all and ,
[TABLE]
Hence, by Bernstein's inequality, we have that for each ,
[TABLE]
and
[TABLE]
We conclude that
[TABLE]
where
[TABLE]
uniformly for , , and .
Step 2: Recall that . We show that
[TABLE]
uniformly for , , and . Recall that
[TABLE]
Let be large enough that for . Then for , , , and , we have by (A.2) and a very similar argument to that in (6.8) that
[TABLE]
Now suppose that are such that for all , but . We have by (A.2) that
[TABLE]
For each , choose
[TABLE]
Now, given , let , so that . Thus, if there are at least points among inside each of the balls , then for every there are at least of them in . Moreover by (6.8), (19) and (A.2),
[TABLE]
for all , and , say. Define $A_{k_{\mathrm{L}}}:=\bigl\{\|X_{(k_{\mathrm{L}})}(x)-x\|<\epsilon_{n}/2\ \text{for all}\ x\in\mathcal{R}_{n}\cup\mathcal{S}_{n}^{\epsilon_{n}}\bigr\}$. Then by a standard binomial tail bound (Shorack and Wellner, 1986, Equation (6), p. 440), for and any ,
[TABLE]
uniformly for and . Now, for ,
[TABLE]
It follows that
[TABLE]
as , uniformly for , , and . The claim (18) follows from (6.8) and (21).
Step 3: In this step, we emphasise the dependence of on by writing it as . We show that
[TABLE]
uniformly for , , and . We will write , considered as a random matrix, so that
[TABLE]
It follows from the Efron–Stein inequality (e.g. Boucheron, Lugosi and Massart, 2013, Theorem 3.1) that
[TABLE]
Recall the definition of given in (12). Now observe that, for and all we have that
[TABLE]
uniformly for , , and . The final inequality here follows from similar arguments to those used to bound . Now (22) follows from (6.8) and (6.8).
Step 4: We show that
[TABLE]
for each , as . First, by (A.3) and Proposition G.17 in Section G.2, there exists such that for every , and ,
[TABLE]
Hence, on the event , for and , all of the nearest neighbours of are on the same side of , so
[TABLE]
Now, conditional on , is the sum of independent terms. Therefore, by Hoeffding's inequality,
[TABLE]
for every . This completes Step 4.
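The concentration step above can be checked numerically. The following is a minimal sketch, assuming the standard one-sided form of Hoeffding's inequality for the mean of $k$ independent $[0,1]$-valued terms, $\exp(-2kt^2)$; the function names, the choice $k = 50$, $t = 0.1$ and the success probabilities are hypothetical and serve only as an illustration.

```python
import numpy as np

def hoeffding_bound(k, t):
    """One-sided Hoeffding bound exp(-2 k t^2) for the mean of k independent [0, 1]-valued terms."""
    return float(np.exp(-2.0 * k * t ** 2))

def empirical_tail(probs, t, reps=20000, seed=0):
    """Monte Carlo estimate of P(mean(Y) - mean(probs) >= t), with Y_i ~ Bernoulli(probs[i])."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    # Independent but not identically distributed Bernoulli labels, one row per replication.
    draws = rng.random((reps, probs.size)) < probs
    deviations = draws.mean(axis=1) - probs.mean()
    return float(np.mean(deviations >= t))

k, t = 50, 0.1
probs = np.linspace(0.3, 0.7, k)  # hypothetical regression function values at the k neighbours
print(empirical_tail(probs, t), "<=", hoeffding_bound(k, t))
```

The bound holds for any collection of success probabilities, which is what makes it usable conditionally on the features in the argument above.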
Step 5: It is now convenient to be more explicit in our notation, by writing . We also let
[TABLE]
Recall that and let
[TABLE]
We show that
[TABLE]
uniformly for and , and that for all ,
[TABLE]
Now by Proposition G.19 in Section G.2, for , the map is a diffeomorphism from to , where
[TABLE]
Furthermore, for such , and , . It follows from this and (G.3) in Section G.3 that
[TABLE]
where is defined in (57) in Section G.2, and as , uniformly for , and . Now observe that and . We deduce from this and the definition of that (25) holds.
Step 6: The last step in the main argument is to show that
[TABLE]
as , uniformly for and . First observe that
[TABLE]
uniformly for and . Now, write Note that, given , is the sum of independent Bernoulli variables, satisfying . Let be the standard normal distribution function, and let
[TABLE]
We can write
[TABLE]
where we show in Step 7 that
[TABLE]
uniformly for and . Then, substituting , we see that
[TABLE]
uniformly for , and . The conclusion follows by integrating with respect to over .
Step 7: It remains to bound the error terms and ; these bounds are presented in Appendix E.
6.2 Proof of Theorem 1
Proof 6.9 (Proof of Theorem 1).
Let , and note that since is constant, we have that $c_{n}=\ell\bigl(k/(n-1)\bigr)$, and Now let
[TABLE]
and observe that by Berrett et al. (2019, Lemma 10(i)), for ,
[TABLE]
It follows that we can find large enough that is non-empty for all , and , so that, by Assumption (A.1), for it is an open subset of , and therefore a -dimensional manifold. Let ,
[TABLE]
and
[TABLE]
Recalling the definition of in (8), for , we may apply Theorem 6.7 with for all to deduce that
[TABLE]
where and where
[TABLE]
We now show that, under the conditions of part (i), and are well approximated by integrals over the whole of the manifold , and that these integrals are uniformly bounded. Given , define $\epsilon_{0}(x_{0}):=\min\bigl\{1,\frac{\epsilon_{0}\log 2}{2d},\frac{1}{4\ell(\bar{f}(x_{0}))}\bigr\}$. Then for any we have by (A.2) and Cauchy–Schwarz that
[TABLE]
Moreover, writing for the eigenvalues of the matrix defined in (57), for , we have
[TABLE]
so . Hence, for any there exists such that, writing , by (G.3), Hölder's inequality and (A.4), we have
[TABLE]
Now, by Assumption (A.3), for any ,
[TABLE]
Moreover, writing ,
[TABLE]
By Assumptions (A.2), (A.3), (6.9) and the fact that , we have, writing , that
[TABLE]
Similarly,
[TABLE]
A similar argument shows that $\gamma_{n}(k)=O\bigl(1/k+(k/n)^{4/d}\bigr)$, uniformly for and .
Finally, we bound $P_{X}\bigl((\partial\mathcal{R}_{n})^{\epsilon_{n}}\cap\mathcal{S}^{\epsilon_{n}}\bigr)$ and . Suppose that . Then there exists with . By Assumption (A.2) we have that
[TABLE]
Thus there exists such that for . By the moment assumption in (A.4) and Hölder's inequality, observe that for any , , and ,
[TABLE]
uniformly for . Moreover,
[TABLE]
so the same bound (6.9) applies. Since and was arbitrary, this completes the proof of part (i).
For part (ii), in contrast to part (i), the dominant contribution to the excess risk could now arise from the tail of the distribution. First, as in part (i), we have , uniformly for and . Furthermore, using Assumption (A.3), (6.9) and the fact that , we see that, for any ,
[TABLE]
for every $\epsilon\in\bigl(\epsilon^{\prime},\rho/(\rho+d)\bigr]$, uniformly for and , where the final conclusion follows from the fact that is bounded. We can also bound by the same argument, so the result follows in the same way as in part (i).
6.3 Proofs of results from Section 4
Proof 6.10 (Proof of Theorem 2).
Recall that
[TABLE]
and define
[TABLE]
where $c_{n}:=\sup_{x_{0}\in\mathcal{S}:\bar{f}(x_{0})\geq k_{\mathrm{O}}(x_{0})/(n-1)}\ell\bigl(\bar{f}(x_{0})\bigr)$. For let
[TABLE]
Then there exists such that for we have $\mathcal{R}_{n}\subseteq\bigl\{x\in\mathbb{R}^{d}:\bar{f}(x)\geq\delta_{n,\mathrm{O}}(x)\bigr\}$ for all and , and by Assumption (A.1) and (27), we then have that is a -dimensional manifold. There exists such that for all , , and we have that $k_{\mathrm{O}}(x)=\bigl\lfloor B\bigl\{\bar{f}(x)(n-1)\bigr\}^{4/(d+4)}\bigr\rfloor$. By (A.2), we therefore have that for some (which does not depend on or ) with .
By a similar argument to that in (29), there exists such that for , , and , we have . But, by Markov's inequality and Hölder's inequality, for and any ,
[TABLE]
Thus, if , then we can choose and in (6.10) to conclude that
[TABLE]
Moreover, writing
[TABLE]
by very similar arguments to those given in the proof of Theorem 1, and as , both uniformly for and . The proof of part (i) therefore follows from Theorem 6.7.
On the other hand, if , then choosing both and to be sufficiently small, we find from (6.10) that
[TABLE]
for every , uniformly for and . After another application of Theorem 6.7, this proves part (ii).
Proof 6.11 (Proof of Theorem 3).
We prove parts (i) and (ii) of the theorem simultaneously, by appealing to the corresponding arguments in the proof of Theorem 2. First, as in the proof of Theorem 2, for $\alpha\in\bigl((1+d/4)\beta,1\bigr)$, we define and introduce the following class of functions: for , let
[TABLE]
Let . We first show that with high probability. For ,
[TABLE]
Now
[TABLE]
To bound the first term in (32), by Giné and Guillou (2002, Corollary 2.2), there exist , such that
[TABLE]
for all $s\in\Bigl[\frac{C\|\bar{f}\|_{\infty}^{1/2}R(K)^{1/2}}{A^{d/2}}\log^{1/2}\Bigl(\frac{\|K\|_{\infty}m^{d/(2(d+2\gamma))}}{\|\bar{f}\|_{\infty}^{1/2}A^{d/2}R(K)^{1/2}}\Bigr),\frac{C\|\bar{f}\|_{\infty}R(K)m^{\gamma/(d+2\gamma)}}{\|K\|_{\infty}}\Bigr]$ and .
Recall that for , we have and also satisfies the lower bound in (27). Hence, by applying the bound in (33) with , since , we have that there exists , not depending on or such that for ,
[TABLE]
for all , uniformly for . For the second term in (32), by a Taylor expansion, we have that for all and ,
[TABLE]
It follows that, writing , we have
[TABLE]
for all .
Now, for , let
[TABLE]
Let $c_{n}:=\sup_{x_{0}\in\mathcal{S}:\bar{f}(x_{0})\geq k_{\tilde{f}}(x_{0})/(n-1)}\ell\bigl(\bar{f}(x_{0})\bigr)$, and let
[TABLE]
Then there exists such that for and , we have $\mathcal{R}_{n}\subseteq\bigl\{x\in\mathbb{R}^{d}:\bar{f}(x)\geq\delta_{n,\tilde{f}}(x)\bigr\}$ and . We can therefore apply Theorem 6.7 (similarly to the application in the proof of Theorem 2) to conclude that for every ,
[TABLE]
uniformly for and , where was defined in the proof of Theorem 2. The proof of both parts (i) and (ii) is now completed by following the relevant steps in the proof of Theorem 2.
Acknowledgements
The authors are grateful to the anonymous reviewers, whose constructive comments helped to improve the paper. We would also like to thank the Isaac Newton Institute for Mathematical Sciences for support and hospitality during the programme "Statistical Scalability" when work on this paper was undertaken. This work was supported by EPSRC grant number EP/R014604/1.
Appendix A The relationship between our classes and the margin assumption
Recall from Mammen and Tsybakov (1999) that a distribution on with marginal on and regression function satisfies a margin assumption with parameter if there exists such that
[TABLE]
for all sufficiently small . The following lemma clarifies the relationship between our classes and the margin assumption.
Lemma A.12.
Let for some . Then satisfies a margin assumption with parameter .
Proof A.13.
By the final part of (A.3), we have
[TABLE]
Now, by Proposition G.17 in Section G.2, for , there exists and such that . Thus, by a Taylor expansion,
[TABLE]
We deduce as in Step 5 of the proof of Theorem 6.7 that there exists such that for all ,
[TABLE]
where the final bound follows from (6.9) in the main text. For the second term in (A.13), we exploit the fact that since , there exists such that for all . Hence, arguing as in (6.9) in the main text, we find that
[TABLE]
The result follows from (A.13), (A.13) and (A.13).
Appendix B Example 1 from the main text
Recall that we consider the distribution on for which and . Since is continuous on all of , it is clear that (A.1) is satisfied.
Now, and clearly is non-empty. For all we have that . Since we have that and thus is twice continuously differentiable on . Differentiating twice on , we have that and
[TABLE]
Thus, for , we have . We also have that, for any ,
[TABLE]
so that for any . Finally for (A.2) we consider the cases and separately. If then, for , at least a proportion of the ball is closer to the origin than , and thus has larger density. This gives us that, for such and , . When and we instead have that
[TABLE]
We now turn to condition (A.3). First, for any we have that . For we have that and . Since is constant on it is trivially true that
[TABLE]
for any . Now for we have that
[TABLE]
Since the support of is equal to , we have that , so (A.4) is satisfied.
We finally check (A.5) to show that for . First, it is clear that . Now, for any we have that
[TABLE]
Appendix C Example 2 from the main text
Proof C.14 (Proof of claim in Example 2).
Fix and , let
[TABLE]
and for , let
[TABLE]
Now, for and ,
[TABLE]
where , ,
[TABLE]
Therefore, there exists such that and for all , and . It follows by Bernstein's inequality that for every .
Now, for , and , we have that
[TABLE]
Our next observation is that for and such that , we have that , where the pairs are independent and identically distributed, and then is a reordering such that . Here and . Writing we therefore have by Hoeffding's inequality that, for , and ,
[TABLE]
for all , uniformly for . Writing for the marginal distribution of , we deduce that
[TABLE]
for all , uniformly for . We conclude that for every ,
[TABLE]
uniformly for . The claim (4) follows from this together with Theorem 1(ii).
Appendix D Proof of Theorem 4
Proof D.15 (Proof of Theorem 4).
For an integer and , define a grid on by
[TABLE]
Now, for , let be the closest point to among those in (if there are multiple points, pick the one that is smallest in the lexicographic ordering). Let and define closed Euclidean balls in of radius , where the th ball is centered at the th grid point in the lexicographic ordering.
Writing for the closest integer to (where we round half-integers to the nearest even integer), define the "saw-tooth" function , by $\eta_{0}(x):=3/8+\bigl|x_{1}+1/4-[x_{1}+1/4]\bigr|/2$, for . Further, for , set $u(x):=\frac{\alpha_{0}g^{-1}(1/q)}{q^{2}}\bigl(1/4-q^{2}\|x-n_{q}(x)\|^{2}\bigr)^{4}$, where .
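To make the saw-tooth construction concrete, here is a small sketch that evaluates $\eta_{0}$ as a function of the first coordinate only. It relies on the fact that `numpy.rint` rounds half-integers to the nearest even integer, matching the rounding convention stated above; the function name `eta0` and the evaluation grid are illustrative choices.

```python
import numpy as np

def eta0(x1):
    """Saw-tooth function 3/8 + |x1 + 1/4 - [x1 + 1/4]| / 2 of the first coordinate x1."""
    shifted = np.asarray(x1, dtype=float) + 0.25
    nearest = np.rint(shifted)  # closest integer; half-integers go to the nearest even integer
    return 0.375 + np.abs(shifted - nearest) / 2.0

grid = np.linspace(-1.0, 1.0, 9)  # illustrative grid of first coordinates
print(np.column_stack([grid, eta0(grid)]))
```

The function takes values in $[3/8, 5/8]$, has period 1 in $x_1$, and equals $1/2$ at $x_1 = 0$, so the first-coordinate hyperplane through the origin is part of the level set on which the perturbation $u$ acts.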
For , we now define the distribution on by setting the regression function to be , for , , and setting , otherwise. To define the marginal distribution on induced by , which will be the same for each , we first define the boxes and for and some to be chosen later. We further define a modified bump function by
[TABLE]
where denotes the standard normal distribution function. For we then set
[TABLE]
for some to be specified later. Here, in the definition of is chosen such that , and we note that
[TABLE]
so .
Let
[TABLE]
We show below that for all and satisfying the conditions of the theorem.
Letting denote expectation with respect to and writing for , we have that, for any classifier ,
[TABLE]
Now let for , and , and define the distribution on by , for and otherwise (the marginal distribution on is again taken to be ). We write to denote expectation with respect to .
For and define
[TABLE]
By the Radon–Nikodym theorem, we have that
[TABLE]
Now fix , and writing and as shorthand, observe that
[TABLE]
Here we used the fact that , so , and that the minimum is attained by taking for ; it is interesting to note that this remains the optimal classifier even if is known. Moreover, whenever , we have , and when , we have . It follows that
[TABLE]
where and .
Now, observe that
[TABLE]
and
[TABLE]
Moreover, using the fact that for , we have that
[TABLE]
We now turn to finding a lower bound for the integral in (D.15). First, we observe that $\mathrm{sgn}\bigl(\tilde{u}(x)-x_{1}\bigr)=\mathrm{sgn}\bigl(1/2-\tilde{\eta}(x)\bigr)$, and moreover for and , we have that
[TABLE]
Thus
[TABLE]
Furthermore, for , writing , we have that if and only if
[TABLE]
which is satisfied if
[TABLE]
Now is real if . Moreover, for $x_{1}\in\bigl[0,\frac{\alpha_{0}g^{-1}(1/q)}{2^{14}q^{2}}\bigr]$. We also require the observation that when $x_{1}\in\bigl[0,\frac{\alpha_{0}g^{-1}(1/q)}{2^{14}q^{2}}\bigr]$ and . Hence
[TABLE]
We have therefore shown that, for ,
[TABLE]
where . It follows that if we set
[TABLE]
and choose to satisfy , then
[TABLE]
It remains to show that belongs to the desired classes for each . First note that
[TABLE]
Condition (A.1) is satisfied by by construction. To verify the minimal mass assumption, we take , and observe that when ,
[TABLE]
as required. It follows that (A.2) is satisfied for such and for any .
The main condition to check is (A.3). For , consider
[TABLE]
Then
[TABLE]
and
[TABLE]
From these calculations, we see that each is twice continuously differentiable on , with for all and . We have that, when ,
[TABLE]
Hence, using the fact that is increasing for sufficiently small , we have that for sufficiently large ,
[TABLE]
Now consider the case where and with , so that . Let denote the closest point in to on the line segment joining to , and similarly let denote the closest point in to on the same line segment. Then , so, by (D.15),
[TABLE]
We therefore deduce that
[TABLE]
For the final part of (A.3), we note that
[TABLE]
Finally, we check the moment condition in (A.4). First,
[TABLE]
say. We conclude that there exists such that for and any , we have for with any , , , any with and any .
Finally, we note that and
[TABLE]
Hence for .
Appendix E Proof of Theorem 5 (continued)
Proof E.16 (Proof of Theorem 6.7, Step 7).
To complete the proof of Theorem 6.7, it remains to bound the error terms and .
To bound : We have
[TABLE]
By a Taylor expansion and (A.3), for all , and ,
[TABLE]
Hence
[TABLE]
Now, by similar arguments to those leading to (6.8), we have that
[TABLE]
uniformly for , , and . Moreover, for every ,
[TABLE]
uniformly for , , and , by (16) in Step 1. For the remaining terms, note that
[TABLE]
Let $t_{0}=t_{0}(x):=5^{2/\rho}(1+2^{\rho-1})^{2/\rho}\bigl(M_{0}+\|x\|^{\rho}\bigr)^{2/\rho}$. Then, for , we have
[TABLE]
It follows by Bennett's inequality that for ,
[TABLE]
But, when and ,
[TABLE]
We deduce that for every ,
[TABLE]
Moreover, by Bernstein's inequality, for every ,
[TABLE]
We conclude from (6.8), (E.16), (40), (41), (E.16), (43) and (44), together with Jensen's inequality to deal with the third term on the right-hand side of (E.16), that (10) holds. With only simple modifications, we have also shown (13), which bounds .
To bound : Write
[TABLE]
Now by a non-uniform version of the Berry–Esseen theorem (Paditz, 1989, Theorem 1), for every and ,
[TABLE]
Let
[TABLE]
where
[TABLE]
In the following we integrate the bound in (45) over the regions and separately. Define the event
[TABLE]
so that, by very similar arguments to those used to bound in Step 2, we have for every , uniformly for and . It follows by (45) and Step 2 that there exists such that for all , and ,
[TABLE]
By Step 1, there exists such that for , , , and ,
[TABLE]
Thus for , , , and , we have that
[TABLE]
It follows by (45), (E.16) and Step 3 that, for ,
[TABLE]
uniformly for , and . We conclude from (E.16) and (E.16) that , uniformly for and .
To bound : Let . Write
[TABLE]
where
[TABLE]
and
[TABLE]
To bound : We again deal with the regions and separately. First let . Writing for the standard normal density, and using the facts that , that and have the same sign, and that , we have
[TABLE]
uniformly for , and . Note that for and , we have when and that
[TABLE]
Thus by (E.16), (E.16), (E.16) and Step 3, for and ,
[TABLE]
uniformly for , and .
To bound : Let
[TABLE]
Given small enough that , by Step 1 there exists such that for , , , and ,
[TABLE]
By decreasing and increasing if necessary, it follows that
[TABLE]
for all , , , and satisfying $2\epsilon u(x_{0})\ell\bigl(\bar{f}(x_{0})\bigr)\|\dot{\eta}(x_{0})\|\leq|\bar{\theta}(x_{0},t)|$. Substituting , it follows that there exists such that for all , and ,
[TABLE]
The combination of (E.16) and (E.16) yields the desired error bound on in (26), uniformly for , , and therefore completes the proof.
Appendix F Empirical analysis
In this section, we compare the nn and nn classifiers, introduced in Section 4 of the main text, with the standard nn classifier studied in Section 3 of the main text. We investigate three settings that reflect the differences between the main results in these sections.
- Setting 1: is the distribution of independent components; whereas is the distribution of independent components.
- Setting 2: is the distribution of independent components; is the distribution of independent components, the first having a distribution and the remainder having a distribution.
- Setting 3: is the distribution of independent standard Cauchy components; is the distribution of independent components, the first being standard Cauchy and the remainder standard normal.
The corresponding marginal distribution in Setting 1 satisfies (A.4) for every . Hence, for the standard -nearest neighbour classifier when , we are in the setting of Theorem 1(i), while for , we can only appeal to Theorem 1(ii). On the other hand, for the local--nearest neighbour classifiers, the results of Theorems 2(i) and 3(i) apply for all dimensions, and we can expect the excess risk to converge to zero at rate . In Setting 2, (A.4) holds for , but not for . Thus, for the standard -nearest neighbour classifier, we are in the setting of Theorem 1(ii) for , whereas Theorems 2(i) and 3(i) again apply for all dimensions for the local classifiers. Finally, in Setting 3, (A.4) does not hold for any , and only the conditions of Theorems 1(ii), 2(ii) and 3(ii) apply.
For the standard nn classifier, we use 5-fold cross validation to choose , based on a sequence of equally-spaced values between 1 and of length at most 40. For the oracle classifier, we set
[TABLE]
where was again chosen via 5-fold cross validation, but based on a sequence of 40 equally-spaced points between (corresponding to the 1-nearest neighbour classifier) and . Similarly, for the semi-supervised classifier, we set
[TABLE]
where was chosen analogously to , and where is the -dimensional kernel density estimator constructed using a truncated normal kernel and bandwidths chosen via the default method in the R package ks (Duong, 2015). In practice, we estimated by the maximum value attained on the unlabelled training set.
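The semi-supervised procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the local choice of $k$ takes the form $\lfloor B\{n\hat{f}(x)/\max\hat{f}\}^{4/(d+4)}\rfloor$ clipped to $[1, n/2]$ (the clipping and normalisation are assumptions), a plain Gaussian product kernel with bandwidth $m^{-1/(d+4)}$ replaces the truncated normal kernel and the `ks` bandwidth selector, $B$ is fixed at 1 instead of being cross-validated, and the two-class Gaussian model is hypothetical.

```python
import numpy as np

def kde_gaussian(sample, query, bandwidth):
    """Gaussian product-kernel density estimate built from `sample`, evaluated at rows of `query`."""
    m, d = sample.shape
    diffs = (query[:, None, :] - sample[None, :, :]) / bandwidth
    kernel = np.exp(-0.5 * np.sum(diffs ** 2, axis=2))
    return kernel.sum(axis=1) / (m * (bandwidth * np.sqrt(2.0 * np.pi)) ** d)

def local_k(density, n, d, B, max_density):
    """Assumed form: floor(B * {n * f(x) / max f}^{4/(d+4)}), clipped to [1, n // 2]."""
    raw = np.floor(B * (n * density / max_density) ** (4.0 / (d + 4.0)))
    return np.clip(raw, 1, max(1, n // 2)).astype(int)

def local_knn_predict(X_train, y_train, X_test, k_per_point):
    """Assign class 1 if at least half of the k(x) nearest labels equal 1, and class 2 otherwise."""
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    preds = np.empty(len(X_test), dtype=int)
    for i, k in enumerate(k_per_point):
        preds[i] = 1 if np.mean(y_train[order[i, :k]] == 1) >= 0.5 else 2
    return preds

rng = np.random.default_rng(0)
n, m, n_test, d = 200, 1000, 500, 2

def draw(size):
    """Hypothetical model: class 1 is N(0, I), class 2 is N((1.5, ..., 1.5), I), equal priors."""
    y = rng.integers(1, 3, size=size)
    X = rng.normal(size=(size, d)) + 1.5 * (y[:, None] == 2)
    return X, y

X_train, y_train = draw(n)
X_unlab, _ = draw(m)            # labels of the unlabelled sample are discarded
X_test, y_test = draw(n_test)

h = m ** (-1.0 / (d + 4))       # assumed bandwidth rule
f_max = kde_gaussian(X_unlab, X_unlab, h).max()  # sup-norm estimated on the unlabelled set
k_test = local_k(kde_gaussian(X_unlab, X_test, h), n, d, B=1.0, max_density=f_max)
preds = local_knn_predict(X_train, y_train, X_test, k_test)
print("test error:", np.mean(preds != y_test))
```

The design point illustrated here is that fewer neighbours are used where the estimated density is low, so that tail regions are not averaged over distant labelled points.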
In each of the three settings above, we generated a training set of size in dimensions , an unlabelled training set of size 1000, and a test set of size 1000. In Table 1, we present the sample mean and standard error (in subscript) of the risks computed from 1000 repetitions of each experiment. Further, we present estimates of the regret ratios, given by
[TABLE]
for which the standard errors given are estimated via the delta method. From Table 1, we see an improvement in performance from the oracle and semi-supervised classifiers in 22 of the 27 experiments, comparable performance in three experiments, and two experiments in which the standard nn classifier was the best of the three classifiers considered. In those latter two cases, the theoretical improvement expected for the local classifiers is small; for instance, when in Setting 2, the excess risk for the local classifiers converges at rate , while the standard -nearest neighbour classifier can attain a rate at least as fast as for every . It is therefore perhaps unsurprising that we require the larger sample size of for the local classifiers to yield an improvement in this case. The semi-supervised classifier exhibits similar performance to the oracle classifier in all settings, though some deterioration is noticeable in higher dimensions, where it is harder to construct a good estimate of from the unlabelled training data.
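A delta-method standard error for a ratio of excess risks can be computed as in the sketch below. It assumes the regret ratio has the form (mean local risk minus Bayes risk) divided by (mean standard risk minus Bayes risk), and that the two risks are computed on the same repetitions, so that their covariance enters the calculation. The function name and the synthetic inputs are hypothetical; none of the numbers are taken from Table 1.

```python
import numpy as np

def regret_ratio(risk_local, risk_knn, bayes_risk):
    """Ratio of mean excess risks and its delta-method standard error."""
    a = np.asarray(risk_local, dtype=float) - bayes_risk
    b = np.asarray(risk_knn, dtype=float) - bayes_risk
    a_bar, b_bar = a.mean(), b.mean()
    # Covariance matrix of the pair of sample means (paired repetitions).
    cov_means = np.cov(a, b, ddof=1) / a.size
    # Gradient of the map (a, b) -> a / b evaluated at the sample means.
    grad = np.array([1.0 / b_bar, -a_bar / b_bar ** 2])
    variance = max(float(grad @ cov_means @ grad), 0.0)  # guard against rounding below zero
    return float(a_bar / b_bar), float(np.sqrt(variance))

rng = np.random.default_rng(1)
knn_risk = 0.25 + 0.02 * rng.standard_normal(1000)               # synthetic risks
local_risk = knn_risk - 0.01 + 0.005 * rng.standard_normal(1000)  # synthetic risks
print(regret_ratio(local_risk, knn_risk, bayes_risk=0.2))
```

A ratio below 1 indicates that the local classifier has smaller estimated excess risk than the standard classifier.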
Appendix G An introduction to differential geometry, tubular neighbourhoods and integration on manifolds
The purpose of this section is to give a brief introduction to the ideas from differential geometry, specifically tubular neighbourhoods and integration on manifolds, which play an important role in our analysis of misclassification error rates, but which we expect are unfamiliar to many statisticians. For further details and several of the proofs, we refer the reader to the many excellent texts on these topics, e.g. Guillemin and Pollack (1974), Gray (2004).
G.1 Manifolds and regular values
Recall that if is an arbitrary subset of , we say is differentiable if for each , there exists an open subset containing and a differentiable function such that for . If is also a subset of , we say is a diffeomorphism if is bijective and differentiable and if its inverse is also differentiable. We then say is an -dimensional manifold if for each , there exist an open subset , a neighbourhood of in and a diffeomorphism . Such a diffeomorphism is called a local parametrisation of around , and we sometimes suppress the dependence of and on . It turns out that the specific choice of local parametrisation is usually not important, and properties of the manifold are well-defined regardless of the choice made.
Let be an -dimensional manifold and let be a local parametrisation of around , where is an open subset of . Assume that for convenience. The tangent space to at is defined to be the image of the derivative of at [math]. Thus is the -dimensional subspace of whose parallel translate is the best affine approximation to through , and is well-defined as a map from to . If is differentiable, we define the derivative of at by , where .
In practice, it is usually rather inefficient to define manifolds through explicit diffeomorphisms. Instead, we can often obtain them as level sets of differentiable functions. Suppose that is a manifold and is differentiable. We say is a regular value for if for every for which . If is a regular value of , then is a -dimensional submanifold of (Guillemin and Pollack, 1974, p. 21).
G.2 Tubular neighbourhoods of level sets
For any set and , we call the -neighbourhood of . In circumstances where is a -dimensional manifold defined by the level set of a continuously differentiable function with non-vanishing derivative on , the set is often called a tubular neighbourhood, and for all and . We therefore have the following useful representation of the -neighbourhood of in terms of points on and a perturbation in a normal direction.
Proposition G.17.
Let , suppose that is non-empty, and suppose further that is continuously differentiable on for some , with for all , so that is a -dimensional manifold. Then
[TABLE]
Proof G.18.
For any and , we have . On the other hand, suppose that . Since is closed, there exists such that for all . Rearranging this inequality yields that, for ,
[TABLE]
Let be an open subset of and be a local parametrisation of around , where without loss of generality we assume . Let be given and let be such that . Then for sufficiently small we have , so by (54),
[TABLE]
Letting we see that . Since was arbitrary and , we therefore have that for all . Moreover, for all , so , which yields the result.
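As a concrete instance of the normal-direction representation, consider a hypothetical function $\eta(x) = 1/2 + (1 - \|x\|^{2})/4$ on $\mathbb{R}^{2}$, whose level set at $1/2$ is the unit circle. The sketch below recovers, for a point $x$ near the circle, a base point $x_{0}$ on the level set and a signed distance $t$ such that $x = x_{0} + t\,\dot{\eta}(x_{0})/\|\dot{\eta}(x_{0})\|$. The function and the test point are assumptions made for illustration.

```python
import numpy as np

def eta(x):
    """Hypothetical regression function; the level set {eta = 1/2} is the unit circle."""
    x = np.asarray(x, dtype=float)
    return 0.5 + (1.0 - np.dot(x, x)) / 4.0

def eta_dot(x):
    """Gradient of eta, which is non-vanishing on the unit circle."""
    return -np.asarray(x, dtype=float) / 2.0

def normal_coordinates(x):
    """Return (x0, t) with eta(x0) = 1/2 and x = x0 + t * eta_dot(x0) / ||eta_dot(x0)||."""
    x = np.asarray(x, dtype=float)
    radius = np.linalg.norm(x)   # assumed non-zero
    x0 = x / radius              # closest point on the unit circle
    t = 1.0 - radius             # signed distance along the unit normal -x0
    return x0, t

x = np.array([0.9, 0.3])
x0, t = normal_coordinates(x)
unit_normal = eta_dot(x0) / np.linalg.norm(eta_dot(x0))
print(x0, t, np.allclose(x0 + t * unit_normal, x))
```

Here $|t|$ equals the Euclidean distance from $x$ to the circle, so $x$ lies in the $\epsilon$-neighbourhood of the level set exactly when $|t| < \epsilon$.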
In fact, under a slightly stronger condition on , we have the following useful result:
Proposition G.19.
Let be a -dimensional manifold in , suppose that satisfies the condition that is non-empty. Suppose further that there exists such that is twice continuously differentiable on . Assume that for all . Define by
[TABLE]
If
[TABLE]
then is injective. In fact is a diffeomorphism, with
[TABLE]
for and , where
[TABLE]
Proof G.20.
Assume for a contradiction that there exist distinct points and with such that
[TABLE]
Then
[TABLE]
By Taylor's theorem and (58),
[TABLE]
contradicting the hypothesis (55).
To show that is a diffeomorphism, let be given and let be a local parametrisation around with . Define by , and by . Finally, define the Gauss map by . Then, for and ,
[TABLE]
where is given in (56).
To show that is invertible, note that for and ,
[TABLE]
where the final inequality follows from (55). Then, since $v_{1}+\frac{t}{\|\dot{\eta}(x_{0})\|}\Bigl(I-\frac{\dot{\eta}(x_{0})\dot{\eta}(x_{0})^{T}}{\|\dot{\eta}(x_{0})\|^{2}}\Bigr)\ddot{\eta}(x_{0})v_{1}$ and are orthogonal, it follows that is indeed invertible. The inverse function theorem (e.g. Guillemin and Pollack, 1974, p. 13) then gives that is a local diffeomorphism, and moreover, by Guillemin and Pollack (1974, Exercise 5, p. 18) and the fact that is bijective, we can conclude that is in fact a diffeomorphism.
G.3 Forms, pullbacks and integration on manifolds
Let be a (real) vector space of dimension . We say is a -tensor on if it is -linear, and write for the set of -tensors on . If and , we define their tensor product by
[TABLE]
Let denote the set of permutations of . If and , we can define by for . We say is alternating if for all transpositions . The set of alternating -tensors on , denoted , is a vector space of dimension . The function is defined by
[TABLE]
where denotes the sign of the permutation . If and , we define their wedge product by
[TABLE]
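The alternation and wedge-product operations can be illustrated with tensors represented as plain functions of vectors. The sketch assumes the convention $\mathrm{Alt}(T) = \frac{1}{p!}\sum_{\pi}(-1)^{\pi}T^{\pi}$ and $T \wedge S = \mathrm{Alt}(T \otimes S)$, as in Guillemin and Pollack (1974); other texts normalise differently, and all function names here are illustrative.

```python
from itertools import permutations
from math import factorial

def sign(perm):
    """Sign of a permutation given as a tuple of 0-based indices."""
    size = len(perm)
    inversions = sum(perm[i] > perm[j] for i in range(size) for j in range(i + 1, size))
    return -1 if inversions % 2 else 1

def alt(tensor, p):
    """Alt(T)(v_1, ..., v_p) = (1/p!) * sum over pi of sign(pi) * T(v_pi(1), ..., v_pi(p))."""
    def alternated(*vectors):
        total = sum(sign(pi) * tensor(*(vectors[i] for i in pi))
                    for pi in permutations(range(p)))
        return total / factorial(p)
    return alternated

def tensor_product(T, p, S, q):
    """(T x S)(v_1, ..., v_{p+q}) = T(v_1, ..., v_p) * S(v_{p+1}, ..., v_{p+q})."""
    def product(*vectors):
        return T(*vectors[:p]) * S(*vectors[p:p + q])
    return product

def wedge(T, p, S, q):
    """Wedge product under the assumed convention Alt(T x S)."""
    return alt(tensor_product(T, p, S, q), p + q)

def dx1(v):
    return v[0]

def dx2(v):
    return v[1]

e1, e2 = (1.0, 0.0), (0.0, 1.0)
form = wedge(dx1, 1, dx2, 1)
print(form(e1, e2), form(e2, e1), form(e1, e1))
```

Under this normalisation the resulting 2-tensor is alternating: swapping its two arguments changes the sign, and it vanishes on repeated arguments.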
If is another (real) vector space and is a linear map, we define the transpose of by
[TABLE]
Let be a manifold. A -form on is a function which assigns to each an element . If is a -form on and is a -form on , we can define their wedge product by . For , let denote the coordinate function . These functions induce -forms , given by (so in our previous notation). Letting , for , we write
[TABLE]
It turns out (Guillemin and Pollack, 1974, p. 163) that any -form on an open subset of can be uniquely expressed as
[TABLE]
where each is a real-valued function on .
Recall that the set of all ordered bases of a vector space is partitioned into two equivalence classes, and an orientation of is simply an assignment of a positive sign to one equivalence class and a negative sign to the other. If and are oriented vector spaces in the sense that an orientation has been specified for each of them, then an isomorphism always either preserves orientation in the sense that for any ordered basis of , the ordered basis has the same sign as , or it reverses it. We say an -dimensional manifold is orientable if for every , there exist an open subset of , a neighbourhood of in and a diffeomorphism such that preserves orientation for every . A map like above whose derivative at every point preserves orientation is called an orientation-preserving map.
If and are manifolds, is a -form on and is differentiable, we define the pullback of by to be the -form on given by
[TABLE]
If is an -dimensional vector space and is linear, then for all (Guillemin and Pollack, 1974, p. 160).
If is an -form on an open subset of , then by (59), we can write . If is an integrable form on (i.e. is an integrable function on ), we can define the integral of over by
[TABLE]
where the integral on the right-hand side is a usual Lebesgue integral. Now let be an -dimensional orientable manifold that can be parametrised with a single chart, in the sense that there exists an open subset of and an orientation-preserving diffeomorphism . Define the support of an -form on to be the closure of . If is compactly supported, then its pullback is a compactly supported -form on ; moreover is integrable, and we can define the integral over of by
[TABLE]
Alternatively, we can suppose that is non-negative and measurable in the sense that , say, with non-negative and measurable on . In this case, we can also define the integral of over via (60).
More generally, integrals of forms over more complicated manifolds can be defined via partitions of unity. Recall (Guillemin and Pollack, 1974, p. 52) that if is an arbitrary subset of , and is a (relatively) open cover of , then there exists a sequence of real-valued, differentiable functions on , called a partition of unity with respect to , with the following properties:
1. for all ;
2. Each has a neighbourhood on which all but finitely many functions are identically zero;
3. Each is identically zero except on some closed set contained in some ;
4. for all .
Now let be an -dimensional, orientable manifold, so for each , there exist an open subset of , a neighbourhood of in and an orientation-preserving diffeomorphism . If is a compactly supported -form on and denotes a partition of unity on with respect to , we can define the integral of over by
[TABLE]
In fact, writing for the compact support of , we can find a neighbourhood of , and a finite subset of such that are identically zero on , and such that
[TABLE]
Thus the integral can be written as a finite sum. Similarly, if is a non-negative -form on , we can again define the integral of over via (61). Finally, if is an integrable -form on , the integral can be defined by taking positive and negative parts in the usual way.
In our work, we are especially interested in integrals of a particular type of form. Given an -dimensional, orientable manifold in , the volume form is the unique -form on such that at each , the alternating -tensor on gives value to each positively oriented orthonormal basis for . For example, when , we have , provided we consider the standard basis to be positively oriented. As another example, if is a -dimensional manifold and is continuously differentiable with non-empty and for , then is a -dimensional, orientable manifold (Guillemin and Pollack, 1974, Exercise 18, p. 106). If we say that an ordered, orthonormal basis for is positively oriented whenever , we have that
[TABLE]
where denotes the th coordinate function. We now define an ordered, orthonormal basis for to be positively oriented. Further, we define a -form and a -form on by
[TABLE]
Then, with defined as in Proposition G.19, and under the conditions of that proposition,
[TABLE]
so . It follows that if is either compactly supported and integrable, or non-negative and measurable, then
[TABLE]
We also require the change of variables formula: if and are orientable manifolds and are of dimension , and if is an orientation-preserving diffeomorphism, then
[TABLE]
for every compactly supported, integrable -form on (Guillemin and Pollack, 1974, p. 168). In particular, if is either compactly supported and integrable, or non-negative and measurable, then writing , we have from (62) and (63) that
[TABLE]
References
- Abramson, I. S. (1982). On bandwidth estimation in kernel estimates – a square root law. Ann. Statist., 10, 1217–1223.
- Audibert, J.-Y. and Tsybakov, A. B. (2007). Fast learning rates for plug-in classifiers. Ann. Statist., 35, 608–633.
- Berrett, T. B. and Samworth, R. J. (2019a). Efficient two-sample functional estimation and the super-oracle phenomenon. https://arxiv.org/abs/1904.09347.
- Berrett, T. B. and Samworth, R. J. (2019b). Nonparametric independence testing via mutual information. Biometrika, to appear.
- Berrett, T. B., Samworth, R. J. and Yuan, M. (2019). Efficient multivariate entropy estimation via $k$-nearest neighbour distances. Ann. Statist., 47, 288–318.
- Biau, G., Cérou, F. and Guyader, A. (2010). On the rate of convergence of the bagged nearest neighbor estimate. J. Mach. Learn. Res., 11, 687–712.
- Biau, G. and Devroye, L. (2015). Lectures on the Nearest Neighbor Method. Springer, New York.
- Boucheron, S., Bousquet, O. and Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM: PS, 9, 323–375.
