Local nearest neighbour classification with applications to semi-supervised learning

Timothy I. Cannings; Thomas B. Berrett; Richard J. Samworth

arXiv:1704.00642·math.ST·May 21, 2019

TL;DR

This paper introduces a new asymptotic analysis of local-$k$-nearest neighbour classifiers, revealing conditions for optimal excess risk rates and proposing a semi-supervised variant that adapts to feature density estimates.

Contribution

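It derives an asymptotic expansion for the global excess risk of the local-$k$-nearest neighbour classifier, proposes a semi-supervised classifier that uses unlabelled data to estimate the marginal feature density, and shows via a minimax lower bound that this classifier attains the optimal rate up to a subpolynomial factor in $n$.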

Findings

01

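A local choice of $k$ achieves an $O(n^{-4/(d+4)})$ excess risk rate when the marginal feature distribution has a finite $\rho$th moment for some $\rho>4$, subject to further regularity conditions.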

02

The semi-supervised local-$k$-nearest neighbour classifier attains the minimax optimal rate up to a subpolynomial factor in $n$.

03

A simulation study illustrates the theoretical improvements over the standard $k$-nearest neighbour classifier.

Abstract

We derive a new asymptotic expansion for the global excess risk of a local- $k$ -nearest neighbour classifier, where the choice of $k$ may depend upon the test point. This expansion elucidates conditions under which the dominant contribution to the excess risk comes from the decision boundary of the optimal Bayes classifier, but we also show that if these conditions are not satisfied, then the dominant contribution may arise from the tails of the marginal distribution of the features. Moreover, we prove that, provided the $d$ -dimensional marginal distribution of the features has a finite $ρ$ th moment for some $ρ > 4$ (as well as other regularity conditions), a local choice of $k$ can yield a rate of convergence of the excess risk of $O (n^{- 4/ (d + 4)})$ , where $n$ is the sample size, whereas for the standard $k$ -nearest neighbour classifier, our theory would require $d \geq 5$ and…

Tables

Table 1: Misclassification rates (in percent) for Settings 1, 2 and 3. In the final two columns we present the regret ratios given in (53) (with standard errors calculated via the delta method).

$d$	Bayes risk	$n$	$\hat{k}$nn risk	${\hat{k}}_{O}$nn risk	${\hat{k}}_{SS}$nn risk	O RR	SS RR
Setting 1
1	22.67	50	${26.85}_{0.13}$	${25.91}_{0.12}$	${25.98}_{0.13}$	${0.78}_{0.022}$	${0.79}_{0.023}$
		200	${24.07}_{0.06}$	${23.52}_{0.06}$	${23.48}_{0.05}$	${0.61}_{0.030}$	${0.58}_{0.029}$
		1000	${23.20}_{0.04}$	${22.93}_{0.04}$	${22.94}_{0.04}$	${0.48}_{0.048}$	${0.50}_{0.048}$
2	13.30	50	${17.70}_{0.09}$	${16.96}_{0.08}$	${16.95}_{0.08}$	${0.83}_{0.015}$	${0.83}_{0.015}$
		200	${15.09}_{0.05}$	${14.69}_{0.04}$	${14.74}_{0.05}$	${0.77}_{0.018}$	${0.80}_{0.019}$
		1000	${14.04}_{0.04}$	${13.78}_{0.03}$	${13.80}_{0.03}$	${0.65}_{0.025}$	${0.67}_{0.025}$
5	3.53	50	${9.46}_{0.07}$	${8.95}_{0.06}$	${8.94}_{0.06}$	${0.91}_{0.006}$	${0.91}_{0.006}$
		200	${6.94}_{0.03}$	${6.67}_{0.03}$	${6.70}_{0.03}$	${0.92}_{0.006}$	${0.93}_{0.007}$
		1000	${5.49}_{0.02}$	${5.18}_{0.02}$	${5.23}_{0.02}$	${0.84}_{0.008}$	${0.87}_{0.008}$
Setting 2
1	31.16	50	${36.55}_{0.14}$	${36.07}_{0.14}$	${35.93}_{0.14}$	${0.91}_{0.020}$	${0.88}_{0.020}$
		200	${32.93}_{0.08}$	${32.38}_{0.07}$	${32.42}_{0.07}$	${0.69}_{0.031}$	${0.71}_{0.032}$
		1000	${31.62}_{0.05}$	${31.37}_{0.05}$	${31.37}_{0.05}$	${0.46}_{0.065}$	${0.47}_{0.066}$
2	31.15	50	${37.79}_{0.13}$	${38.02}_{0.12}$	${37.90}_{0.12}$	${1.02}_{0.014}$	${1.01}_{0.015}$
		200	${33.64}_{0.08}$	${33.63}_{0.07}$	${33.54}_{0.07}$	${1.00}_{0.028}$	${0.96}_{0.026}$
		1000	${31.83}_{0.05}$	${31.81}_{0.05}$	${31.80}_{0.05}$	${0.97}_{0.039}$	${0.95}_{0.038}$
5	20.10	50	${28.74}_{0.12}$	${29.16}_{0.12}$	${29.13}_{0.11}$	${1.05}_{0.011}$	${1.05}_{0.011}$
		200	${23.60}_{0.06}$	${23.75}_{0.06}$	${23.93}_{0.06}$	${1.04}_{0.014}$	${1.09}_{0.015}$
		1000	${21.86}_{0.04}$	${21.71}_{0.04}$	${21.77}_{0.04}$	${0.91}_{0.014}$	${0.95}_{0.014}$
Setting 3
1	37.44	50	${44.76}_{0.10}$	${43.09}_{0.12}$	${43.08}_{0.12}$	${0.77}_{0.013}$	${0.77}_{0.013}$
		200	${41.86}_{0.08}$	${40.18}_{0.09}$	${40.23}_{0.09}$	${0.62}_{0.017}$	${0.63}_{0.017}$
		1000	${38.68}_{0.06}$	${37.85}_{0.05}$	${37.89}_{0.05}$	${0.33}_{0.033}$	${0.36}_{0.032}$
2	37.45	50	${46.20}_{0.09}$	${44.81}_{0.10}$	${45.24}_{0.10}$	${0.84}_{0.009}$	${0.89}_{0.009}$
		200	${43.50}_{0.07}$	${42.29}_{0.08}$	${42.86}_{0.08}$	${0.80}_{0.011}$	${0.89}_{0.011}$
		1000	${40.53}_{0.06}$	${39.64}_{0.06}$	${39.96}_{0.06}$	${0.71}_{0.013}$	${0.82}_{0.014}$
5	23.23	50	${41.56}_{0.11}$	${38.13}_{0.11}$	${39.26}_{0.12}$	${0.81}_{0.005}$	${0.87}_{0.005}$
		200	${36.02}_{0.07}$	${33.34}_{0.06}$	${34.68}_{0.07}$	${0.79}_{0.004}$	${0.90}_{0.004}$
		1000	${31.46}_{0.05}$	${29.91}_{0.05}$	${30.58}_{0.05}$	${0.81}_{0.004}$	${0.89}_{0.004}$
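The regret ratios are defined in (53), which lies outside this excerpt. The tabulated values are, however, consistent with reading a regret ratio as the excess risk of the local classifier divided by the excess risk of the standard $k$nn classifier. The sketch below checks that reading on the first row of Setting 1; the function name `regret_ratio` is hypothetical and the interpretation is an assumption rather than a quotation of (53).

```python
def regret_ratio(local_risk, standard_risk, bayes_risk):
    """Assumed reading of the regret ratio: ratio of excess risks over the Bayes risk."""
    return (local_risk - bayes_risk) / (standard_risk - bayes_risk)


# Setting 1, d = 1, n = 50; risks copied from Table 1.
bayes, standard, oracle, semi_supervised = 22.67, 26.85, 25.91, 25.98

oracle_rr = regret_ratio(oracle, standard, bayes)
ss_rr = regret_ratio(semi_supervised, standard, bayes)
print(round(oracle_rr, 2), round(ss_rr, 2))
```

A ratio below 1 means the local classifier has smaller excess risk than the standard classifier at that sample size.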

Equations

R_{\mathcal{R}}(C):=\mathbb{P}\bigl[\{C(X)\neq Y\}\cap\{X\in\mathcal{R}\}\bigr].

C^{\mathrm{Bayes}}(x):=\left\{\begin{array}[]{ll}1&\mbox{if $\eta(x)\geq 1/2$};\\ 0&\mbox{otherwise},\end{array}\right.

\hat{C}_{n}^{k_{\mathrm{L}}\mathrm{nn}}(x):=\left\{\begin{array}[]{ll}1&\mbox{if $\hat{S}_{n}(x)\geq 1/2$};\\ 0&\mbox{otherwise}.\end{array}\right.

K_{\beta}\equiv K_{\beta,n}:=\bigl{\{}\lceil(n-1)^{\beta}\rceil,\lceil(n-1)^{\beta}\rceil+1,\ldots,\lfloor(n-1)^{1-\beta}\rfloor\bigr{\}}

\max\biggl{\{}\|\dot{\bar{f}}(x_{0})\|,\sup_{u\in B_{\epsilon_{0}}(0)}\|\ddot{\bar{f}}(x_{0}+u)\|_{\mathrm{op}}\biggr{\}}\leq\bar{f}(x_{0})\ell\bigl{(}\bar{f}(x_{0})\bigr{)},

p_{r}(x)\geq\epsilon_{0}a_{d}r^{d}\bar{f}(x).

\sup_{x,z\in\mathcal{S}^{2\epsilon_{0}}:\|z-x\|\leq g(\epsilon)}\|\ddot{\eta}(z)-\ddot{\eta}(x)\|_{\mathrm{op}}\leq\epsilon.

|\eta(x)-1/2|\geq\frac{1}{\ell\bigl{(}\bar{f}(x)\bigr{)}}

B_{1}:=\int_{\mathcal{S}}\frac{\bar{f}(x_{0})}{4\|\dot{\eta}(x_{0})\|}\,d\mathrm{Vol}^{d-1}(x_{0})\quad\text{and}\quad B_{2}:=\int_{\mathcal{S}}\frac{\bar{f}(x_{0})^{1-4/d}}{\|\dot{\eta}(x_{0})\|}a(x_{0})^{2}\,d\mathrm{Vol}^{d-1}(x_{0}),

a(x):=\frac{\sum_{j=1}^{d}\bigl{\{}\eta_{j}(x)\bar{f}_{j}(x)+\frac{1}{2}\eta_{jj}(x)\bar{f}(x)\bigr{\}}}{(d+2)a_{d}^{2/d}\bar{f}(x)}.

\sup_{P\in\mathcal{P}_{d,\theta}}\biggl{|}R(\hat{C}_{n}^{k\mathrm{nn}})-R(C^{\mathrm{Bayes}})-\frac{B_{1}}{k}-B_{2}\Bigl{(}\frac{k}{n}\Bigr{)}^{4/d}\biggr{|}=o\biggl{(}\frac{1}{k}+\Bigl{(}\frac{k}{n}\Bigr{)}^{4/d}\biggr{)}

\sup_{P\in\mathcal{P}_{d,\theta}}\biggl{|}R(\hat{C}_{n}^{k\mathrm{nn}})-R(C^{\mathrm{Bayes}})-\frac{B_{1}}{k}\biggr{|}=o\Bigl{(}\frac{1}{k}+\Bigl{(}\frac{k}{n}\Bigr{)}^{\frac{\rho}{\rho+d}-\epsilon}\Bigr{)}

\liminf_{n\to\infty}\inf_{k\in K_{\beta}}\biggl{\{}k+\Bigl{(}\frac{n}{k}\Bigr{)}^{1+\epsilon}\biggr{\}}\bigl{\{}R(\hat{C}_{n}^{k\mathrm{nn}})-R(C^{\mathrm{Bayes}})\bigr{\}}>0

k_{\mathrm{O}}(x):=\max\Bigl{[}\lceil(n-1)^{\beta}\rceil\,,\,\min\bigl{\{}\bigl{\lfloor}B\bigl{\{}\bar{f}(x)(n-1)\bigr{\}}^{4/(d+4)}\bigr{\rfloor}\,,\,\lfloor(n-1)^{1-\beta}\rfloor\bigr{\}}\Bigr{]},

\sup_{P\in\mathcal{P}_{d,\theta}}\Bigl{|}R(\hat{C}_{n}^{k_{\mathrm{O}}\mathrm{nn}})-R(C^{\mathrm{Bayes}})-B_{3}n^{-4/(d+4)}\Bigr{|}=o(n^{-4/(d+4)}),

B_{3}:=\int_{\mathcal{S}}\frac{\bar{f}(x_{0})^{d/(d+4)}}{\|\dot{\eta}(x_{0})\|}\Bigl{\{}\frac{1}{4B}+B^{4/d}a(x_{0})^{2}\Bigr{\}}\,d\mathrm{Vol}^{d-1}(x_{0})<\infty.

\sup_{P\in\mathcal{P}_{d,\theta}}\bigl{\{}R(\hat{C}_{n}^{k_{\mathrm{O}}\mathrm{nn}})-R(C^{\mathrm{Bayes}})\bigr{\}}=o(n^{-\rho/(\rho+d)+\beta+\epsilon}),

\hat{f}_{m}(x)=\hat{f}_{m,h}(x):=\frac{1}{mh^{d}}\sum_{j=1}^{m}K\Bigl{(}\frac{x-X_{n+j}}{h}\Bigr{)}.

k_{\mathrm{SS}}(x):=\max\Bigl{[}\lceil(n-1)^{\beta}\rceil\,,\,\min\bigl{\{}\lfloor B\{\hat{f}_{m}(x)(n-1)\}^{4/(d+4)}\rfloor\,,\,\lfloor(n-1)^{1-\beta}\rfloor\bigr{\}}\Bigr{]}.

\|\bar{f}(y)-\bar{f}(x)\|\leq\lambda\|y-x\|^{\gamma}\quad\text{for all }x,y\in\mathbb{R}^{d}.

\|\dot{\bar{f}}(y)-\dot{\bar{f}}(x)\|\leq\lambda\|y-x\|^{\gamma-1}\quad\text{for all }x,y\in\mathbb{R}^{d}.

\sup_{P\in\mathcal{P}_{d,\theta}\cap\mathcal{Q}_{d,\gamma,\lambda}}\Bigl{|}R(\hat{C}_{n}^{k_{\mathrm{SS}}\mathrm{nn}})-R(C^{\mathrm{Bayes}})-B_{3}n^{-4/(d+4)}\Bigr{|}=o(n^{-4/(d+4)})

\sup_{P\in\mathcal{P}_{d,\theta}\cap\mathcal{Q}_{d,\gamma,\lambda}}\bigl{\{}R(\hat{C}_{n}^{k_{\mathrm{SS}}\mathrm{nn}})-R(C^{\mathrm{Bayes}})\bigr{\}}=o(n^{-\rho/(\rho+d)+\beta+\epsilon}),

\sup_{P\in\mathcal{P}_{d,\theta}\cap\mathcal{Q}_{d,\gamma,\lambda}}\mathbb{P}\biggl{(}\|\hat{f}_{m}-\bar{f}\|_{\infty}\geq\frac{1}{(n-1)^{1-\alpha/2}}\biggr{)}=o(n^{-4/(d+4)}).

\inf_{C_{n}}\sup_{P\in\mathcal{P}_{d,\theta}\cap\mathcal{Q}_{d,\gamma,\lambda}}\bigl\{R(C_{n})-R(C^{\mathrm{Bayes}})\bigr\}\geq c\,g^{-1}(1/q)^{\frac{2d(1+\nu)}{4+d+\nu(\rho+d)}}n^{-\frac{4+\nu\rho}{4+d+\nu(\rho+d)}},

\inf_{C_{n}}\sup_{P\in\mathcal{P}_{d,\theta}\cap\mathcal{Q}_{d,\gamma,\lambda}}\bigl\{R(C_{n})-R(C^{\mathrm{Bayes}})\bigr\}\geq c\,n^{-\bigl(\min\bigl\{\frac{4}{4+d},\frac{\rho}{\rho+d}\bigr\}+\epsilon\bigr)}.

g^{-1}(1/q_{n})^{\frac{2d(1+\nu)}{4+d+\nu(\rho+d)}}n^{-\frac{4+\nu\rho}{4+d+\nu(\rho+d)}}

\hat{\mu}_{n}(x)=\hat{\mu}_{n}(x,x^{n}):=\mathbb{E}\{\hat{S}_{n}(x)\mid X^{n}=x^{n}\}=\frac{1}{k_{\mathrm{L}}(x)}\sum_{i=1}^{k_{\mathrm{L}}(x)}\eta(x_{(i)}),

\hat{\sigma}_{n}^{2}(x)=\hat{\sigma}_{n}^{2}(x,x^{n}):=\mathrm{Var}\{\hat{S}_{n}(x)\mid X^{n}=x^{n}\}=\frac{1}{k_{\mathrm{L}}(x)^{2}}\sum_{i=1}^{k_{\mathrm{L}}(x)}\eta(x_{(i)})\{1-\eta(x_{(i)})\}.

c_{n}:=\sup_{x_{0}\in\mathcal{S}}\ell\biggl{(}\frac{k_{\mathrm{L}}(x_{0})}{n-1}\biggr{)}.

Full text

Local nearest neighbour classification with applications to semi-supervised learning

Timothy I. Cannings (http://www.maths.ed.ac.uk/%7Etcannings), Thomas B. Berrett (www.statslab.cam.ac.uk/%7Etbb26) and Richard J. Samworth (www.statslab.cam.ac.uk/%7Erjs57)

University of Edinburgh and University of Cambridge

School of Mathematics, James Clerk Maxwell Building, Peter Guthrie Tait Road, Edinburgh EH9 3FD

Statistical Laboratory, Centre for Mathematical Sciences, Wilberforce Road, Cambridge CB3 0WB

Abstract

We derive a new asymptotic expansion for the global excess risk of a local- $k$ -nearest neighbour classifier, where the choice of $k$ may depend upon the test point. This expansion elucidates conditions under which the dominant contribution to the excess risk comes from the decision boundary of the optimal Bayes classifier, but we also show that if these conditions are not satisfied, then the dominant contribution may arise from the tails of the marginal distribution of the features. Moreover, we prove that, provided the $d$ -dimensional marginal distribution of the features has a finite $\rho$ th moment for some $\rho>4$ (as well as other regularity conditions), a local choice of $k$ can yield a rate of convergence of the excess risk of $O(n^{-4/(d+4)})$ , where $n$ is the sample size, whereas for the standard $k$ -nearest neighbour classifier, our theory would require $d\geq 5$ and $\rho>4d/(d-4)$ finite moments to achieve this rate. These results motivate a new $k$ -nearest neighbour classifier for semi-supervised learning problems, where the unlabelled data are used to obtain an estimate of the marginal feature density, and fewer neighbours are used for classification when this density estimate is small. Our worst-case rates are complemented by a minimax lower bound, which reveals that the local, semi-supervised $k$ -nearest neighbour classifier attains the minimax optimal rate over our classes for the excess risk, up to a subpolynomial factor in $n$ . These theoretical improvements over the standard $k$ -nearest neighbour classifier are also illustrated through a simulation study.

MSC subject classification: 62G20

Keywords: classification problems, nearest neighbours, nonparametric classification, semi-supervised learning

t1: Research supported by an Engineering and Physical Sciences Research Council (EPSRC) programme grant.

t2: Research supported by an EPSRC Fellowship and programme grant, as well as a grant from the Leverhulme Trust.

1 Introduction

Supervised classification problems represent some of the most frequently-occurring statistical challenges in a wide variety of fields, including fraud detection, medical diagnoses and targeted advertising, to name just a few. The area has received an enormous amount of attention within both the statistics and machine learning communities; for an excellent survey with pointers to much of the relevant literature, see Boucheron et al. (2005).

The $k$-nearest neighbour classifier, which assigns the test point according to a majority vote over the classes of its $k$ nearest points in the training set, was introduced in the seminal work of Fix and Hodges (1951) (later republished as Fix and Hodges (1989)), and is arguably the simplest and most intuitive nonparametric classifier. Cover and Hart (1967) provided mild conditions under which the asymptotic risk of the $1$-nearest neighbour classifier is bounded above by twice the risk of the optimal Bayes classifier. Stone (1977) proved that if $k=k_{n}$ is chosen such that $k\rightarrow\infty$ and $k/n\rightarrow 0$ as $n\rightarrow\infty$, then the $k$-nearest neighbour classifier is universally consistent, in the sense that under any data generating mechanism, its risk converges to the Bayes risk. Further recent contributions, some of which treat the $k$-nearest neighbour classifier as a special case of a plug-in classifier, include Kulkarni and Posner (1995), Audibert and Tsybakov (2007), Hall et al. (2008), Biau et al. (2010), Samworth (2012), Chaudhuri and Dasgupta (2014) and Celisse and Mary-Huard (2018). Nearest neighbour methods have also been extensively used in other statistical problems, including density estimation (Loftsgaarden and Quesenberry, 1965; Mack and Rosenblatt, 1979; Mack, 1983), nonparametric clustering (Heckel and Bölcskei, 2015), entropy and other functional estimation (Kozachenko and Leonenko, 1987; Berrett et al., 2019; Berrett and Samworth, 2019a) and testing problems (Schilling, 1986; Berrett and Samworth, 2019b); see also the recent book Biau and Devroye (2015).

Despite these aforementioned works, the behaviour of the $k$ -nearest neighbour classifier in the tails of a distribution remains poorly understood. Indeed, writing $(X,Y)$ for a generic data pair, where the $d$ -dimensional feature vector $X$ has marginal density $\bar{f}$ and $Y$ denotes a binary class label, most of the results in the papers mentioned in the previous paragraph pertain either to situations where $\bar{f}$ is compactly supported and bounded away from zero on its support, or where the excess risk over that of the Bayes classifier is computed only over a compact subset of $\mathbb{R}^{d}$ . As such, many questions remain regarding the effect of tail behaviour on the excess risk.

In this paper, we consider classes of distributions that allow the feature vectors to have unbounded support. Our first goal is to provide a new asymptotic expansion for the global excess risk of a $k$ -nearest neighbour classifier, whose error term can be bounded uniformly over our classes (Theorem 1). This expansion elucidates conditions under which the dominant contribution to the excess risk comes from the decision boundary of the Bayes classifier, but we also show that if these conditions are not satisfied, then the dominant contribution may arise from the tails of the marginal distribution of the features. The threshold for these two different regimes is governed by a parameter $\rho$ that controls the number of finite moments of the marginal feature distribution: if $d\geq 5$ and $\rho>4d/(d-4)$ , then we obtain a rate of $O(n^{-4/(d+4)})$ uniformly over our classes, while if $d\leq 4$ or $d\geq 5$ and $\rho\leq 4d/(d-4)$ then our rate is slower, namely $O(n^{-\frac{\rho}{2\rho+d}+\epsilon})$ , for every $\epsilon>0$ .
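A heuristic way to see where the threshold $\rho=4d/(d-4)$ comes from is to balance the terms $1/k$, $(k/n)^{4/d}$ and $(k/n)^{\rho/(\rho+d)}$ that appear in the expansions for the standard classifier. The following is a sketch that ignores constants and the $\epsilon$ corrections; it is not the argument used in the proofs.

```latex
% Boundary regime: balance the 1/k term against the (k/n)^{4/d} term.
\frac{1}{k} \asymp \Bigl(\frac{k}{n}\Bigr)^{4/d}
  \;\Longrightarrow\; k \asymp n^{4/(d+4)},
  \qquad \text{excess risk} \asymp n^{-4/(d+4)}.

% Tail regime: balance the 1/k term against the (k/n)^{\rho/(\rho+d)} term.
\frac{1}{k} \asymp \Bigl(\frac{k}{n}\Bigr)^{\rho/(\rho+d)}
  \;\Longrightarrow\; k \asymp n^{\rho/(2\rho+d)},
  \qquad \text{excess risk} \asymp n^{-\rho/(2\rho+d)}.

% The boundary rate is the slower, and hence dominant, one exactly when
\frac{4}{d+4} < \frac{\rho}{2\rho+d}
  \;\Longleftrightarrow\; 4d < \rho(d-4)
  \;\Longleftrightarrow\; d \geq 5 \ \text{and}\ \rho > \frac{4d}{d-4}.
```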

The proof of Theorem 1 also reveals a local bias-variance trade-off that motivates a modification of the standard $k$ -nearest neighbour classifier in semi-supervised learning settings, where, as well as the labelled training data, we have access to another, independent, sample of unlabelled observations. Such semi-supervised problems occur in a wide range of applications, especially where it is expensive or time-consuming to obtain the labels associated with observations; in fact, it is often the case that unlabelled observations may vastly outnumber labelled ones. For an overview of semi-supervised learning applications and techniques, see Chapelle et al. (2006).

Our second contribution is to propose to allow the choice of $k$ in $k$-nearest neighbour classification to depend on an estimate of $\bar{f}$ at the test point $x\in\mathbb{R}^{d}$ in semi-supervised settings. Such a local choice of $k$ is analogous to the use of local bandwidths in the context of kernel density estimation, as studied by, e.g., Breiman et al. (1977), Abramson (1982) and Giné and Sang (2010). However, for density estimation, it is more common to choose a family of bandwidths $\{h(X_{i}):i=1,\ldots,n\}$ rather than $h=h(x)$, to ensure that the resulting estimate is itself a density. Moreover, theory there suggests that one should then choose $h(X_{i})\propto\bar{f}^{-1/2}(X_{i})$ in order to cancel the leading term in the asymptotic bias expansion (Abramson, 1982). By contrast, we find that when choosing $k=k(x)$, by using fewer neighbours in low density regions, we are able to achieve a better balance in the local bias-variance trade-off for estimating our main quantity of interest, namely the regression function. In particular, we initially study an oracle choice of $k=k(x)$ that depends on $\bar{f}(x)$, and show that the excess risk of the resulting classifier, computed over the whole of $\mathbb{R}^{d}$, is $O(n^{-4/(d+4)})$, again uniformly over our classes, for every $d\in\mathbb{N}$ and provided only that $\rho>4$. Moreover, in the more challenging case where $\rho\leq 4$, we obtain a rate of $O(n^{-\frac{\rho}{\rho+d}+\epsilon})$, for every $\epsilon>0$, which still reflects an improvement through the locally-adaptive choice of $k$. Assuming further that $\bar{f}$ has Hölder smoothness $\gamma\in(0,2]$, we show that if $m$ additional, unlabelled observations are used to estimate $\bar{f}$ by $\hat{f}_{m}$, and if $m=m_{n}$ satisfies $\liminf_{n\rightarrow\infty}m_{n}/n^{2+d/\gamma}>0$, then our semi-supervised $k$-nearest neighbour classifier mimics the asymptotic performance of the oracle.
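To make the density-dependent choice of $k$ concrete, here is a minimal Python sketch of the rule $k(x)=\max[\lceil(n-1)^{\beta}\rceil,\min\{\lfloor B\{\hat{f}_{m}(x)(n-1)\}^{4/(d+4)}\rfloor,\lfloor(n-1)^{1-\beta}\rfloor\}]$ together with a kernel density estimate of the form $\hat{f}_{m,h}(x)=(mh^{d})^{-1}\sum_{j}K((x-X_{n+j})/h)$. The Gaussian kernel, the default values of `B` and `beta`, and the function names `kde_at` and `k_local` are assumptions made for illustration only.

```python
import math

import numpy as np


def kde_at(x, unlabelled, h):
    """Kernel density estimate at x from an (m, d) array of unlabelled points.

    Assumption: K is the standard Gaussian kernel on R^d.
    """
    m, d = unlabelled.shape
    sq_dist = np.sum((unlabelled - x) ** 2, axis=1) / h**2
    kernel_values = np.exp(-0.5 * sq_dist) / (2 * math.pi) ** (d / 2)
    return float(kernel_values.sum() / (m * h**d))


def k_local(density, n, d, B=1.0, beta=0.25):
    """Density-dependent k, truncated to [ceil((n-1)^beta), floor((n-1)^(1-beta))]."""
    lower = math.ceil((n - 1) ** beta)
    upper = math.floor((n - 1) ** (1 - beta))
    target = math.floor(B * (density * (n - 1)) ** (4 / (d + 4)))
    return max(lower, min(target, upper))


# Fewer neighbours are used where the density (true or estimated) is small.
k_tail = k_local(1e-6, n=1001, d=1)
k_body = k_local(0.4, n=1001, d=1)
print(k_tail, k_body)
```

Passing the true density gives the oracle choice; passing `kde_at(x, unlabelled, h)` gives the semi-supervised one.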

Finally, we consider corresponding minimax lower bounds. We show in particular that the rates of convergence achieved by our semi-supervised, local- $k$ -nearest neighbour classifier are optimal up to subpolynomial factors in $n$ . Interestingly, our arguments also reveal that these rates cannot be improved with the additional knowledge of $\bar{f}$ .

As mentioned previously, studies of global excess risk rates of convergence in nonparametric classification for unbounded feature vector distributions are comparatively rare. Hall and Kang (2005) studied the tail error properties of a classifier based on kernel density estimates of the class conditional densities for univariate data. As an illustrative example, they showed that if, for large $x$ , one class has density $ax^{-\alpha}$ , while the other has density $bx^{-\beta}$ , for some $a,b>0$ and $1<\alpha<\beta<\alpha+1<\infty$ , then the excess risk from the right tail is of larger order than that in the body of the distribution.

Perhaps most closely related to this work, Gadat et al. (2016) recently obtained upper bounds on the supremum excess risk of the $k$ -nearest neighbour classifier, over classes where $\eta$ is Lipschitz, the well-known margin assumption of Mammen and Tsybakov (1999) is satisfied with parameter $\alpha>0$ , and assuming the tail condition that $\mathbb{P}\{\bar{f}(X)<\delta\}\leq\psi(\delta)$ is satisfied for some function $\psi$ and sufficiently small $\delta>0$ . Gadat et al. (2016) obtained a minimax lower bound over these classes, as well as providing an upper bound for the rate of the standard $k$ -nearest neighbour classifier. Since these rates do not match, they further introduced regions of the form $\bigl{\{}\bar{f}^{-1}\bigl{(}(a_{j+1},a_{j}]\bigr{)}:j\in\mathbb{N}\bigr{\}}$ with $a_{j+1}=a_{j}/2$ , and proved that when we choose $k=k(j)$ and specialise to the case where $\psi$ is the identity function, the resulting sliced $k$ -nearest neighbour classifier attains the minimax optimal rate of $n^{-(1+\alpha)/(2+\alpha+d)}$ up to a polylogarithmic factor in $n$ . Neither our smoothness and tail assumptions, nor our conclusions are directly comparable with the work of Gadat et al. (2016). In particular, we make a stronger smoothness assumption on $\eta$ in a neighbourhood of the Bayes decision boundary, implying that the margin assumption holds with parameter $\alpha=1$ ; see Lemma A.12 in Appendix A. This enables us to show that our semi-supervised classifier attains faster rates than are achievable under just a Lipschitz condition, and that these rates are minimax optimal up to subpolynomial factors in $n$ , over all possible values of our tail parameter $\rho$ ; moreover, we are also able to provide the leading constants in the asymptotic expansion of the excess risk in some cases.

The remainder of this paper is organised as follows. After introducing our setting in Section 2, we present in Section 3 our main results for the standard $k$-nearest neighbour classifier. This leads on, in Section 4, to our study of the semi-supervised setting, where we derive asymptotic results for the excess risk of our local-$k$-nearest neighbour classifier. Our minimax lower bound is presented in Section 5. The main arguments of the proofs of our theoretical results are given in Section 6, while in the appendices, we prove several claims made in the main text, bound various remainder terms, illustrate the finite-sample benefits of the semi-supervised classifier over the standard $k$-nearest neighbour classifier in a simulation study and provide an introduction to the ideas of differential geometry that underpin much of our analysis.

Finally we fix here some notation used throughout the paper. Let $\|\cdot\|$ denote the Euclidean norm and, for $r>0$ and $x\in\mathbb{R}^{d}$ , let $B_{r}(x):=\{z\in\mathbb{R}^{d}:\|x-z\|<r\}$ and $\bar{B}_{r}(x):=\{z\in\mathbb{R}^{d}:\|x-z\|\leq r\}$ denote respectively the open and closed Euclidean balls of radius $r$ centred at $x$ . Let $a_{d}:=\frac{2\pi^{d/2}}{d\Gamma(d/2)}$ denote the $d$ -dimensional Lebesgue measure of $B_{1}(0)$ . For a real-valued function $g$ defined on $A\subseteq\mathbb{R}^{d}$ that is twice differentiable at $x$ , write $\dot{g}(x)=(g_{1}(x),\ldots,g_{d}(x))^{T}$ and $\ddot{g}(x)=\bigl{(}g_{jk}(x)\bigr{)}$ for its gradient vector and Hessian matrix at $x$ , and let $\|g\|_{\infty}=\sup_{x\in A}|g(x)|$ . We write $\|\cdot\|_{\mathrm{op}}$ for the operator norm of a matrix.
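As a quick numerical check of the constant $a_{d}$, the short sketch below evaluates the formula $2\pi^{d/2}/\{d\Gamma(d/2)\}$ and compares it with the familiar values $2$, $\pi$ and $4\pi/3$ for $d=1,2,3$. The function name `unit_ball_volume` is illustrative.

```python
import math


def unit_ball_volume(d):
    """Lebesgue measure a_d of the unit Euclidean ball in R^d."""
    return 2 * math.pi ** (d / 2) / (d * math.gamma(d / 2))


for d in (1, 2, 3):
    print(d, unit_ball_volume(d))
```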

2 Statistical setting

Let $(X,Y),(X_{1},Y_{1}),\dots,(X_{n+m},Y_{n+m})$ be independent and identically distributed random pairs taking values in $\mathbb{R}^{d}\times\{0,1\}$ . Let $\pi_{r}:=\mathbb{P}(Y=r)$ , for $r=0,1$ , and $X|Y=r\sim P_{r}$ , for $r=0,1$ , where $P_{r}$ is a probability measure on $\mathbb{R}^{d}$ . Let $\eta(x):=\mathbb{P}(Y=1|X=x)$ denote the regression function and $P_{X}:=\pi_{0}P_{0}+\pi_{1}P_{1}$ denote the marginal distribution of $X$ . We observe labelled training data, $\mathcal{T}_{n}:=\{(X_{1},Y_{1}),\dots,(X_{n},Y_{n})\}$ , and unlabelled training data, $\mathcal{T}_{m}^{\prime}:=\{X_{n+1},\dots,X_{n+m}\}$ , and are presented with the task of assigning the test point $X$ to either class 0 or 1.

A classifier is a Borel measurable function $C:\mathbb{R}^{d}\to\{0,1\}$ , with the interpretation that $C$ assigns $x\in\mathbb{R}^{d}$ to the class $C(x)$ . Given a Borel measurable set $\mathcal{R}\subseteq\mathbb{R}^{d}$ , the misclassification rate, or risk, over $\mathcal{R}$ is

$$R_{\mathcal{R}}(C):=\mathbb{P}\bigl[\{C(X)\neq Y\}\cap\{X\in\mathcal{R}\}\bigr].$$

When $\mathcal{R}=\mathbb{R}^{d}$ , we drop the subscript for convenience. The Bayes classifier

$$C^{\mathrm{Bayes}}(x):=\begin{cases}1&\text{if }\eta(x)\geq 1/2;\\ 0&\text{otherwise},\end{cases}$$

minimises the risk over any region $\mathcal{R}$ (Devroye et al., 1996, p. 20). The performance of a classifier $C$ is therefore measured via its excess risk, $R_{\mathcal{R}}(C)-R_{\mathcal{R}}(C^{\mathrm{Bayes}})$ .

We can now formally define the local- $k$ -nearest neighbour classifier, which allows the number of neighbours considered to vary depending on the location of the test point. Suppose $k_{\mathrm{L}}:\mathbb{R}^{d}\to\{1,\dots,n\}$ is measurable. Given the test point $x\in\mathbb{R}^{d}$ , let $(X_{(1)},Y_{(1)}),\ldots,(X_{(n)},Y_{(n)})$ be a reordering of the training data such that $\|X_{(1)}-x\|\leq\dots\leq\|X_{(n)}-x\|$ . We will later assume that $P_{X}$ is absolutely continuous with respect to $d$ -dimensional Lebesgue measure, which ensures that ties occur with probability zero; where helpful for clarity, we also write $X_{(i)}(x)$ for the $i$ th nearest neighbour of $x$ . Let $\hat{S}_{n}(x):=k_{\mathrm{L}}(x)^{-1}\sum_{i=1}^{k_{\mathrm{L}}(x)}\mathbbm{1}_{\{Y_{(i)}=1\}}$ . Then the local- $k$ -nearest neighbour ( $k_{\mathrm{L}}$ nn) classifier is defined to be

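$$\hat{C}_{n}^{k_{\mathrm{L}}\mathrm{nn}}(x):=\begin{cases}1&\text{if }\hat{S}_{n}(x)\geq 1/2;\\ 0&\text{otherwise}.\end{cases}$$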
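The definition translates directly into code: sort the training points by distance to $x$, average the labels of the $k_{\mathrm{L}}(x)$ nearest to obtain $\hat{S}_{n}(x)$, and assign class 1 when that average is at least $1/2$. The sketch below is a minimal NumPy version; the function name `local_knn_classify` and the callable `k_fn` standing in for $k_{\mathrm{L}}$ are illustrative choices, and ties in distance are broken by training index.

```python
import numpy as np


def local_knn_classify(x, X_train, y_train, k_fn):
    """Classify x by a vote among its k_fn(x) nearest labelled points.

    X_train has shape (n, d), y_train has entries in {0, 1}, and
    k_fn maps a test point to an integer in {1, ..., n}.
    """
    k = int(k_fn(x))
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances, kind="stable")[:k]
    s_hat = y_train[nearest].mean()  # proportion of class-1 labels among the neighbours
    return int(s_hat >= 0.5)


X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1, 0, 0, 0])
x = np.array([0.1])
print(local_knn_classify(x, X_train, y_train, lambda _: 1))
print(local_knn_classify(x, X_train, y_train, lambda _: 3))
```

Passing a constant function for `k_fn` recovers the standard $k$-nearest neighbour classifier.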

Given $k\in\{1,\ldots,n\}$ , let $k_{0}$ denote the constant function $k_{0}(x):=k$ for all $x\in\mathbb{R}^{d}$ . Using $k_{\mathrm{L}}=k_{0}$ the definition above reduces to the standard $k$ -nearest neighbour classifier ( $k$ nn), and we will write $\hat{C}_{n}^{k\mathrm{nn}}$ in place of $\hat{C}_{n}^{k_{0}\mathrm{nn}}$ . For $\beta\in(0,1/2)$ , let

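$$K_{\beta}\equiv K_{\beta,n}:=\bigl\{\lceil(n-1)^{\beta}\rceil,\lceil(n-1)^{\beta}\rceil+1,\ldots,\lfloor(n-1)^{1-\beta}\rfloor\bigr\}$$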

denote a range of values of $k$ that will be of interest to us. Note that $K_{\beta_{1}}\supseteq K_{\beta_{2}}$ , for $\beta_{1}<\beta_{2}$ . Moreover, when $\beta$ is small, the restriction that $k\in K_{\beta}$ is only a slightly stronger requirement than the consistency conditions of Stone (1977), namely that $k=k_{n}\to\infty$ , $k_{n}/n\to 0$ as $n\to\infty$ .
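For a sense of scale, the sketch below lists the endpoints of the set $\{\lceil(n-1)^{\beta}\rceil,\ldots,\lfloor(n-1)^{1-\beta}\rfloor\}$ for a given $n$ and $\beta$ and checks the nesting property numerically. The helper name `k_range` and the values of $n$ and $\beta$ are illustrative.

```python
import math


def k_range(n, beta):
    """Integers from ceil((n-1)^beta) to floor((n-1)^(1-beta)), inclusive."""
    lower = math.ceil((n - 1) ** beta)
    upper = math.floor((n - 1) ** (1 - beta))
    return range(lower, upper + 1)


wide = k_range(1001, 0.25)
narrow = k_range(1001, 0.4)
print(wide[0], wide[-1], narrow[0], narrow[-1])
print(set(narrow) <= set(wide))  # a smaller beta gives a larger set
```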

3 Global risk of the $k$ -nearest neighbour classifier

In this section we provide an asymptotic expansion for the global risk of the standard (non-local) $k$ -nearest neighbour classifier. We first define the classes of data generating mechanisms over which our results will hold. Let $\mathcal{L}$ denote the class of decreasing functions $\ell:(0,\infty)\rightarrow[1,\infty)$ such that $\ell(\delta)=o(\delta^{-\tau})$ as $\delta\searrow 0$ , for every $\tau>0$ . Let $\mathcal{G}$ denote the class of strictly increasing functions $g:(0,1)\rightarrow(0,1)$ with $g(\epsilon)=o(\epsilon^{M})$ as $\epsilon\searrow 0$ , for every $M>0$ . Recall from Section 2 that, to any distribution $P$ on $\mathbb{R}^{d}\times\{0,1\}$ , we associate conditional distributions $P_{0},P_{1}$ , a regression function $\eta$ , marginal probabilities $\pi_{0},\pi_{1}$ and a marginal distribution $P_{X}$ . Now, for $\Theta:=(0,\infty)\times[1,\infty)\times(0,\infty)\times\mathcal{L}\times\mathcal{G}$ , and $\theta=(\epsilon_{0},M_{0},\rho,\ell,g)\in\Theta$ , let $\mathcal{P}_{d,\theta}$ denote the class of distributions $P$ on $\mathbb{R}^{d}\times\{0,1\}$ such that the probability measures $P_{0}$ and $P_{1}$ are absolutely continuous with respect to Lebesgue measure, with Radon–Nikodym derivatives $f_{0}$ and $f_{1}$ , respectively. Moreover, we assume that there exist versions of $f_{0}$ and $f_{1}$ for which the following conditions hold:

(A.1)

The marginal density of $X$ , namely $\bar{f}:=\pi_{0}f_{0}+\pi_{1}f_{1}$ , is continuous $P_{X}$ -almost everywhere and the set $\mathcal{X}_{\bar{f}}$ of continuity points of $\bar{f}$ is open.

Thus $\eta(x):=\pi_{1}f_{1}(x)/\{\pi_{0}f_{0}(x)+\pi_{1}f_{1}(x)\}$, where we define $0/0:=0$. Let $\mathcal{S}:=\{x\in\mathbb{R}^{d}:\eta(x)=1/2\}$ and, for $\epsilon>0$, let $\mathcal{S}^{\epsilon}:=\mathcal{S}+B_{\epsilon}(0)$. In the conditions below, we place further requirements on $\mathcal{S}$, which ensure not only that this set is non-empty, but in fact that it is a $(d-1)$-dimensional, orientable manifold.

(A.2)

The set $\mathcal{S}\cap\{x\in\mathbb{R}^{d}:\bar{f}(x)>0\}$ is non-empty and $\sup_{x_{0}\in\mathcal{S}}\bar{f}(x_{0})\leq M_{0}$ . The function $\bar{f}$ is twice continuously differentiable on $\mathcal{S}^{\epsilon_{0}}$ , and

[TABLE]

for all $x_{0}\in\mathcal{S}$ . Furthermore, writing $p_{r}(x):=P_{X}\bigl{(}B_{r}(x)\bigr{)}$ , we have for all $x\in\mathbb{R}^{d}\setminus\mathcal{S}^{\epsilon_{0}}$ and $r\in(0,\epsilon_{0}]$ that

[TABLE]

(A.3)

We have that $\eta$ is twice differentiable on $\mathcal{S}^{2\epsilon_{0}}$ with $\inf_{x_{0}\in\mathcal{S}}\|\dot{\eta}(x_{0})\|\geq\epsilon_{0}M_{0}$ . Moreover, $\sup_{x\in\mathcal{S}^{2\epsilon_{0}}}\|\dot{\eta}(x)\|\leq M_{0}$ , $\sup_{x\in\mathcal{S}^{2\epsilon_{0}}}\|\ddot{\eta}(x)\|_{\mathrm{op}}\leq M_{0}$ and given $\epsilon>0$ ,

[TABLE]

Finally, the function $\eta$ is continuous on $\{x:\bar{f}(x)>0\}$ , and

[TABLE]

for all $x\in\mathbb{R}^{d}\setminus\mathcal{S}^{\epsilon_{0}}$ .

(A.4)

We have $\int_{\mathbb{R}^{d}}\|x\|^{\rho}\,dP_{X}(x)\leq M_{0}$ .

Example 1.

Consider the distribution $P$ on $\mathbb{R}^{d}\times\{0,1\}$ for which $\bar{f}(x)=\frac{\Gamma(3+d/2)}{2\pi^{d/2}}(1-\|x\|^{2})^{2}\mathbbm{1}_{\{x\in B_{1}(0)\}}$ and $\eta(x)=\min(\|x\|^{2},1)$ . In Appendix B, we show that $P\in\mathcal{P}_{d,\theta}$ with $\theta=(\epsilon_{0},M_{0},\rho,\ell,g)\in\Theta$ for any $\rho>0$ , $g\in\mathcal{G}$ , and provided that $M_{0}\geq\max\bigl{\{}2,\frac{\Gamma(3+d/2)}{8\pi^{d/2}}\bigr{\}}$ , $\epsilon_{0}\leq\min\bigl{(}\frac{1}{10},2^{-d},\frac{2^{1/2}}{M_{0}}\bigr{)}$ and $\ell\in\mathcal{L}$ satisfies $\ell(\delta)\geq\max(48,\epsilon_{0}^{-1})$ for all $\delta>0$ .
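As a quick sanity check on Example 1, the following sketch verifies numerically that the stated $\bar{f}$ integrates to one, by integrating in polar coordinates. The function names and the midpoint-rule grid size are illustrative choices, not part of the paper. It also records that the Bayes decision boundary $\mathcal{S}=\{x:\eta(x)=1/2\}$ is the sphere of radius $2^{-1/2}$.

```python
import math

def example1_density_radial(r, d):
    """Marginal density of Example 1 at a point of norm r (zero outside the unit ball)."""
    if r >= 1.0:
        return 0.0
    c = math.gamma(3 + d / 2) / (2 * math.pi ** (d / 2))
    return c * (1 - r * r) ** 2

def unit_sphere_area(d):
    """Surface area of the unit sphere in R^d."""
    return 2 * math.pi ** (d / 2) / math.gamma(d / 2)

def total_mass(d, grid=20000):
    """Integrate the density over R^d in polar coordinates using the midpoint rule."""
    h = 1.0 / grid
    radial = 0.0
    for j in range(grid):
        r = (j + 0.5) * h
        radial += example1_density_radial(r, d) * r ** (d - 1)
    return unit_sphere_area(d) * radial * h

def eta(r):
    """Regression function of Example 1 at a point of norm r."""
    return min(r * r, 1.0)

# The Bayes decision boundary is the sphere on which eta equals 1/2.
BOUNDARY_RADIUS = 1 / math.sqrt(2)
```

Calling `total_mass(d)` for a few values of `d` should return a value close to 1, which agrees with the closed-form radial integral $\int_{0}^{1}r^{d-1}(1-r^{2})^{2}\,dr=\Gamma(d/2)/\Gamma(3+d/2)$.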

Asking for $P_{X}$ to have a Lebesgue density allows us to define the tail of the distribution as the region where $\bar{f}$ is smaller than some threshold. Condition (A.1) ensures that for all $\delta>0$ sufficiently small, the set $\mathcal{R}:=\{x:\bar{f}(x)>\delta\}\cap\mathcal{X}_{\bar{f}}$ is a $d$ -dimensional manifold, and $P_{X}(\mathcal{R}^{c})\leq\mathbb{P}\bigl{\{}\bar{f}(X)\leq\delta\bigr{\}}$ , where the latter quantity can be bounded using (A.4). The first part of (A.2) asks for a certain level of smoothness for $\bar{f}$ in a neighbourhood of $\mathcal{S}$ , and controls the behaviour of its first and second derivatives there relative to the original density. In particular, the greater degree of regularity asked of these derivatives in the tails of the marginal density in (1) allows us still to control the error of a Taylor approximation even in this region. The condition (1) is satisfied by all Gaussian and multivariate- $t$ densities, for example, for appropriate choices of $\epsilon_{0}$ and $\ell$ . The last part of (A.2) concerns the behaviour of the marginal feature distribution away from $\mathcal{S}^{\epsilon_{0}}$ and is often referred to as the strong minimal mass assumption (e.g. Gadat et al., 2016). It requires that the mass of the marginal feature distribution is not concentrated in the neighbourhood of a point and is a rather weaker condition than we ask for on $\mathcal{S}^{\epsilon_{0}}$ ; in particular, we do not insist that derivatives of $\bar{f}$ exist in this region.

The condition $\inf_{x_{0}\in\mathcal{S}}\|\dot{\eta}(x_{0})\|\geq\epsilon_{0}M_{0}$ in (A.3) asks for the class conditional densities, when weighted by their respective prior probabilities, to cross at an angle; in particular, this ensures that $\mathcal{S}$ is a $(d-1)$ -dimensional, orientable manifold (cf. Section G.3). Moreover, the bounds on the first and second derivatives of $\eta$ in a neighbourhood of $\mathcal{S}$ ensure that we can estimate $\eta$ sufficiently well. The last part of (A.3) asks that $\eta$ does not approach the critical value of $1/2$ too fast on the complement of $\mathcal{S}^{\epsilon_{0}}$ . Assumption (A.4) is a simple moment condition that, together with (A.2), ensures that the constants $B_{1}$ and $B_{2}$ in (2) below are finite where needed.

Let $d\mathrm{Vol}^{d-1}$ denote the $(d-1)$ -dimensional volume form on $\mathcal{S}$ (cf. Section G.3). Now let

[TABLE]

where

[TABLE]

We are now in a position to present our asymptotic expansion for the global excess risk of the standard $k$ -nearest neighbour classifier.

Theorem 1.

Fix $d\in\mathbb{N}$ and $\theta=(\epsilon_{0},M_{0},\rho,\ell,g)\in\Theta$ such that $\mathcal{P}_{d,\theta}\neq\emptyset$ .

(i) Suppose that $d\geq 5$ and $\rho>\frac{4d}{d-4}$ . Then for each $\beta\in(0,1/2)$ ,

[TABLE]

as $n\to\infty$ , uniformly for $k\in K_{\beta}$ .

(ii) Suppose that either $d\leq 4$, or $d\geq 5$ and $\rho\leq\frac{4d}{d-4}$. Then for each $\beta\in(0,1/2)$ and each $\epsilon>0$ we have

[TABLE]

as $n\to\infty$ , uniformly for $k\in K_{\beta}$ .

Theorem 1 reveals an interesting dichotomy: when $d\geq 5$ and $\rho>4d/(d-4)$ , the dominant contribution to the excess risk arises from the difficulty of classifying points close to the Bayes decision boundary $\mathcal{S}$ . In such settings, the excess risk of the standard $k$ -nearest neighbour classifier converges to zero at rate $O(n^{-4/(d+4)})$ when $k$ is chosen proportional to $n^{4/(d+4)}$ . On the other hand, part (ii) shows that when either $d\leq 4$ or $d\geq 5$ and $\rho\leq 4d/(d-4)$ , the dominant contribution to the excess risk when $k$ is large may come from the challenge of classifying points in the tails of the distribution. Indeed, Example 2 below provides one simple setting where this dominant contribution does come from the tails of the distribution.
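The choice $k\propto n^{4/(d+4)}$ can be illustrated with a small sketch. It assumes that the leading terms of the excess risk take the form $B_{1}/k+B_{2}(k/n)^{4/d}$, which is consistent with the bound $\gamma_{n}(k)=O\bigl(1/k+(k/n)^{4/d}\bigr)$ used in the proof of Theorem 1. The numerical values of `b1` and `b2` in the usage below are hypothetical placeholders rather than constants computed from any distribution.

```python
def leading_risk(k, n, d, b1, b2):
    """Assumed leading-order excess risk: variance-type term plus squared-bias-type term."""
    return b1 / k + b2 * (k / n) ** (4 / d)

def balancing_k(n, d, b1, b2):
    """Minimiser over real k > 0 of leading_risk, obtained by setting the derivative to zero."""
    return (d * b1 / (4 * b2)) ** (d / (d + 4)) * n ** (4 / (d + 4))

def rate_exponent(d):
    """Exponent r such that the balanced risk is of order n^(-r)."""
    return 4 / (d + 4)
```

For example, `balancing_k(10**4, 5, 1.0, 1.0)` gives the balancing value for $n=10^{4}$ and $d=5$, and `rate_exponent(2)` returns $2/3$, the compact-set rate quoted in Example 2.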

Example 2.

Suppose that the joint density of $X$ at $x=(x_{1},x_{2})\in(0,1)\times\mathbb{R}$ is given by $\bar{f}(x)=2x_{1}f_{2}(x_{2})$ , where $f_{2}$ is a positive, twice continuously differentiable density with $f_{2}(x_{2})=e^{-|x_{2}|}/2$ for $|x_{2}|>1$ . Suppose also that $\eta(x)=x_{1}$ . Then the corresponding joint distribution $P$ belongs to $\mathcal{P}_{2,\theta}$ provided $\theta=(\epsilon_{0},M_{0},\rho,\ell,g)$ is such that $M_{0}$ is sufficiently large, $\epsilon_{0}\leq\min(1/8,1/M_{0})$ and $\ell$ is a sufficiently large constant ( $\rho>0$ and $g\in\mathcal{G}$ can be chosen arbitrarily). We prove in Appendix C that for every $\beta\in(0,1/2)$ and $\epsilon>0$ ,

[TABLE]

as $n\to\infty$ . Thus the rate of convergence in this example is at best $n^{-1/2}$ , up to subpolynomial factors, whereas a rate of $n^{-2/3}$ is achievable over any compact set.

The proof of Theorem 1, and indeed the proofs of Theorems 2 and 3 that follow in Section 4 below, depend crucially on Theorem 6.7 in Section 6. This result provides an asymptotic expansion for the excess risk of a general (local or global) $k$ -nearest neighbour classifier over a region $\mathcal{R}_{n}\subseteq\{x\in\mathbb{R}^{d}:\bar{f}(x)\geq\delta_{n}(x)\}$ , where $\delta_{n}(x)$ , defined in (7) below, shrinks to zero at a rate slow enough to ensure that $X_{(k)}(x)$ concentrates around $x$ uniformly over $\mathcal{R}_{n}$ . The intuition regarding the behaviour of the excess risk, then, is that when $x\in\mathcal{R}_{n}$ and $x$ is not close to $\mathcal{S}$ , with high probability the $k$ nearest neighbours of $x$ are on the same side of $\mathcal{S}$ as $x$ ; i.e. $\mathrm{sgn}\bigl{(}\eta(X_{(i)})-1/2\bigr{)}=\mathrm{sgn}\bigl{(}\eta(x)-1/2\bigr{)}$ for $i=1,\ldots,k$ . The probability of classifying $x$ differently from the Bayes classifier can therefore be shown to be $O(n^{-M})$ for every $M>0$ , using Hoeffding’s inequality. Thus, the challenging regions for classification consist of neighbourhoods of $\mathcal{S}$ , where $\eta$ is close to $1/2$ , together with $\mathcal{R}_{n}^{c}$ , where we no longer enjoy the same nearest neighbour concentration properties. For the first of these regions, we exploit our smoothness assumptions to derive asymptotic expansions for the bias and variance of $\hat{S}_{n}(x)$ , uniformly over appropriate neighbourhoods of $\mathcal{S}$ , and using a normal approximation, we can deduce an asymptotic expansion for the excess risk, uniformly over our classes of distributions and an appropriate set of nearest neighbour classifiers. For $\mathcal{R}_{n}^{c}$ we are unable to bound the probability of classifying differently from the Bayes classifier with anything other than a trivial bound, but we can control $P_{X}(\mathcal{R}_{n}^{c})$ using (A.4).

Finally in this section, we mention that Samworth (2012) obtained a similar expansion to that in Theorem 1(i) for a fixed distribution $P$ satisfying certain smoothness conditions. However, there the risk was computed only over a compact set, so the analysis failed to elucidate the important effects of tail behaviour on the excess risk. Another key difference is that here we define classes $\mathcal{P}_{d,\theta}$ , and show that the remainder terms in our asymptotic expansion hold uniformly over these classes; the introduction of these classes further facilitates the study of corresponding minimax lower bounds in Section 5 below.

4 Local-$k$-nearest neighbour classifiers

In this section we explore the consequences of a local choice of $k$ , compared with the global choice in Theorem 1. Initially, we consider an oracle choice, where $k$ is allowed to depend on the marginal feature density $\bar{f}$ (Section 4.1), but we then relax this to semi-supervised settings, where $\bar{f}$ can be estimated from unlabelled training data (Section 4.2).

4.1 Oracle classifier

Suppose for now that the marginal density $\bar{f}$ is known. For $\beta\in(0,1/2)$ and $B>0$ , let

[TABLE]

where the subscript O refers to the fact that this is an oracle choice of the function $k_{\mathrm{L}}$ , since it depends on $\bar{f}$ . This choice aims to balance the local bias and variance of $\hat{S}_{n}(x)$ .
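A minimal sketch of an oracle rule of this kind is given below. Since the display above is not reproduced here, the sketch assumes the form $k_{\mathrm{O}}(x)=\max\bigl[\lceil(n-1)^{\beta}\rceil,\min\{\lfloor B\{\bar{f}(x)(n-1)\}^{4/(d+4)}\rfloor,\lfloor(n-1)^{1-\beta}\rfloor\}\bigr]$; the precise truncation levels should be checked against (5). The majority-vote classifier and all function names are illustrative.

```python
import math

def oracle_k(fbar_x, n, d, beta, B):
    """Assumed oracle choice of k at a point with marginal density value fbar_x."""
    lower = math.ceil((n - 1) ** beta)         # keeps k growing with n
    upper = math.floor((n - 1) ** (1 - beta))  # keeps k/n tending to zero
    balanced = math.floor(B * (fbar_x * (n - 1)) ** (4 / (d + 4)))
    return max(lower, min(balanced, upper))

def local_knn_classify(x, train_x, train_y, k):
    """Predict 1 if at least half of the k nearest labels (Euclidean distance) equal 1."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(x, train_x[i]))
    s_hat = sum(train_y[i] for i in order[:k]) / k
    return 1 if s_hat >= 0.5 else 0
```

Under this assumed form, fewer neighbours are used where $\bar{f}$ is small, so that the neighbourhood radius, and hence the local bias, does not become too large in the tails.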

Theorem 2.

Fix $d\in\mathbb{N}$ and $\theta=(\epsilon_{0},M_{0},\rho,\ell,g)\in\Theta$ such that $\mathcal{P}_{d,\theta}\neq\emptyset$ . For each $0<B_{*}\leq B^{*}<\infty$ ,

(i) if $\rho>4$ then for $\beta<4d(\rho-4)/\{\rho(d+4)^{2}\}$ ,

[TABLE]

uniformly for $B\in[B_{*},B^{*}]$ as $n\to\infty$ , where

[TABLE]

(ii) if $\rho\leq 4$ and $\beta<\min\{1/2,4/(d+4)\}$ , then for every $\epsilon>0$

[TABLE]

uniformly for $B\in[B_{*},B^{*}]$ , as $n\to\infty$ .

Comparing Theorem 2(i) and Theorem 1(i), we see that, unlike for the global $k$-nearest neighbour classifier, we can guarantee an $O(n^{-4/(d+4)})$ rate of convergence for the excess risk of the oracle classifier, both in low dimensions ($d\leq 4$), and under a weaker condition on $\rho$ when $d\geq 5$. In particular, the condition on $\rho$ no longer depends on the dimension of the covariates. The guarantees in Theorem 2(ii) are also stronger than those provided by Theorem 1(ii) for any global choice of $k$. Examining the proof of Theorem 2, we find that the key difference with the proof of Theorem 1 is that we can now choose the region $\mathcal{R}_{n}$ (cf. the discussion of the proof of Theorem 1 in Section 3) to be larger.

4.2 The semi-supervised nearest neighbour classifier

Now consider the more realistic setting where the marginal density $\bar{f}$ of $X$ is unknown, but where we have access to an estimate $\hat{f}_{m}$ based on the unlabelled training set $\mathcal{T}^{\prime}_{m}$ . Of course, many different techniques are available, but for simplicity, we focus here on a kernel method. Let $K$ be a bounded kernel with $\int_{\mathbb{R}^{d}}K(x)\,dx=1$ , $\int_{\mathbb{R}^{d}}xK(x)\,dx=0$ , $\int_{\mathbb{R}^{d}}\|x\|^{2}|K(x)|\,dx<\infty$ , and let $R(K):=\int_{\mathbb{R}^{d}}K(x)^{2}\,dx$ . We further assume that $K(x)=Q(p(x))$ , where $p$ is a polynomial and $Q$ is a function of bounded variation. Now define a kernel density estimator of $\bar{f}$ , given by

[TABLE]
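For concreteness, here is a minimal sketch of a kernel density estimator of the usual form $\hat{f}_{m}(x)=\frac{1}{mh^{d}}\sum_{j=1}^{m}K\bigl((x-X_{n+j})/h\bigr)$, which we assume matches the display above. The Gaussian kernel is one choice satisfying the stated conditions (it is bounded, integrates to one, has zero mean and finite second moment, and is a bounded-variation function of the polynomial $\|x\|^{2}$); the function names are illustrative.

```python
import math

def gaussian_kernel(u):
    """Standard Gaussian kernel on R^d, evaluated at the sequence u."""
    d = len(u)
    sq_norm = sum(c * c for c in u)
    return math.exp(-0.5 * sq_norm) / (2 * math.pi) ** (d / 2)

def kde(x, unlabelled, h, kernel=gaussian_kernel):
    """Kernel density estimate at x from the unlabelled sample, with bandwidth h."""
    d = len(x)
    m = len(unlabelled)
    total = 0.0
    for z in unlabelled:
        total += kernel([(xi - zi) / h for xi, zi in zip(x, z)])
    return total / (m * h ** d)
```

The resulting value `kde(x, unlabelled, h)` can then be substituted for $\bar{f}(x)$ in a local choice of $k$.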

Motivated by the oracle local choice of $k$ in (5), for $\beta\in(0,1/2)$ and $B>0$ , let

[TABLE]

Our main result in this setting will require an additional smoothness condition on the marginal feature density $\bar{f}$ in order to ensure that $\hat{f}_{m}$ estimates it well. For $d\in\mathbb{N}$ , $\gamma\in(0,1]$ and $\lambda>0$ , let $\mathcal{Q}_{d,\gamma,\lambda}$ denote the class of distributions $P$ on $\mathbb{R}^{d}\times\{0,1\}$ whose marginal distribution $P_{X}$ is absolutely continuous with respect to Lebesgue measure with Radon–Nikodym derivative $\bar{f}$ satisfying $\|\bar{f}\|_{\infty}\leq\lambda$ and

[TABLE]

If $\gamma\in(1,2]$ , then we define $\mathcal{Q}_{d,\gamma,\lambda}$ to consist of distributions $P$ on $\mathbb{R}^{d}\times\{0,1\}$ whose marginal distribution $P_{X}$ is again absolutely continuous with Radon–Nikodym derivative $\bar{f}$ satisfying $\|\bar{f}\|_{\infty}\leq\lambda$ , but we now ask that $\bar{f}$ be differentiable, and that

[TABLE]

In Appendix B, we show that the distribution considered in Example 1 belongs to $\mathcal{Q}_{d,\gamma,\lambda}$ with $\gamma=2$ provided that $\lambda\geq 6\pi^{-d/2}\Gamma(3+d/2)$ .

Theorem 3.

Fix $d\in\mathbb{N}$ , $\theta=(\epsilon_{0},M_{0},\rho,\ell,g)\in\Theta$ , $\gamma\in(0,2]$ and $\lambda>0$ such that $\mathcal{P}_{d,\theta}\cap\mathcal{Q}_{d,\gamma,\lambda}\neq\emptyset$ . Let $m_{0}>0$ , let $0<A_{*}\leq A^{*}<\infty$ and $0<B_{*}\leq B^{*}<\infty$ , and let $h=h_{m}:=Am^{-1/(d+2\gamma)}$ for some $A>0$ .

(i) If $\rho>4$ and $\beta<4d(\rho-4)/\{\rho(d+4)^{2}\}$ ,

[TABLE]

uniformly for $A\in[A_{*},A^{*}]$ , $B\in[B_{*},B^{*}]$ and $m=m_{n}\geq m_{0}(n-1)^{2+d/\gamma}$ , where $B_{3}$ was defined in Theorem 2(i).

(ii) If $\rho\leq 4$ and $\beta<\min\{1/2,4/(d+4)\}$, then for every $\epsilon>0$,

[TABLE]

uniformly for $A\in[A_{*},A^{*}]$ , $B\in[B_{*},B^{*}]$ and $m=m_{n}\geq m_{0}(n-1)^{2+d/\gamma}$ .
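The bandwidth and unlabelled sample size conditions in Theorem 3 are easy to evaluate numerically. The sketch below simply computes $h_{m}=Am^{-1/(d+2\gamma)}$ and the lower bound $m_{0}(n-1)^{2+d/\gamma}$ on $m$; the default values of `A` and `m0` are arbitrary illustrative choices.

```python
def bandwidth(m, d, gamma, A=1.0):
    """Bandwidth h_m = A * m^(-1/(d + 2*gamma)) from Theorem 3."""
    return A * m ** (-1.0 / (d + 2 * gamma))

def min_unlabelled_size(n, d, gamma, m0=1.0):
    """Smallest unlabelled sample size allowed by m >= m0 * (n - 1)^(2 + d/gamma)."""
    return m0 * (n - 1) ** (2 + d / gamma)
```

For instance, with $n=101$, $d=2$ and $\gamma=2$, the condition asks for roughly $10^{6}$ unlabelled observations when $m_{0}=1$, which indicates how quickly the required amount of unlabelled data grows with $n$.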

Examination of the proof of Theorem 3 reveals that the key property of our kernel estimator $\hat{f}_{m}$ of $\bar{f}$ is that there exists $\alpha>(1+d/4)\beta$ such that

[TABLE]

This observation would allow similar results to Theorem 3 to be proved for other versions of the semi-supervised nearest neighbour classifier, with alternative estimators of $\bar{f}$ in the definition of $\hat{k}_{\mathrm{SS}}(\cdot)$ , subject potentially to suitable modifications of the class $\mathcal{Q}_{d,\gamma,\lambda}$ . It is therefore not our intention to argue that the kernel density approach is superior to other methods of estimating the marginal density $\bar{f}$ .

5 Minimax lower bounds

Our main minimax lower bound is the following:

Theorem 4.

Fix $d\in\mathbb{N}$ , $\rho>0$ , $g\in\mathcal{G}$ with $r\mapsto r/g^{-1}(r)$ increasing for sufficiently small $r>0$ , and $\gamma\in(0,2]$ . There exist $\lambda_{*}>0$ , $\epsilon_{*}>0$ and $M_{*}>0$ , depending only on $d$ , such that for $\lambda\geq\lambda_{*}$ , $M_{0}\geq M_{*}$ , $\epsilon_{0}\in(0,\min(\epsilon_{*},1/(4M_{0}))]$ and $\ell\in\mathcal{L}$ with $\ell(\delta)\geq 2/\epsilon_{0}$ for all $\delta\in(0,\infty)$ , writing $\theta=(\epsilon_{0},M_{0},\rho,\ell,g)\in\Theta$ , we can find $c=c(d,\theta,\gamma,\lambda)>0$ such that for all $n\in\mathbb{N}$ and all $\nu\geq 0$ , we have

[TABLE]

where $q=q_{n}\in(1/\|g\|_{\infty},\infty)$ is the unique solution to $\frac{q^{4+d+\nu(\rho+d)}}{g^{-1}(1/q)^{2}}=n$ and the infimum is taken over all measurable functions $C_{n}:(\mathbb{R}^{d}\times\{0,1\})^{\times n}\times\mathbb{R}^{d}\rightarrow\{0,1\}$ . In particular, for every $\epsilon>0$ , there exists $c=c(d,\theta,\gamma,\lambda,\epsilon)>0$ such that

[TABLE]

Remark 5.5.

The proof of this result also reveals that the lower bound holds even if the classifier is allowed to depend on some unlabelled data, or on the true marginal density $\bar{f}$ of $X$.

Example 5.6.

Consider the case where $g(\epsilon)=\exp(-1/\epsilon)$ , so $g\in\mathcal{G}$ . Then for $q\in(1,\infty)$ , we have $g^{-1}(1/q)=1/\log q$ , so for $n\in\mathbb{N}$ ,

[TABLE]

Thus, if $\rho>4$ , then we can take $\nu=0$ in Theorem 4 to obtain a minimax lower bound of order $n^{-4/(4+d)}/\log^{2}n$ ; on the other hand, if $\rho\leq 4$ , then we can take $\nu=\log^{1/2}n$ to obtain a minimax lower bound of order $n^{-(\frac{\rho}{\rho+d}+\epsilon)}$ , for every $\epsilon>0$ . Combining this result with Theorem 3, we see that for every $\rho\in(0,\infty)$ , our semi-supervised local- $k$ -nearest neighbour classifier attains the minimax optimal rate over the class $\mathcal{P}_{d,\theta}\cap\mathcal{Q}_{d,\gamma,\lambda}$ up to polylogarithmic factors when $\rho>4$ and up to subpolynomial factors when $\rho\leq 4$ .
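In the setting of Example 5.6, the defining equation for $q_{n}$ becomes $q^{4+d+\nu(\rho+d)}\log^{2}q=n$, which has no closed-form solution but is straightforward to solve numerically. The sketch below uses bisection on the logarithm of the left-hand side, which is increasing in $q$ on $(1,\infty)$. The function name is illustrative, and the caller should check separately that the returned value lies in the range $(1/\|g\|_{\infty},\infty)=(e,\infty)$ required in Theorem 4.

```python
import math

def solve_q(n, d, rho, nu=0.0):
    """Solve q^(4 + d + nu*(rho + d)) * (log q)^2 = n for q > 1 by bisection."""
    a = 4 + d + nu * (rho + d)

    def log_lhs(q):
        # Logarithm of q^a * (log q)^2; increasing in q on (1, infinity).
        return a * math.log(q) + 2 * math.log(math.log(q))

    target = math.log(n)
    lo, hi = 1.0 + 1e-12, 2.0
    while log_lhs(hi) < target:  # expand until the root is bracketed
        hi *= 2
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if log_lhs(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With $\nu=0$, the solution is slightly smaller than $n^{1/(4+d)}$ once $q>e$, reflecting the logarithmic factor in the equation.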

6 Proofs

The proofs of Theorems 1, 2 and 3 rely on the general asymptotic expansion presented in Theorem 6.7 below. We begin with some further notation. Define the $d\times n$ matrices $X^{n}:=(X_{1}\dots X_{n})$ and $x^{n}:=(x_{1}\dots x_{n})$ . Write

[TABLE]

and

[TABLE]

Here we have used the fact that the ordered labels $Y_{(1)},\ldots,Y_{(n)}$ are independent given $X^{n}$ , satisfying $\mathbb{P}(Y_{(i)}=1|X^{n})=\eta(X_{(i)})$ . Since $\eta$ takes values in $[0,1]$ it is clear that $0\leq\hat{\sigma}_{n}^{2}(x)\leq\frac{1}{4k_{\mathrm{L}}(x)}$ for all $x\in\mathbb{R}^{d}$ . Further, write $\mu_{n}(x):=\mathbb{E}\{\hat{S}_{n}(x)\}=\frac{1}{k_{\mathrm{L}}(x)}\sum_{i=1}^{k_{\mathrm{L}}(x)}\mathbb{E}\eta(X_{(i)})$ for the unconditional expectation of $\hat{S}_{n}(x)$ . Recall also that $p_{r}(x)=P_{X}\bigl{(}B_{r}(x)\bigr{)}$ .
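To spell out the variance bound: conditional on $X^{n}$, $\hat{S}_{n}(x)$ is an average of $k_{\mathrm{L}}(x)$ independent Bernoulli random variables with success probabilities $\eta(X_{(i)})$, and $p(1-p)\leq 1/4$ for every $p\in[0,1]$, so that

$$\hat{\sigma}_{n}^{2}(x)=\frac{1}{k_{\mathrm{L}}(x)^{2}}\sum_{i=1}^{k_{\mathrm{L}}(x)}\eta(X_{(i)})\bigl\{1-\eta(X_{(i)})\bigr\}\leq\frac{1}{k_{\mathrm{L}}(x)^{2}}\cdot\frac{k_{\mathrm{L}}(x)}{4}=\frac{1}{4k_{\mathrm{L}}(x)}.$$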

6.1 A general asymptotic expansion

Let

[TABLE]

Further, for $x\in\mathbb{R}^{d}$ , let

[TABLE]

Recall that $\mathcal{S}=\{x\in\mathbb{R}^{d}:\eta(x)=1/2\}$ , and note that by Proposition G.17 in Appendix G, for $\epsilon>0$ , we can write

[TABLE]

Let

[TABLE]

and recall the definition of the function $a(\cdot)$ in (3).

Theorem 6.7.

Fix $d\in\mathbb{N}$ and $\theta=(\epsilon_{0},M_{0},\rho,\ell,g)\in\Theta$ such that $\mathcal{P}_{d,\theta}\neq\emptyset$ . For $n$ sufficiently large, let $\mathcal{R}_{n}\subseteq\bigl{\{}x\in\mathbb{R}^{d}:\bar{f}(x)\geq\delta_{n}(x)\bigr{\}}$ be a $d$ -dimensional manifold. Write $\partial\mathcal{R}_{n}$ for the topological boundary of $\mathcal{R}_{n}$ , let $(\partial\mathcal{R}_{n})^{\epsilon}:=\partial\mathcal{R}_{n}+\epsilon\bar{B}_{1}(0)$ , and let $\mathcal{S}_{n}:=\mathcal{S}\cap\mathcal{R}_{n}$ . For $\beta\in(0,1/2)$ and $\tau>0$ define the class of functions

[TABLE]

Then for each $\beta\in(0,1/2)$ and each $\tau=\tau_{n}$ with $\tau_{n}\searrow 0$ , we have

[TABLE]

as $n\to\infty$ , where $\sup_{P\in\mathcal{P}_{d,\theta}}\sup_{k_{\mathrm{L}}\in K_{\beta,\tau}}|W_{n,1}|/\gamma_{n}(k_{\mathrm{L}})\rightarrow 0$ with

[TABLE]

and where $\limsup_{n\rightarrow\infty}\sup_{P\in\mathcal{P}_{d,\theta}}\sup_{k_{\mathrm{L}}\in K_{\beta,\tau}}|W_{n,2}|/P_{X}\bigl{(}(\partial\mathcal{R}_{n})^{\epsilon_{n}}\cap\mathcal{S}^{\epsilon_{n}}\bigr{)}\leq 1$ .

Proof 6.8 (Proof of Theorem 6.7).

First observe that

[TABLE]

The proof is presented in seven steps. We will see that the dominant contribution to the integral in (6.8) arises from a small neighbourhood about the Bayes decision boundary, i.e. the region $\mathcal{S}^{\epsilon_{n}}\cap\mathcal{R}_{n}$. On $\mathcal{R}_{n}\setminus\mathcal{S}^{\epsilon_{n}}$, the $k_{\mathrm{L}}$nn classifier agrees with the Bayes classifier with high probability (asymptotically). More precisely, we show in Step 4 that

[TABLE]

for each $M>0$, as $n\to\infty$. In Steps 1, 2 and 3, we derive the key asymptotic properties of the bias, conditional (on $X^{n}$) bias and variance of $\hat{S}_{n}(x)$ respectively. In Step 5 we show that the integral over $\mathcal{S}^{\epsilon_{n}}\cap\mathcal{R}_{n}$ can be decomposed into an integral over $\mathcal{S}_{n}$ and one perpendicular to $\mathcal{S}$. Step 6 is dedicated to combining the results of Steps 1–5; we derive the leading order terms in the asymptotic expansion of the integral in (6.8). Finally, we bound the remaining error terms to conclude the proof in Step 7, which is presented in Appendix E. To ease notation, where it is clear from the context, we write $k_{\mathrm{L}}$ in place of $k_{\mathrm{L}}(x)$.

Step 1: Let $\mu_{n}(x):=\mathbb{E}\{\hat{S}_{n}(x)\}$ , and for $x_{0}\in\mathcal{S}$ and $t\in\mathbb{R}$ , write $x=x(x_{0},t):=x_{0}+t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}$ . We show that

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ , $x_{0}\in\mathcal{S}_{n}$ and $|t|<\epsilon_{n}$ . Write

[TABLE]

where we show in Step 7 that

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ , $x_{0}\in\mathcal{S}_{n}$ and $|t|<\epsilon_{n}$ .

The density of $X_{(i)}-x$ at $u\in\mathbb{R}^{d}$ is given by

[TABLE]

where $p_{\|u\|}=p_{\|u\|}(x)$ and $p_{\|u\|}^{n-1}(i-1)$ denotes the probability that a $\mathrm{Bin}(n-1,p_{\|u\|})$ random variable equals $i-1$ . Now let

[TABLE]

We show in Step 7 that

[TABLE]

for each $M>0$ , as $n\to\infty$ . It follows from (11) and (13), together with the upper bound on $\sup_{x\in\mathcal{S}^{2\epsilon_{0}}}\|\dot{\eta}(x)\|$ in (A.3) that

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ , $i\in\{1,\ldots,k_{\mathrm{L}}\}$ , $x_{0}\in\mathcal{S}_{n}$ and $|t|<\epsilon_{n}$ . Similarly, using the upper bound on $\sup_{x\in\mathcal{S}^{2\epsilon_{0}}}\|\ddot{\eta}(x)\|_{\mathrm{op}}$ in (A.3),

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ , $i\in\{1,\ldots,k_{\mathrm{L}}\}$ , $x_{0}\in\mathcal{S}_{n}$ and $|t|<\epsilon_{n}$ . Hence, summing over $i$ , we see that

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ , $i\in\{1,\ldots,k_{\mathrm{L}}\}$ , $x_{0}\in\mathcal{S}_{n}$ and $|t|<\epsilon_{n}$ , where $q_{\|u\|}^{n-1}(k_{\mathrm{L}})$ denotes the probability that a $\mathrm{Bin}(n-1,p_{\|u\|})$ random variable is less than $k_{\mathrm{L}}$ . Let $n_{0}\in\mathbb{N}$ be large enough that

[TABLE]

for $n\geq n_{0}$ . That this is possible follows from the fact that, for $\epsilon_{n}<\epsilon_{0}$ ,

[TABLE]

By a Taylor expansion of $\bar{f}$ and assumption (A.2), for all $x_{0}\in\mathcal{S}_{n}$ , $|t|<\epsilon_{n}$ , $\|u\|<r_{n}$ and $n\geq n_{0}$ ,

[TABLE]

Hence, for $x_{0}\in\mathcal{S}_{n}$ , $|t|<\epsilon_{n}$ , $r<r_{n}$ and $n\geq n_{0}$ ,

[TABLE]

Now, for $v\in B_{1}(0)$ , $x_{0}\in\mathcal{S}_{n}$ , $|t|<\epsilon_{n}$ and $n\geq n_{0}$ ,

[TABLE]

where

[TABLE]

It follows from (6.8) that there exists $n_{1}\in\mathbb{N}$ such that, for all $x_{0}\in\mathcal{S}_{n}$ , $|t|<\epsilon_{n}$ , $\|v\|^{d}\in(0,1/2-1/\log{((n-1)/k_{\mathrm{L}}(x_{0}))}]$ and $n\geq n_{1}$ ,

[TABLE]

Similarly, for all $\|v\|^{d}\in[1/2+1/\log((n-1)/k_{\mathrm{L}}(x_{0})),1)$ and $n\geq n_{1}$ ,

[TABLE]

Hence, by Bernstein’s inequality, we have that for each $M>0$ ,

[TABLE]

and

[TABLE]

We conclude that

[TABLE]

where

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ , $x_{0}\in\mathcal{S}_{n}$ and $|t|<\epsilon_{n}$ .

Step 2: Recall that $\hat{\sigma}_{n}^{2}(x,x^{n})=\mathrm{Var}\{\hat{S}_{n}(x)|X^{n}=x^{n}\}$ . We show that

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ , $x_{0}\in\mathcal{S}_{n}$ and $|t|<\epsilon_{n}$ . Recall that

[TABLE]

Let $n_{2}\in\mathbb{N}$ be large enough that $1-c_{n}\epsilon_{n}-\frac{d+1}{d+2}c_{n}\epsilon_{n}^{2}\geq\epsilon_{0}$ for $n\geq n_{2}$ . Then for $n\geq\max\{n_{0},n_{2}\}$ , $P\in\mathcal{P}_{d,\theta}$ , $r<\epsilon_{n}$ , $x_{0}\in\mathcal{S}_{n}$ and $|t|<\epsilon_{n}$ , we have by (A.2) and a very similar argument to that in (6.8) that

[TABLE]

Now suppose that $z_{1},\dots,z_{N}\in\mathcal{R}_{n}\cup\mathcal{S}_{n}^{\epsilon_{n}}$ are such that $\|z_{j}-z_{\ell}\|\geq\epsilon_{n}/6$ for all $j\neq\ell$ , but $\sup_{x\in\mathcal{R}_{n}\cup\mathcal{S}_{n}^{\epsilon_{n}}}\min_{j=1,\dots,N}\|x-z_{j}\|<\epsilon_{n}/6$ . We have by (A.2) that

[TABLE]

For each $j=1,\ldots,N$ , choose

[TABLE]

Now, given $x\in\mathcal{R}_{n}\cup\mathcal{S}_{n}^{\epsilon_{n}}$ , let $j_{0}:=\operatorname*{argmin}_{j}\|x-z_{j}\|$ , so that $B_{\epsilon_{n}/6}(z_{j_{0}}^{\prime})\subseteq B_{\epsilon_{n}/2}(x)$ . Thus, if there are at least $k_{\mathrm{L}}(z_{j}^{\prime})$ points among $\{x_{1},\ldots,x_{n}\}$ inside each of the balls $B_{\epsilon_{n}/6}(z_{j}^{\prime})$ , then for every $x\in\mathcal{R}_{n}\cup\mathcal{S}_{n}^{\epsilon_{n}}$ there are at least $k_{\mathrm{L}}(x)$ of them in $B_{\epsilon_{n}/2}(x)$ . Moreover by (6.8), (19) and (A.2),

[TABLE]

for all $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ and $n\geq n_{3}$ , say. Define $A_{k_{\mathrm{L}}}:=\bigl{\{}\|X_{(k_{\mathrm{L}})}(x)-x\|<\epsilon_{n}/2\ \mbox{for all}\ x\in\mathcal{R}_{n}\cup\mathcal{S}_{n}^{\epsilon_{n}}\bigr{\}}$ . Then by a standard binomial tail bound (Shorack and Wellner, 1986, Equation (6), p. 440), for $n\geq n_{3}$ and any $M>0$ ,

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ and $k_{\mathrm{L}}\in K_{\beta,\tau}$ . Now, for $3\epsilon_{n}/2<2\epsilon_{0}$ ,

[TABLE]

It follows that

[TABLE]

as $n\to\infty$ , uniformly for $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ , $x_{0}\in\mathcal{S}_{n}$ and $|t|<\epsilon_{n}$ . The claim (18) follows from (6.8) and (21).

Step 3: In this step, we emphasise the dependence of $\hat{\mu}_{n}(x,x^{n})=\mathbb{E}\{\hat{S}_{n}(x)|X^{n}=x^{n}\}$ on $k_{\mathrm{L}}$ by writing it as $\hat{\mu}_{n}^{(k_{\mathrm{L}})}(x,x^{n})$ . We show that

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ , $x_{0}\in\mathcal{S}_{n}$ and $|t|<\epsilon_{n}$ . We will write $X^{n,j}\ :=\ (X_{1}\ldots X_{j-1}\ X_{j+1}\ldots X_{n})$ , considered as a random $d\times(n-1)$ matrix, so that

[TABLE]

It follows from the Efron–Stein inequality (e.g. Boucheron, Lugosi and Massart, 2013, Theorem 3.1) that

[TABLE]

Recall the definition of $r_{n}$ given in (12). Now observe that, for $\max(\epsilon_{n},r_{n})\leq\epsilon_{0}$ and all $M>0$ we have that

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ , $x_{0}\in\mathcal{S}_{n}$ and $|t|<\epsilon_{n}$ . The final inequality here follows from similar arguments to those used to bound $R_{1}$ . Now (22) follows from (6.8) and (6.8).

Step 4: We show that

[TABLE]

for each $M>0$ , as $n\to\infty$ . First, by (A.3) and Proposition G.17 in Section G.2, there exists $c_{0}>0$ such that for every $r\in(0,\epsilon_{0}]$ , $P\in\mathcal{P}_{d,\theta}$ and $k_{\mathrm{L}}\in K_{\beta,\tau}$ ,

[TABLE]

Hence, on the event $A_{k_{\mathrm{L}}}$ , for $\epsilon_{n}<\epsilon_{0}$ and $x\in\mathcal{R}_{n}\setminus\mathcal{S}^{\epsilon_{n}}$ , all of the $k_{\mathrm{L}}$ nearest neighbours of $x$ are on the same side of $\mathcal{S}$ , so

[TABLE]

Now, conditional on $X^{n}$ , $\hat{S}_{n}(x)$ is the sum of $k_{\mathrm{L}}(x)$ independent terms. Therefore, by Hoeffding’s inequality,

[TABLE]

for every $M>0$ . This completes Step 4.
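The role of Hoeffding's inequality in Step 4 can be illustrated in the simplest case, where all $k$ neighbours share a common value $\eta>1/2$ of the regression function (an idealisation made only for this sketch). The probability that the vote falls below $1/2$ is then an exact binomial tail, and Hoeffding's inequality bounds it by $\exp\{-2k(\eta-1/2)^{2}\}$. The function names are illustrative.

```python
import math

def hoeffding_bound(k, gap):
    """Hoeffding bound exp(-2*k*gap^2) on P(mean of k independent [0,1] variables <= its expectation - gap)."""
    return math.exp(-2 * k * gap * gap)

def exact_prob_vote_below_half(k, eta):
    """P(Bin(k, eta) / k < 1/2), computed by summing the binomial probabilities."""
    return sum(
        math.comb(k, j) * eta ** j * (1 - eta) ** (k - j)
        for j in range(k + 1)
        if 2 * j < k
    )
```

Whenever $k(\eta-1/2)^{2}$ grows faster than $\log n$, the bound is $O(n^{-M})$ for every $M>0$, which is the type of conclusion drawn in Step 4.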

Step 5: It is now convenient to be more explicit in our notation, by writing $x_{0}^{t}:=x_{0}+t\dot{\eta}(x_{0})/\|\dot{\eta}(x_{0})\|$ . We also let

[TABLE]

Recall that $\mathcal{S}_{n}:=\mathcal{S}\cap\mathcal{R}_{n}$ and let

[TABLE]

We show that

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ and $k_{\mathrm{L}}\in K_{\beta,\tau}$ , and that for all $n\geq 2$ ,

[TABLE]

Now by Proposition G.19 in Section G.2, for $\epsilon_{n}\leq\epsilon_{0}$ , the map $x(x_{0},t)=x_{0}^{t}$ is a diffeomorphism from $\mathcal{S}_{n}\times(-\epsilon_{n},\epsilon_{n})$ to $\mathcal{S}_{n}^{\epsilon_{n}}$ , where

[TABLE]

Furthermore, for such $n$ , and $|t|<\epsilon_{n}$ , $\text{sgn}\{\eta(x_{0}^{t})-1/2\}=\text{sgn}(t)$ . It follows from this and (G.3) in Section G.3 that

[TABLE]

where $B$ is defined in (57) in Section G.2, and $\det(I+tB)=1+o(1)$ as $n\rightarrow\infty$ , uniformly for $P\in\mathcal{P}_{d,\theta}$ , $x_{0}\in\mathcal{S}$ and $t\in(-\epsilon_{n},\epsilon_{n})$ . Now observe that $(\mathcal{S}^{\epsilon_{n}}\cap\mathcal{R}_{n})\setminus\mathcal{S}_{n}^{\epsilon_{n}}\subseteq(\partial\mathcal{R}_{n})^{\epsilon_{n}}\cap\mathcal{S}^{\epsilon_{n}}$ and $\mathcal{S}_{n}^{\epsilon_{n}}\setminus(\mathcal{S}^{\epsilon_{n}}\cap\mathcal{R}_{n})\subseteq(\partial\mathcal{R}_{n})^{\epsilon_{n}}\cap\mathcal{S}^{\epsilon_{n}}$ . We deduce from this and the definition of $W_{n,2}$ that (25) holds.

Step 6: The last step in the main argument is to show that

[TABLE]

as $n\to\infty$ , uniformly for $P\in\mathcal{P}_{d,\theta}$ and $k_{\mathrm{L}}\in K_{\beta,\tau}$ . First observe that

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ and $k_{\mathrm{L}}\in K_{\beta,\tau}$ . Now, write $\mathbb{P}\{\hat{S}_{n}(x_{0}^{t})<1/2\}-\mathbbm{1}_{\{t<0\}}=\mathbb{E}[\mathbb{P}\{\hat{S}_{n}(x_{0}^{t})<1/2|X^{n}\}-\mathbbm{1}_{\{t<0\}}].$ Note that, given $X^{n}$ , $\hat{S}_{n}(x)=\frac{1}{k_{\mathrm{L}}(x)}\sum_{i=1}^{k_{\mathrm{L}}(x)}\mathbbm{1}_{\{Y_{(i)}=1\}}$ is the sum of $k_{\mathrm{L}}(x)$ independent Bernoulli variables, satisfying $\mathbb{P}(Y_{(i)}=1|X^{n})=\eta(X_{(i)})$ . Let $\Phi$ be the standard normal distribution function, and let

[TABLE]

We can write

[TABLE]

where we show in Step 7 that

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ and $k_{\mathrm{L}}\in K_{\beta,\tau}$ . Then, substituting $u=2k_{\mathrm{L}}(x_{0})^{1/2}t$ , we see that

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ and $x_{0}\in\mathcal{S}_{n}$ . The conclusion follows by integrating with respect to $d\mathrm{Vol}^{d-1}$ over $\mathcal{S}_{n}$ .
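The normal approximation underlying Step 6 can be checked numerically in an idealised case. The sketch below assumes that all $k$ neighbours share a common value $\eta$ close to $1/2$, so that the conditional variance of $\hat{S}_{n}(x)$ is close to $1/(4k)$, and compares the exact probability that the vote falls below $1/2$ with $\Phi\bigl(2k^{1/2}(1/2-\eta)\bigr)$. This is an illustration of the approximation only, not a reproduction of the paper's expansion.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def exact_prob_below_half(k, eta):
    """P(Bin(k, eta) / k < 1/2)."""
    return sum(
        math.comb(k, j) * eta ** j * (1 - eta) ** (k - j)
        for j in range(k + 1)
        if 2 * j < k
    )

def normal_approx_below_half(k, eta):
    """Normal approximation using the variance proxy 1/(4k), valid for eta near 1/2."""
    return std_normal_cdf(2 * math.sqrt(k) * (0.5 - eta))
```

For $k=400$ and $\eta=0.52$, the two quantities differ by roughly one or two percentage points, and the discrepancy shrinks as $k$ grows with $2k^{1/2}(\eta-1/2)$ held fixed.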

Step 7: It remains to bound the error terms $R_{1},R_{2},R_{5}$ and $R_{6}$ – these bounds are presented in Appendix E.

6.2 Proof of Theorem 1

Proof 6.9 (Proof of Theorem 1).

Let $k\in K_{\beta}$ , and note that since $k_{\mathrm{L}}(x)=k$ is constant, we have that $c_{n}=\ell\bigl{(}k/(n-1)\bigr{)}$ , and $\delta_{n}\,=\,\frac{k}{n-1}c_{n}^{d}\log^{d}(\frac{n-1}{k}).$ Now let

[TABLE]

and observe that by Berrett et al. (2019, Lemma 10(i)), for $P\in\mathcal{P}_{d,\theta}$ ,

[TABLE]

It follows that we can find $n_{0}\in\mathbb{N}$ large enough that $\mathcal{R}_{n}$ is non-empty for all $P\in\mathcal{P}_{d,\theta}$, $k\in K_{\beta}$ and $n\geq n_{0}$, so that, by Assumption (A.1), for $n\geq n_{0}$ it is an open subset of $\mathbb{R}^{d}$, and therefore a $d$-dimensional manifold. Let $\mathcal{S}_{n}:=\mathcal{S}\cap\mathcal{R}_{n}$,

[TABLE]

and

[TABLE]

Recalling the definition of $\epsilon_{n}$ in (8), for $n\geq n_{0}$ , we may apply Theorem 6.7 with $k_{\mathrm{L}}(x)=k$ for all $x\in\mathbb{R}^{d}$ to deduce that

[TABLE]

where $\sup_{P\in\mathcal{P}_{d,\theta}}\sup_{k\in K_{\beta}}|W_{n,1}|/\gamma_{n}(k)\rightarrow 0$ and where

[TABLE]

We now show that, under the conditions of part (i), $B_{1,n}$ and $B_{2,n}$ are well approximated by integrals over the whole of the manifold $\mathcal{S}$ , and that these integrals are uniformly bounded. Given $x_{0}\in\mathcal{S}\cap\{x\in\mathbb{R}^{d}:\bar{f}(x)>0\}$ , define $\epsilon_{0}(x_{0}):=\min\bigl{\{}1,\frac{\epsilon_{0}\log 2}{2d},\frac{1}{4\ell(\bar{f}(x_{0}))}\bigr{\}}$ . Then for any $t\in[-\epsilon_{0}(x_{0}),\epsilon_{0}(x_{0})]$ we have by (A.2) and Cauchy–Schwarz that

[TABLE]

Moreover, writing $\lambda_{1},\ldots,\lambda_{d}$ for the eigenvalues of the matrix $B$ defined in (57), for $t\in[-\epsilon_{0}(x_{0}),\epsilon_{0}(x_{0})]$ , we have

[TABLE]

so $\det(I+tB)\geq 1/2$ . Hence, for any $\tau\in(d/(\rho+d),1]$ there exists $A_{\tau}=A_{\tau}(d,\theta)>0$ such that, writing $\bar{\tau}:=\frac{1}{2}(\tau+\frac{d}{\rho+d})$ , by (G.3), Hölder’s inequality and (A.4), we have

[TABLE]

Now, by Assumption (A.3), for any $P\in\mathcal{P}_{d,\theta}$ ,

[TABLE]

Moreover, writing $\bar{\tau}:=\frac{1}{2}(1+\frac{d}{\rho+d})$ ,

[TABLE]

By Assumptions (A.2), (A.3), (6.9) and the fact that $\rho/(\rho+d)>4/d$ , we have, writing $\bar{\tau}:=\frac{1}{2}(1-4/d+\frac{d}{\rho+d})$ , that

[TABLE]

Similarly,

[TABLE]

A similar argument shows that $\gamma_{n}(k)=O\bigl{(}1/k+(k/n)^{4/d}\bigr{)}$ , uniformly for $P\in\mathcal{P}_{d,\theta}$ and $k\in K_{\beta}$ .

Finally, we bound $P_{X}\bigl{(}(\partial\mathcal{R}_{n})^{\epsilon_{n}}\cap\mathcal{S}^{\epsilon_{n}}\bigr{)}$ and $R_{\mathcal{R}_{n}^{c}}(\hat{C}_{n}^{k\mathrm{nn}})-R_{\mathcal{R}_{n}^{c}}(C^{\mathrm{Bayes}})$ . Suppose that $x\in(\partial\mathcal{R}_{n})^{\epsilon_{n}}\cap\mathcal{S}^{\epsilon_{n}}$ . Then there exists $z\in\partial\mathcal{R}_{n}\cap B_{\epsilon_{n}}(x)\cap\mathcal{S}^{2\epsilon_{n}}$ with $\bar{f}(z)=\delta_{n}$ . By Assumption (A.2) we have that

[TABLE]

Thus there exists $n_{1}\in\mathbb{N}$ such that $(\partial\mathcal{R}_{n})^{\epsilon_{n}}\cap\mathcal{S}^{\epsilon_{n}}\subseteq\{x\in\mathbb{R}^{d}:\bar{f}(x)\leq 2\delta_{n}\}$ for $n\geq n_{1}$ . By the moment assumption in (A.4) and Hölder’s inequality, observe that for any $\alpha\in(0,1)$ , $P\in\mathcal{P}_{d,\theta}$ , $n\geq n_{1}$ and $\epsilon>0$ ,

[TABLE]

uniformly for $k\in K_{\beta}$ . Moreover,

[TABLE]

so the same bound (6.9) applies. Since $\rho/(\rho+d)>4/d$ and $\alpha\in(0,1)$ was arbitrary, this completes the proof of part (i).

For part (ii), in contrast to part (i), the dominant contribution to the excess risk could now arise from the tail of the distribution. First, as in part (i), we have $B_{1,n}\to B_{1}\leq A_{1}^{\prime}/(4\epsilon_{0}M_{0})$ , uniformly for $P\in\mathcal{P}_{d,\theta}$ and $k\in K_{\beta}$ . Furthermore, using Assumption (A.3), (6.9) and the fact that $4/d\geq\rho/(\rho+d)$ , we see that, for any $\epsilon^{\prime}\in(0,\rho/(\rho+d)]$ ,

[TABLE]

for every $\epsilon\in\bigl{(}\epsilon^{\prime},\rho/(\rho+d)\bigr{]}$ , uniformly for $P\in\mathcal{P}_{d,\theta}$ and $k\in K_{\beta}$ , where the final conclusion follows from the fact that $\sup_{P\in\mathcal{P}_{d,\theta}}\sup_{x_{0}\in\mathcal{S}_{n}}a^{2}(x_{0})/c_{n}^{2}$ is bounded. We can also bound $\gamma_{n}(k)$ by the same argument, so the result follows in the same way as in part (i).

6.3 Proofs of results from Section 4

Proof 6.10 (Proof of Theorem 2).

Recall that

[TABLE]

and define

[TABLE]

where $c_{n}:=\sup_{x_{0}\in\mathcal{S}:\bar{f}(x_{0})\geq k_{\mathrm{O}}(x_{0})/(n-1)}\ell\bigl{(}\bar{f}(x_{0})\bigr{)}$ . For $\alpha\in((1+d/4)\beta,1)$ let

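$$\mathcal{R}_{n}:=\bigl\{x\in\mathbb{R}^{d}:\bar{f}(x)>(n-1)^{-(1-\alpha)}\bigr\}\cap\mathcal{X}_{\bar{f}}.$$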

Then there exists $n_{0}\in\mathbb{N}$ such that for $n\geq n_{0}$ we have $\mathcal{R}_{n}\subseteq\bigl{\{}x\in\mathbb{R}^{d}:\bar{f}(x)\geq\delta_{n,\mathrm{O}}(x)\bigr{\}}$ for all $P\in\mathcal{P}_{d,\theta}$ and $B\in[B_{*},B^{*}]$ , and by Assumption (A.1) and (27), we then have that $\mathcal{R}_{n}$ is a $d$ -dimensional manifold. There exists $n_{1}\in\mathbb{N}$ such that for all $n\geq n_{1}$ , $P\in\mathcal{P}_{d,\theta}$ , $B\in[B_{*},B^{*}]$ and $x\in\mathcal{R}_{n}\cap\mathcal{S}^{\epsilon_{0}}$ we have that $k_{\mathrm{O}}(x)=\bigl{\lfloor}B\bigl{\{}\bar{f}(x)(n-1)\bigr{\}}^{4/(d+4)}\bigr{\rfloor}$ . By (A.2), we therefore have that $k_{\mathrm{O}}\in K_{\beta,\tau}$ for some $\tau=\tau_{n}$ (which does not depend on $P\in\mathcal{P}_{d,\theta}$ or $B\in[B_{*},B^{*}]$ ) with $\tau_{n}\searrow 0$ .
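To make the interior form of $k_{\mathrm{O}}$ stated above concrete, the following is a minimal Python sketch. The function name and its arguments are hypothetical, the value $\bar{f}(x)$ is assumed to be supplied by the caller, and any clipping applied in the full definition of $k_{\mathrm{O}}$ outside $\mathcal{R}_{n}\cap\mathcal{S}^{\epsilon_{0}}$ is deliberately not reproduced.

```python
import math


def k_oracle_interior(f_bar_x: float, n: int, d: int, B: float) -> int:
    """Interior form floor(B * {f_bar(x) * (n - 1)}^{4/(d+4)}).

    f_bar_x : value of the marginal density at the test point (assumed known)
    n       : labelled sample size
    d       : dimension of the feature space
    B       : tuning constant, assumed to lie in [B_*, B^*]
    """
    if f_bar_x < 0 or n < 2 or d < 1 or B <= 0:
        raise ValueError("invalid arguments")
    return math.floor(B * (f_bar_x * (n - 1)) ** (4.0 / (d + 4)))


if __name__ == "__main__":
    # More neighbours are used where the density is higher.
    for dens in (0.01, 0.1, 0.5):
        print(dens, k_oracle_interior(dens, n=1001, d=1, B=1.0))
```

The sketch only illustrates how the number of neighbours grows with the local density; it makes no claim about the behaviour of the classifier in the tails.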

By a similar argument to that in (29), there exists $n_{2}\in\mathbb{N}$ such that for $n\geq n_{2}$ , $P\in\mathcal{P}_{d,\theta}$ , $B\in[B_{*},B^{*}]$ and $x\in(\partial\mathcal{R}_{n})^{\epsilon_{n}}\cap\mathcal{S}^{\epsilon_{n}}$ , we have $\bar{f}(x)\leq 2(n-1)^{-(1-\alpha)}$ . But, by Markov’s inequality and Hölder’s inequality, for $\tilde{\alpha}\in(0,1)$ and any $P\in\mathcal{P}_{d,\theta}$ ,

[TABLE]

Thus, if $\rho>4$ , then we can choose $\alpha\in((1+d/4)\beta,d(\rho-4)/\{\rho(d+4)\})$ and $\tilde{\alpha}<1-4(\rho+d)/\{\rho(1-\alpha)(d+4)\}$ in (6.10) to conclude that

[TABLE]

Moreover, writing

[TABLE]

by very similar arguments to those given in the proof of Theorem 1, $B_{3,n}\rightarrow B_{3}$ and $\gamma_{n}(k_{\mathrm{O}})=O(n^{-4/(d+4)})$ as $n\rightarrow\infty$ , both uniformly for $P\in\mathcal{P}_{d,\theta}$ and $B\in[B_{*},B^{*}]$ . The proof of part (i) therefore follows from Theorem 6.7.

On the other hand, if $\rho\leq 4$ , then choosing both $\tilde{\alpha}>0$ and $\alpha>(1+d/4)\beta$ to be sufficiently small, we find from (6.10) that

[TABLE]

for every $\epsilon>0$ , uniformly for $P\in\mathcal{P}_{d,\theta}$ and $B\in[B_{*},B^{*}]$ . After another application of Theorem 6.7, this proves part (ii).

Proof 6.11 (Proof of Theorem 3).

We prove parts (i) and (ii) of the theorem simultaneously, by appealing to the corresponding arguments in the proof of Theorem 2. First, as in the proof of Theorem 2, for $\alpha\in\bigl{(}(1+d/4)\beta,1\bigr{)}$ , we define $\mathcal{R}_{n}=\{x\in\mathbb{R}^{d}:\bar{f}(x)>(n-1)^{-(1-\alpha)}\}\cap\mathcal{X}_{\bar{f}}$ and introduce the following class of functions: for $\tau>0$ , let

[TABLE]

Let $\tau=\tau_{n}:=2(n-1)^{-\alpha/2}$ . We first show that $\hat{f}_{m}\in\mathcal{F}_{n,\tau}$ with high probability. For $x\in\mathcal{R}_{n}$ ,

[TABLE]

Now

[TABLE]

To bound the first term in (32), by Giné and Guillou (2002, Corollary 2.2), there exist $C,L>0$ , such that

[TABLE]

for all $s\in\Bigl{[}\frac{C\|\bar{f}\|_{\infty}^{1/2}R(K)^{1/2}}{A^{d/2}}\log^{1/2}\Bigl{(}\frac{\|K\|_{\infty}m^{d/(2(d+2\gamma))}}{\|\bar{f}\|_{\infty}^{1/2}A^{d/2}R(K)^{1/2}}\Bigr{)},\frac{C\|\bar{f}\|_{\infty}R(K)m^{\gamma/(d+2\gamma)}}{\|K\|_{\infty}}\Bigr{]}$ and $A\in[A_{*},A^{*}]$ .

Recall that for $P\in\mathcal{P}_{d,\theta}$ , we have $\|\bar{f}\|_{\infty}\leq\lambda$ and $\|\bar{f}\|_{\infty}$ also satisfies the lower bound in (27). Hence, by applying the bound in (33) with $s=s_{0}:=m^{\gamma/(d+2\gamma)}/(n-1)^{1-\alpha/2}$ , since $m\geq m_{0}(n-1)^{d/\gamma+2}$ , we have that there exists $n_{*}\in\mathbb{N}$ , not depending on $P\in\mathcal{P}_{d,\theta}$ or $A\in[A_{*},A^{*}]$ such that for $n\geq n_{*}$ ,

[TABLE]

for all $M>0$ , uniformly for $A\in[A_{*},A^{*}]$ . For the second term in (32), by a Taylor expansion, we have that for all $P\in\mathcal{P}_{d,\theta}\cap\mathcal{Q}_{d,\gamma,\lambda}$ and $A\in[A_{*},A^{*}]$ ,

[TABLE]

It follows that, writing $\tau_{0}:=2(n-1)^{-\alpha/2}$ , we have

[TABLE]

for all $M>0$ .

Now, for $\tilde{f}\in\mathcal{F}_{n,\tau_{0}}$ , let

[TABLE]

Let $c_{n}:=\sup_{x_{0}\in\mathcal{S}:\bar{f}(x_{0})\geq k_{\tilde{f}}(x_{0})/(n-1)}\ell\bigl{(}\bar{f}(x_{0})\bigr{)}$ , and let

[TABLE]

Then there exists $n_{0}\in\mathbb{N}$ such that for $n\geq n_{0}$ and $\tilde{f}\in\mathcal{F}_{n,\tau_{0}}$ , we have $\mathcal{R}_{n}\subseteq\bigl{\{}x\in\mathbb{R}^{d}:\bar{f}(x)\geq\delta_{n,\tilde{f}}(x)\bigr{\}}$ and $k_{\tilde{f}}\in K_{\beta,\tau_{0}}$ . We can therefore apply Theorem 6.7 (similarly to the application in the proof of Theorem 2) to conclude that for every $\epsilon>0$ ,

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}\cap\mathcal{Q}_{d,\gamma,\lambda}$ and $\tilde{f}\in\mathcal{F}_{n,\tau_{0}}$ , where $B_{3,n}$ was defined in the proof of Theorem 2. The proof of both parts (i) and (ii) is now completed by following the relevant steps in the proof of Theorem 2.

Acknowledgements

The authors are grateful to the anonymous reviewers, whose constructive comments helped to improve the paper. We would also like to thank the Isaac Newton Institute for Mathematical Sciences for support and hospitality during the programme ‘Statistical Scalability’ when work on this paper was undertaken. This work was supported by EPSRC grant number EP/R014604/1.

Appendix A The relationship between our classes and the margin assumption

Recall from Mammen and Tsybakov (1999) that a distribution $P$ on $\mathbb{R}^{d}\times\{0,1\}$ with marginal $P_{X}$ on $\mathbb{R}^{d}$ and regression function $\eta$ satisfies a margin assumption with parameter $\alpha>0$ if there exists $C>0$ such that

$$P_{X}\bigl(\{x\in\mathbb{R}^{d}:|\eta(x)-1/2|\leq s\}\bigr)\leq Cs^{\alpha}$$

for all sufficiently small $s>0$ . The following lemma clarifies the relationship between our classes and the margin assumption.

Lemma A.12.

Let $P\in\mathcal{P}_{d,\theta}$ for some $\theta=(\epsilon_{0},M_{0},\rho,\ell,g)\in\Theta$ . Then $P$ satisfies a margin assumption with parameter $\alpha=1$ .

Proof A.13.

By the final part of (A.3), we have

[TABLE]

Now, by Proposition G.17 in Section G.2, for $x\in\mathcal{S}^{\epsilon_{0}}$ , there exists $x_{0}\in\mathcal{S}$ and $t\in(-\epsilon_{0},\epsilon_{0})$ such that $x=x_{0}+t\dot{\eta}(x_{0})/\|\dot{\eta}(x_{0})\|$ . Thus, by a Taylor expansion,

[TABLE]

We deduce as in Step 5 of the proof of Theorem 6.7 that there exists $s_{0}=s_{0}(d,\theta)>0$ such that for all $s\in(0,s_{0}]$ ,

[TABLE]

where the final bound follows from (6.9) in the main text. For the second term in (A.13), we exploit the fact that since $\ell\in\mathcal{L}$ , there exists $A=A(d,\theta)>0$ such that $\ell(\delta)\leq A\delta^{-\frac{\rho}{2(\rho+d)}}$ for all $\delta>0$ . Hence, arguing as in (6.9) in the main text, we find that

[TABLE]

The result follows from (A.13), (A.13) and (A.13).

Appendix B Example 1 from the main text

Recall that we consider the distribution $P$ on $\mathbb{R}^{d}\times\{0,1\}$ for which $\bar{f}(x)=\frac{\Gamma(3+d/2)}{2\pi^{d/2}}(1-\|x\|^{2})^{2}\mathbbm{1}_{\{x\in B_{1}(0)\}}$ and $\eta(x)=\min(\|x\|^{2},1)$ . Since $\bar{f}$ is continuous on all of $\mathbb{R}^{d}$ , it is clear that (A.1) is satisfied.

Now, $\mathcal{S}=\{x\in\mathbb{R}^{d}:\|x\|=2^{-1/2}\}$ and clearly $\mathcal{S}\cap\{x\in\mathbb{R}^{d}:\bar{f}(x)>0\}$ is non-empty. For all $x_{0}\in\mathcal{S}$ we have that $\bar{f}(x_{0})=\frac{\Gamma(3+d/2)}{8\pi^{d/2}}\leq M_{0}$ . Since $\epsilon_{0}\leq 1/10$ we have that $\mathcal{S}^{\epsilon_{0}}\subseteq B_{9/10}(0)\setminus B_{3/5}(0)$ and thus $\bar{f}$ is twice continuously differentiable on $\mathcal{S}^{\epsilon_{0}}$ . Differentiating $\bar{f}$ twice on $B_{1}(0)$ , we have that $\dot{\bar{f}}(x)=-2\pi^{-d/2}\Gamma(3+d/2)(1-\|x\|^{2})x$ and

[TABLE]

Thus, for $x_{0}\in\mathcal{S}$ , we have $\|\dot{\bar{f}}(x_{0})\|/\bar{f}(x_{0})=2^{5/2}\leq\ell(\bar{f}(x_{0}))$ . We also have that, for any $x\in B_{1}(0)$ ,

[TABLE]

so that $\sup_{u\in B_{\epsilon_{0}}(0)}\|\ddot{\bar{f}}(x_{0}+u)\|_{\mathrm{op}}/\bar{f}(x_{0})<48\leq\ell(\bar{f}(x_{0}))$ for any $x_{0}\in\mathcal{S}$ . Finally for (A.2) we consider the cases $x\in B_{1}(0)\setminus B_{\epsilon_{0}}(0)$ and $x\in B_{\epsilon_{0}}(0)$ separately. If $x\in B_{1}(0)\setminus B_{\epsilon_{0}}(0)$ then, for $r\in(0,\epsilon_{0}]$ , at least a proportion $2^{-d}$ of the ball $B_{r}(x)$ is closer to the origin than $x$ , and thus has larger density. This gives us that, for such $x$ and $r$ , $p_{r}(x)\geq 2^{-d}a_{d}r^{d}\bar{f}(x)\geq\epsilon_{0}a_{d}r^{d}\bar{f}(x)$ . When $x\in B_{\epsilon_{0}}(0)$ and $r\in(0,\epsilon_{0}]$ we instead have that

[TABLE]

We now turn to condition (A.3). First, for any $x_{0}\in\mathcal{S}$ we have that $\|\dot{\eta}(x_{0})\|=\|2x_{0}\|=2^{1/2}\geq\epsilon_{0}M_{0}$ . For $x\in\mathcal{S}^{2\epsilon_{0}}$ we have that $\|\dot{\eta}(x)\|\leq 2(2^{-1/2}+2\epsilon_{0})\leq M_{0}$ and $\|\ddot{\eta}(x)\|_{\mathrm{op}}=\|2I\|_{\mathrm{op}}=2\leq M_{0}$ . Since $\ddot{\eta}$ is constant on $\mathcal{S}^{2\epsilon_{0}}$ it is trivially true that

[TABLE]

for any $g\in\mathcal{G}$ . Now for $x\in\mathbb{R}^{d}\setminus\mathcal{S}^{\epsilon_{0}}$ we have that

[TABLE]

Since the support of $\bar{f}$ is equal to $B_{1}(0)$ , we have that $\int_{\mathbb{R}^{d}}\|x\|^{\rho}dP_{X}(x)\leq 1\leq M_{0}$ , so (A.4) is satisfied.
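The closed-form quantities used in the verification above can be sanity-checked numerically. The sketch below (helper names are hypothetical, and the radial midpoint rule is an illustrative choice rather than anything used in the paper) checks that $\bar{f}$ integrates to one, that $\bar{f}(x_{0})=\Gamma(3+d/2)/(8\pi^{d/2})$ on $\mathcal{S}$, and that $\|\dot{\bar{f}}(x_{0})\|/\bar{f}(x_{0})=2^{5/2}$ there.

```python
import math


def density_const(d: int) -> float:
    # Normalising constant Gamma(3 + d/2) / (2 * pi^{d/2}) from Example 1.
    return math.gamma(3 + d / 2) / (2 * math.pi ** (d / 2))


def f_bar(x):
    # Density (1 - ||x||^2)^2 on the open unit ball, zero elsewhere.
    r2 = sum(xi * xi for xi in x)
    return density_const(len(x)) * (1 - r2) ** 2 if r2 < 1 else 0.0


def grad_f_bar(x):
    # Gradient -2 * pi^{-d/2} * Gamma(3 + d/2) * (1 - ||x||^2) * x inside the ball.
    d = len(x)
    r2 = sum(xi * xi for xi in x)
    c = -2 * math.pi ** (-d / 2) * math.gamma(3 + d / 2) * (1 - r2)
    return [c * xi for xi in x]


def total_mass(d: int, steps: int = 20000) -> float:
    # Radial integration: (surface area of unit sphere) * int_0^1 r^{d-1} f(r) dr,
    # with the one-dimensional integral approximated by the midpoint rule.
    area = 2 * math.pi ** (d / 2) / math.gamma(d / 2)
    h = 1.0 / steps
    radial = sum(((i + 0.5) * h) ** (d - 1) * (1 - ((i + 0.5) * h) ** 2) ** 2
                 for i in range(steps)) * h
    return density_const(d) * area * radial


def gradient_ratio_on_S(d: int) -> float:
    # ||grad f_bar(x0)|| / f_bar(x0) at a point x0 with ||x0|| = 2^{-1/2}.
    x0 = [2 ** -0.5] + [0.0] * (d - 1)
    g = grad_f_bar(x0)
    return math.sqrt(sum(gi * gi for gi in g)) / f_bar(x0)
```

For instance, `total_mass(2)` should be close to one and `gradient_ratio_on_S(2)` close to `2 ** 2.5`, up to quadrature and rounding error.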

We finally check (A.5) to show that $P\in\mathcal{Q}_{d,2,\lambda}$ for $\lambda\geq 6\pi^{-d/2}\Gamma(3+d/2)$ . First, it is clear that $\|\bar{f}\|_{\infty}\leq\lambda$ . Now, for any $x,y\in\mathbb{R}^{d}$ we have that

[TABLE]

Appendix C Example 2 from the main text

Proof C.14 (Proof of claim in Example 2).

Fix $\epsilon>0$ and $k\in K_{\beta}$ , let

[TABLE]

and for $\gamma>0$ , let

[TABLE]

Now, for $\epsilon\beta\log{n}>4$ and $\gamma\in[2,\epsilon\log(n/k)/2)$ ,

[TABLE]

where $T\sim\mathrm{Bin}(n,p_{\gamma}^{*})$ , $T^{\prime}\sim\mathrm{Bin}(n,p_{*})$ ,

[TABLE]

Therefore, there exists $n_{0}\in\mathbb{N}$ such that $np_{*}-(k+1)\geq k/2$ and $k+1-np_{\gamma}^{*}\geq k/2$ for all $k\in K_{\beta}$ , $\gamma\in[2,\epsilon\log(n/k)/2)$ and $n\geq n_{0}$ . It follows by Bernstein’s inequality that $\sup_{k\in K_{\beta}}\sup_{\gamma\in[2,\epsilon\log(n/k)/2)}\mathbb{P}(B_{k,\gamma}^{c})=O(n^{-M})$ for every $M>0$ .

Now, for $x=(x_{1},x_{2})\in\mathcal{T}_{n}$ , $\epsilon\beta\log{n}>4$ and $\gamma\in[2,x_{2}-1)$ , we have that

[TABLE]

Our next observation is that for $\gamma\in[0,\infty)$ and $x_{(k+1)}\in\mathbb{R}^{d}$ such that $\|x_{(k+1)}-x\|=\gamma$ , we have that $(X_{(1)},Y_{(1)},\ldots,X_{(k)},Y_{(k)})|(X_{(k+1)}=x_{(k+1)})\!\stackrel{{\scriptstyle d}}{{=}}(\tilde{X}_{(1)},\tilde{Y}_{(1)},\ldots,\tilde{X}_{(k)},\tilde{Y}_{(k)})$ , where the pairs $(\tilde{X}_{1},\tilde{Y}_{1}),\ldots,(\tilde{X}_{k},\tilde{Y}_{k})$ are independent and identically distributed, and then $(\tilde{X}_{(1)},\tilde{Y}_{(1)}),\ldots,(\tilde{X}_{(k)},\tilde{Y}_{(k)})$ is a reordering such that $\|\tilde{X}_{(1)}-x\|\leq\ldots\leq\|\tilde{X}_{(k)}-x\|$ . Here $\tilde{X}_{1}\stackrel{{\scriptstyle d}}{{=}}X|(\|X\!-x\|\leq\gamma)$ and $\mathbb{P}(\tilde{Y}_{1}=1|\tilde{X}_{1}=x)=\eta(x)$ . Writing $\tilde{S}_{n}(x):=\frac{1}{k}\sum_{i=1}^{k}\mathbbm{1}_{\{\tilde{Y}_{i}=1\}}$ we therefore have by Hoeffding’s inequality that, for $x\in\mathcal{T}_{n}$ , $\epsilon\beta\log{n}>4$ and $\|x_{(k+1)}-x\|\in[2,x_{2}-1)$ ,

[TABLE]

for all $M>0$ , uniformly for $k\in K_{\beta}$ . Writing $P_{(k+1)}$ for the marginal distribution of $X_{(k+1)}$ , we deduce that

[TABLE]

for all $M>0$ , uniformly for $k\in K_{\beta}$ . We conclude that for every $M>0$ ,

[TABLE]

uniformly for $k\in K_{\beta}$ . The claim (4) follows from this together with Theorem 1(ii).

Appendix D Proof of Theorem 4

Proof D.15 (Proof of Theorem 4).

For an integer $q\geq 3$ and $\nu\geq 0$ , define a grid on $\mathbb{R}^{d}$ by

[TABLE]

Now, for $x\in\mathbb{R}^{d}$ , let $n_{q}(x)$ be the closest point to $x$ among those in $G_{q,\nu}$ (if there are multiple points, pick the one that is smallest in the lexicographic ordering). Let $m:=\lceil q^{\nu}\rceil^{d}q^{d-1}$ and define closed Euclidean balls $\mathcal{X}_{1},\ldots,\mathcal{X}_{m}$ in $\mathbb{R}^{d}$ of radius $1/(2q)$ , where the $l$ th ball is centered at the $l$ th grid point in the lexicographic ordering.

Writing $[z]$ for the closest integer to $z$ (where we round half-integers to the nearest even integer), define the ‘saw-tooth’ function $\eta_{0}:\mathbb{R}^{d}\rightarrow[3/8,5/8]$ , by $\eta_{0}(x):=3/8+\bigl{|}x_{1}+1/4-[x_{1}+1/4]\bigr{|}/2$ , for $x=(x_{1},\ldots,x_{d})^{T}$ . Further, for $x\in\mathbb{R}^{d}$ , set $u(x):=\frac{\alpha_{0}g^{-1}(1/q)}{q^{2}}\bigl{(}1/4-q^{2}\|x-n_{q}(x)\|^{2}\bigr{)}^{4}$ , where $\alpha_{0}:=1/27$ .
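The saw-tooth function $\eta_{0}$ is easy to evaluate directly, and a short Python sketch may help to visualise it. The function name is hypothetical; Python's built-in `round` rounds half-integers to the nearest even integer, which matches the convention for $[z]$ adopted above. The perturbation $u$ is not implemented, since it depends on $g^{-1}$ and on the grid, neither of which is specified here.

```python
def eta0(x):
    """Saw-tooth regression function eta_0(x) = 3/8 + |x_1 + 1/4 - [x_1 + 1/4]| / 2.

    x is a sequence of coordinates; only the first coordinate x_1 matters.
    """
    z = x[0] + 0.25
    # round() uses round-half-to-even, as in the definition of [z].
    return 0.375 + abs(z - round(z)) / 2.0


if __name__ == "__main__":
    for x1 in (-0.25, 0.0, 0.25, 0.5):
        print(x1, eta0([x1, 0.0]))
```

Evaluating on a grid of $x_{1}$ values shows that $\eta_{0}$ oscillates between $3/8$ and $5/8$ and crosses $1/2$ at the multiples of $1/2$.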

For $\sigma:=(\sigma_{1},\ldots,\sigma_{m})^{T}\in\{-1,1\}^{m}$ , we now define the distribution $P_{\sigma}$ on $\mathbb{R}^{d}\times\{0,1\}$ by setting the regression function to be $\eta_{\sigma}(x):=\eta_{0}(x)+\frac{1}{2}\sigma_{l}u(x)$ , for $x\in\mathcal{X}_{l}$ , $l=1,\ldots,m$ , and setting $\eta_{\sigma}(x):=\eta_{0}(x)$ , otherwise. To define the marginal distribution on $\mathbb{R}^{d}$ induced by $P_{\sigma}$ , which will be the same for each $\sigma$ , we first define the boxes $B_{0}:=(0,\lceil q^{\nu}\rceil+3/2)^{d}$ and $B_{r}:=[-r/2+1/4-a/16,-r/2+1/4+a/16]\times[-a,a]^{d-1}$ for $r=1,\ldots,20$ and some $a>0$ to be chosen later. We further define a modified bump function by

[TABLE]

where $\Phi$ denotes the standard normal distribution function. For $x\in\mathbb{R}^{d}$ we then set

[TABLE]

for some $w_{0}<1/(\lceil q^{\nu}\rceil+2)^{d}$ to be specified later. Here, $a$ in the definition of $B_{r}$ is chosen such that $\int_{\mathbb{R}^{d}}\bar{f}=1$ , and we note that

[TABLE]

so $a\leq(4/5)^{1/d}/2$ .

Let

[TABLE]

We show below that $\mathcal{P}_{m}\subseteq\mathcal{P}_{d,\theta}\cap\mathcal{Q}_{d,2,\lambda}$ for all $\theta\in\Theta$ and $\lambda>0$ satisfying the conditions of the theorem.

Letting $\mathbb{E}_{\sigma}$ denote expectation with respect to $P_{\sigma}^{\otimes n}$ and writing $[[x_{1}]]:=x_{1}-[x_{1}+1/4]$ for $x_{1}\in\mathbb{R}$ , we have that, for any classifier $C_{n}$ ,

[TABLE]

Now let $\sigma_{l,r}:=(\sigma_{1},\ldots,\sigma_{l-1},r,\sigma_{l+1},\ldots,\sigma_{m})$ for $l=1,\ldots,m$ , and $r\in\{-1,0,1\}$ , and define the distribution $P_{l,r}$ on $\mathbb{R}^{d}\times\{0,1\}$ by $\eta_{l,r}(x):=\eta_{0}(x)+(1/2)ru(x)$ , for $x\in\mathcal{X}_{l}$ and $\eta_{l,r}(x)=\eta_{\sigma_{l,r}}(x):=\eta_{\sigma}(x)$ otherwise (the marginal distribution on $\mathbb{R}^{d}$ is again taken to be $P_{X}$ ). We write $\mathbb{E}_{l,r}$ to denote expectation with respect to $P_{l,r}^{\otimes n}$ .

For $l=1,\ldots,m$ and $r\in\{-1,1\}$ define

[TABLE]

By the Radon–Nikodym theorem, we have that

[TABLE]

Now fix $x=(x_{1},\ldots,x_{d})^{T}\in\mathcal{X}_{l}$ , and writing $C_{n}=C_{n}(x),\eta_{l,1}=\eta_{l,1}(x)$ and $\eta_{l,-1}=\eta_{l,-1}(x)$ as shorthand, observe that

[TABLE]

Here we used the fact that $\eta_{l,1}(x)\geq\eta_{l,-1}(x)$ , so $\mathbbm{1}_{\{\eta_{l,1}(x)<1/2,\eta_{l,-1}(x)\geq 1/2\}}=0$ , and that the minimum is attained by taking $C_{n}(x)=\mathbbm{1}_{\{[[x_{1}]]\geq 0\}}$ for $x\in\mathcal{X}_{l}$ ; it is interesting to note that this remains the optimal classifier even if $\bar{f}$ is known. Moreover, whenever $[[x_{1}]]\geq 0$ , we have $\eta_{l,1}(x)\geq 1/2$ , and when $[[x_{1}]]<0$ , we have $\eta_{l,-1}(x)<1/2$ . It follows that

[TABLE]

where $\tilde{u}(x):=\alpha_{0}g^{-1}(1/q)q^{6}(\frac{1}{4q^{2}}-\|x\|^{2})^{4}$ and $\tilde{\eta}(x):=\frac{1}{2}\{1+x_{1}-\tilde{u}(x)\}$ .

Now, observe that

[TABLE]

and

[TABLE]

Moreover, using the fact that $\log(1+x)\leq x$ for $x\geq 0$ , we have that

[TABLE]

We now turn to finding a lower bound for the integral in (D.15). First, we observe that $\mathrm{sgn}\bigl{(}\tilde{u}(x)-x_{1}\bigr{)}=\mathrm{sgn}\bigl{(}1/2-\tilde{\eta}(x)\bigr{)}$ , and moreover for $d=1$ and $0\leq x_{1}<\frac{\alpha_{0}g^{-1}(1/q)}{2^{13}q^{2}}$ , we have that

[TABLE]

Thus

[TABLE]

Furthermore, for $d\geq 2$ , writing $x_{-1}:=(x_{2},\ldots,x_{d})^{T}$ , we have that $\tilde{\eta}(x)<1/2$ if and only if

[TABLE]

which is satisfied if

[TABLE]

Now $t(x_{1})$ is real if $0\leq x_{1}\leq\frac{\alpha_{0}g^{-1}(1/q)}{2^{14}q^{2}}$ . Moreover, $t(x_{1})>1/(8q)$ for $x_{1}\in\bigl{[}0,\frac{\alpha_{0}g^{-1}(1/q)}{2^{14}q^{2}}\bigr{]}$ . We also require the observation that $\tilde{u}(x)-x_{1}\geq\frac{\alpha_{0}g^{-1}(1/q)}{2^{14}q^{2}}$ when $x_{1}\in\bigl{[}0,\frac{\alpha_{0}g^{-1}(1/q)}{2^{14}q^{2}}\bigr{]}$ and $\|x_{-1}\|<t(x_{1})$ . Hence

[TABLE]

We have therefore shown that, for $q\geq 3$ ,

[TABLE]

where $a_{0}:=1$ . It follows that if we set

[TABLE]

and choose $q$ to satisfy $\frac{q^{4+d+\nu(\rho+d)}}{g^{-1}(1/q)^{2}}=n$ , then

[TABLE]

It remains to show that $P_{\sigma}$ belongs to the desired classes $\mathcal{P}_{d,\theta}\cap\mathcal{Q}_{d,2,\lambda}$ for each $\sigma$ . First note that

[TABLE]

Condition (A.1) is satisfied by $\bar{f}$ by construction. To verify the minimal mass assumption, we take $\epsilon_{*}<2^{-\max(d,5)}$ , and observe that when $\epsilon_{0}\in(0,\epsilon_{*}]$ ,

[TABLE]

as required. It follows that (A.2) is satisfied for such $\epsilon_{0}\in(0,\epsilon_{*}]$ and for any $M_{0}\geq 1$ .

The main condition to check is (A.3). For $x\in B_{1/(2q)}(0)$ , consider

[TABLE]

Then

[TABLE]

and

[TABLE]

From these calculations, we see that each $\eta_{\sigma}$ is twice continuously differentiable on $\mathcal{S}^{2\epsilon_{0}}$ , with $\|\dot{\eta}_{\sigma}(x)\|\in(1/4,3/4)$ for all $x\in\mathcal{S}^{2\epsilon_{0}}$ and $\|\ddot{\eta}_{\sigma}(x)\|_{\mathrm{op}}\leq 1$ . We have that, when $n_{q}(z)=n_{q}(x)$ ,

[TABLE]

Hence, using the fact that $r\mapsto r/g^{-1}(r)$ is increasing for sufficiently small $r>0$ , we have that for sufficiently large $q$ ,

[TABLE]

Now consider the case where $z\in\mathcal{X}_{l}$ and $x\in\mathcal{X}_{l^{\prime}}$ with $l\neq l^{\prime}$ , so that $n_{q}(z)\neq n_{q}(x)$ . Let $z^{\prime}$ denote the closest point in $\mathcal{X}_{l}$ to $\mathcal{X}_{l^{\prime}}$ on the line segment joining $x$ to $z$ , and similarly let $x^{\prime}$ denote the closest point in $\mathcal{X}_{l^{\prime}}$ to $\mathcal{X}_{l}$ on the same line segment. Then $\ddot{\eta}_{\sigma}(x^{\prime})=\ddot{\eta}_{\sigma}(z^{\prime})=0$ , so, by (D.15),

[TABLE]

We therefore deduce that

[TABLE]

For the final part of (A.3), we note that

[TABLE]

Finally, we check the moment condition in (A.4). First,

[TABLE]

say. We conclude that there exists $q_{*}=q_{*}(d)$ such that for $q\geq q_{*}$ and any $\nu\geq 0$ , we have $P\in\mathcal{P}_{d,\theta}$ for $\theta=(\epsilon_{0},M_{0},\rho,\ell,g)$ with any $\rho>0$ , $M_{0}\geq\max(M_{01}(\rho),1)$ , $\epsilon_{0}\in(0,\min(2^{-\max(d,5)},1/(4M_{0})))$ , any $\ell\in\mathcal{L}$ with $\ell\geq 2/\epsilon_{0}$ and any $g\in\mathcal{G}$ .

Finally, we note that $\|\bar{f}\|_{\infty}\leq 1$ and

[TABLE]

Hence $P\in\mathcal{Q}_{d,2,\lambda}$ for $\lambda\geq 2^{10}\times 5$ .

Appendix E Proof of Theorem 6.7 (continued)

Proof E.16 (Proof of Theorem 6.7 – Step 7).

To complete the proof of Theorem 6.7, it remains to bound the error terms $R_{1},R_{2},R_{5}$ and $R_{6}$ .

To bound $R_{1}$ : We have

[TABLE]

By a Taylor expansion and (A.3), for all $\epsilon\in(0,1)$ , $x\in\mathcal{S}^{\epsilon_{0}}$ and $\|z-x\|<\min\{g(\epsilon),\epsilon_{0}\}=:r$ ,

[TABLE]

Hence

[TABLE]

Now, by similar arguments to those leading to (6.8), we have that

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ , $x_{0}\in\mathcal{S}_{n}$ and $|t|<\epsilon_{n}$ . Moreover, for every $M>0$ ,

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ , $x_{0}\in\mathcal{S}_{n}$ and $|t|<\epsilon_{n}$ , by (16) in Step 1. For the remaining terms, note that

[TABLE]

Let $t_{0}=t_{0}(x):=5^{2/\rho}(1+2^{\rho-1})^{2/\rho}\bigl{(}M_{0}+\|x\|^{\rho}\bigr{)}^{2/\rho}$ . Then, for $t\geq t_{0}$ , we have

[TABLE]

It follows by Bennett’s inequality that for $\rho\{n-(n-1)^{1-\beta}\}>4$ ,

[TABLE]

But, when $\beta\log(n-1)\geq(d+2)/d$ and $n\geq\max\{n_{0},n_{2}\}$ ,

[TABLE]

We deduce that for every $M>0$ ,

[TABLE]

Moreover, by Bernstein’s inequality, for every $M>0$ ,

[TABLE]

We conclude from (6.8), (E.16), (40), (41), (E.16), (43) and (44), together with Jensen’s inequality to deal with the third term on the right-hand side of (E.16), that (10) holds. With only simple modifications, we have also shown (13), which bounds $R_{2}$ .

To bound $R_{5}$ : Write

[TABLE]

Now by a non-uniform version of the Berry–Esseen theorem (Paditz, 1989, Theorem 1), for every $t\in(-\epsilon_{n},\epsilon_{n})$ and $x_{0}\in\mathcal{S}_{n}$ ,

[TABLE]

Let

[TABLE]

where

[TABLE]

In the following we integrate the bound in (45) over the regions $|t|\leq t_{n}$ and $|t|\in(t_{n},\epsilon_{n})$ separately. Define the event

[TABLE]

so that, by very similar arguments to those used to bound $\mathbb{P}(A_{k_{\mathrm{L}}}^{c})$ in Step 2, we have $\mathbb{P}(B_{k_{\mathrm{L}}}^{c})=O(n^{-M})$ for every $M>0$ , uniformly for $P\in\mathcal{P}_{d,\theta}$ and $k_{\mathrm{L}}\in K_{\beta,\tau}$ . It follows by (45) and Step 2 that there exists $n_{4}\in\mathbb{N}$ such that for all $n\geq n_{4}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ and $x_{0}\in\mathcal{S}_{n}$ ,

[TABLE]

By Step 1, there exists $n_{5}\in\mathbb{N}$ such that for $n\geq n_{5}$ , $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ , $x_{0}\in\mathcal{S}_{n}$ and $|t|\in(t_{n},\epsilon_{n})$ ,

[TABLE]

Thus for $n\geq n_{5}$ , $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ , $x_{0}\in\mathcal{S}_{n}$ and $|t|\in(t_{n},\epsilon_{n})$ , we have that

[TABLE]

It follows by (45), (E.16) and Step 3 that, for $n\geq n_{5}$ ,

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ and $x_{0}\in\mathcal{S}_{n}$ . We conclude from (E.16) and (E.16) that $|R_{5}|=o(\gamma_{n}(k_{\mathrm{L}}))$ , uniformly for $P\in\mathcal{P}_{d,\theta}$ and $k_{\mathrm{L}}\in K_{\beta,\tau}$ .

To bound $R_{6}$ : Let $\theta(x_{0}^{t}):=-2k_{\mathrm{L}}(x_{0}^{t})^{1/2}\{\mu_{n}(x_{0}^{t})-1/2\}$ . Write

[TABLE]

where

[TABLE]

and

[TABLE]

To bound $R_{61}$ : We again deal with the regions $|t|\leq t_{n}$ and $|t|\in(t_{n},\epsilon_{n})$ separately. First let $\tilde{\theta}(x_{0}^{t}):=-2k_{\mathrm{L}}(x_{0}^{t})^{1/2}\{\hat{\mu}_{n}(x_{0}^{t},X^{n})-1/2\}$ . Writing $\phi$ for the standard normal density, and using the facts that $|\hat{\theta}(x_{0}^{t})|\geq|\tilde{\theta}(x_{0}^{t})|$ , that $\hat{\theta}(x_{0}^{t})$ and $\tilde{\theta}(x_{0}^{t})$ have the same sign, and that $|x\phi(x)|\leq 1$ , we have

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ and $x_{0}\in\mathcal{S}_{n}$ . Note that for $|t|\in(t_{n},\epsilon_{n})$ and $x_{0}\in\mathcal{S}_{n}$ , we have when $\epsilon_{n}<\epsilon_{0}$ and $n\geq n_{5}$ that

[TABLE]

Thus by (E.16), (E.16), (E.16) and Step 3, for $\epsilon_{n}<\epsilon_{0}$ and $n\geq n_{5}$ ,

[TABLE]

uniformly for $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ and $x_{0}\in\mathcal{S}_{n}$ .

To bound $R_{62}$ : Let

[TABLE]

Given $\epsilon>0$ small enough that $\epsilon^{2}+\frac{\epsilon}{2\epsilon_{0}}<1/2$ , by Step 1 there exists $n_{6}\in\mathbb{N}$ such that for $n\geq n_{6}$ , $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ , $x_{0}\in\mathcal{S}_{n}$ and $|t|<\epsilon_{n}$ ,

[TABLE]

By decreasing $\epsilon$ and increasing $n_{6}$ if necessary, it follows that

[TABLE]

for all $n\geq n_{6}$ , $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ , $x_{0}\in\mathcal{S}_{n}$ and $t\in(-\epsilon_{n},\epsilon_{n})$ satisfying $2\epsilon u(x_{0})\ell\bigl{(}\bar{f}(x_{0})\bigr{)}\|\dot{\eta}(x_{0})\|\leq|\bar{\theta}(x_{0},t)|$ . Substituting $u=\bar{\theta}(x_{0},t)/2$ , it follows that there exists $C^{*}>0$ such that for all $n\geq n_{6}$ , $P\in\mathcal{P}_{d,\theta}$ and $k_{\mathrm{L}}\in K_{\beta,\tau}$ ,

[TABLE]

The combination of (E.16) and (E.16) yields the desired error bound on $|R_{6}|$ in (26), uniformly for $P\in\mathcal{P}_{d,\theta}$ , $k_{\mathrm{L}}\in K_{\beta,\tau}$ , and therefore completes the proof.

Appendix F Empirical analysis

In this section, we compare the $k_{\mathrm{O}}$ nn and $k_{\mathrm{SS}}$ nn classifiers, introduced in Section 4 of the main text, with the standard $k$ nn classifier studied in Section 3 of the main text. We investigate three settings that reflect the differences between the main results in these sections.

• Setting 1: $P_{1}$ is the distribution of $d$ independent $N(0,1)$ components; whereas $P_{0}$ is the distribution of $d$ independent $N(1,1/4)$ components.

• Setting 2: $P_{1}$ is the distribution of $d$ independent $t_{5}$ components; $P_{0}$ is the distribution of $d$ independent components, the first $\lfloor d/2\rfloor$ having a $t_{5}$ distribution and the remainder having a $N(1,1)$ distribution.

• Setting 3: $P_{1}$ is the distribution of $d$ independent standard Cauchy components; $P_{0}$ is the distribution of $d$ independent components, the first $\lfloor d/2\rfloor$ being standard Cauchy and the remainder standard normal.
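For readers who wish to reproduce data of this kind, the three settings can be simulated with the Python standard library as sketched below. Several choices are assumptions rather than details taken from the text: the class prior is set to $1/2$, the second parameter of $N(1,1/4)$ is read as a variance (standard deviation $1/2$), and all function names are hypothetical.

```python
import math
import random


def t5(rng):
    # Student t with 5 degrees of freedom: Z / sqrt(V / 5), where
    # V ~ chi-squared(5), i.e. Gamma(shape=5/2, scale=2).
    return rng.gauss(0.0, 1.0) / math.sqrt(rng.gammavariate(2.5, 2.0) / 5.0)


def cauchy(rng):
    # Standard Cauchy via the inverse distribution function.
    return math.tan(math.pi * (rng.random() - 0.5))


def sample_pair(setting, d, rng, prior1=0.5):
    """Draw one (x, y) pair; y = 1 means x ~ P_1 and y = 0 means x ~ P_0.

    prior1 = 0.5 is an assumption, not a value stated in the text.
    """
    y = 1 if rng.random() < prior1 else 0
    half = d // 2
    if setting == 1:
        # N(1, 1/4) is interpreted as mean 1, variance 1/4 (assumption).
        x = [rng.gauss(0.0, 1.0) if y == 1 else rng.gauss(1.0, 0.5)
             for _ in range(d)]
    elif setting == 2:
        if y == 1:
            x = [t5(rng) for _ in range(d)]
        else:
            x = ([t5(rng) for _ in range(half)]
                 + [rng.gauss(1.0, 1.0) for _ in range(d - half)])
    elif setting == 3:
        if y == 1:
            x = [cauchy(rng) for _ in range(d)]
        else:
            x = ([cauchy(rng) for _ in range(half)]
                 + [rng.gauss(0.0, 1.0) for _ in range(d - half)])
    else:
        raise ValueError("setting must be 1, 2 or 3")
    return x, y


def sample_dataset(setting, d, n, seed=0):
    rng = random.Random(seed)
    return [sample_pair(setting, d, rng) for _ in range(n)]
```

A call such as `sample_dataset(2, d=5, n=200)` returns a list of 200 labelled pairs; fixing `seed` makes the draw reproducible.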

The corresponding marginal distribution $P_{X}$ in Setting 1 satisfies (A.4) for every $\rho>0$ . Hence, for the standard $k$ -nearest neighbour classifier when $d\geq 5$ , we are in the setting of Theorem 1(i), while for $d\leq 4$ , we can only appeal to Theorem 1(ii). On the other hand, for the local- $k$ -nearest neighbour classifiers, the results of Theorems 2(i) and 3(i) apply for all dimensions, and we can expect the excess risk to converge to zero at rate $O(n^{-4/(d+4)})$ . In Setting 2, (A.4) holds for $\rho<5$ , but not for $\rho\geq 5$ . Thus, for the standard $k$ -nearest neighbour classifier, we are in the setting of Theorem 1(ii) for $d<20$ , whereas Theorems 2(i) and 3(i) again apply for all dimensions for the local classifiers. Finally, in Setting 3, (A.4) does not hold for any $\rho\geq 1$ , and only the conditions of Theorems 1(ii), 2(ii) and 3(ii) apply.
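The case distinctions in the preceding paragraph all come down to whether $\rho/(\rho+d)>4/d$ holds for some admissible $\rho$. A small helper (hypothetical name) makes this easy to check for a given pair $(d,\rho)$; the inequality is cross-multiplied to avoid floating-point division.

```python
def theorem1_part_i_condition(d: int, rho: float) -> bool:
    """Return True when rho / (rho + d) > 4 / d, the tail condition quoted
    in the proof of Theorem 1(i); otherwise the complementary condition
    4 / d >= rho / (rho + d) of part (ii) holds.
    """
    if d < 1 or rho <= 0:
        raise ValueError("d must be a positive integer and rho positive")
    # Equivalent to rho/(rho + d) > 4/d because d > 0 and rho + d > 0.
    return rho * d > 4 * (rho + d)


if __name__ == "__main__":
    # Setting 1 has all moments, so a large rho may be used.
    print([d for d in range(1, 9) if theorem1_part_i_condition(d, 1e6)])
    # Setting 2 only admits rho < 5.
    print(theorem1_part_i_condition(5, 4.99))
```

With a very large $\rho$, as permitted in Setting 1, the condition holds exactly when $d\geq 5$, whereas with $\rho<5$, as in Setting 2, it fails in the dimensions $d\in\{1,2,5\}$ used in the experiments.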

For the standard $k$ nn classifier, we use 5-fold cross validation to choose $k$ , based on a sequence of equally-spaced values between 1 and $\lfloor n/4\rfloor$ of length at most 40. For the oracle classifier, we set

[TABLE]

where $\hat{B}_{\mathrm{O}}$ was again chosen via 5-fold cross validation, but based on a sequence of 40 equally-spaced points between $n^{-4/(d+4)}$ (corresponding to the 1-nearest neighbour classifier) and $n^{d/(d+4)}$ . Similarly, for the semi-supervised classifier, we set

[TABLE]

where $\hat{B}_{\mathrm{SS}}$ was chosen analogously to $\hat{B}_{\mathrm{O}}$ , and where $\hat{f}_{m}$ is the $d$ -dimensional kernel density estimator constructed using a truncated normal kernel and bandwidths chosen via the default method in the R package ks (Duong, 2015). In practice, we estimated $\|\hat{f}_{m}\|_{\infty}$ by the maximum value attained on the unlabelled training set.
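The candidate grids described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the $k$-grid is obtained by rounding equally-spaced points to integers and removing duplicates (one way of obtaining a sequence of length at most 40), and the cross-validation loop itself is omitted.

```python
def linspace(lo, hi, num):
    # Equally-spaced points from lo to hi inclusive.
    if num == 1:
        return [float(lo)]
    step = (hi - lo) / (num - 1)
    return [lo + i * step for i in range(num)]


def k_grid(n, max_len=40):
    """Candidate values of k between 1 and floor(n / 4), at most max_len of them."""
    upper = max(1, n // 4)
    # Rounding to integers and de-duplicating is an assumption about how the
    # 'at most 40' values were obtained.
    return sorted({round(v) for v in linspace(1, upper, max_len)})


def b_grid(n, d, num=40):
    """Candidate values of B between n^{-4/(d+4)} and n^{d/(d+4)}."""
    return linspace(n ** (-4.0 / (d + 4)), n ** (d / (d + 4.0)), num)


if __name__ == "__main__":
    print(k_grid(50))
    print(b_grid(200, 2)[:3])
```

Multiplying the endpoints of the $B$-grid by $n^{4/(d+4)}$ gives $1$ and $n$ respectively, which is consistent with the remark that the smallest value corresponds to the 1-nearest neighbour classifier.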

In each of the three settings above, we generated a training set of size $n\in\{50,200,1000\}$ in dimensions $d\in\{1,2,5\}$ , an unlabelled training set of size 1000, and a test set of size 1000. In Table 1, we present the sample mean and standard error (in subscript) of the risks computed from 1000 repetitions of each experiment. Further, we present estimates of the regret ratios, given by

[TABLE]

for which the standard errors given are estimated via the delta method. From Table 1, we see an improvement in performance from the oracle and semi-supervised classifiers in 22 of the 27 experiments and comparable performance in three, while in the remaining two the standard $k$nn classifier was the best of the three classifiers considered. In those latter two cases, the theoretical improvement expected for the local classifiers is small; for instance, when $d=5$ in Setting 2, the excess risk for the local classifiers converges at rate $O(n^{-4/9})$, while the standard $k$-nearest neighbour classifier can attain a rate at least as fast as $o(n^{-1/3+\epsilon})$ for every $\epsilon>0$. It is therefore perhaps unsurprising that we require the larger sample size of $n=1000$ for the local classifiers to yield an improvement in this case. The semi-supervised classifier exhibits similar performance to the oracle classifier in all settings, though some deterioration is noticeable in higher dimensions, where it is harder to construct a good estimate of $\bar{f}$ from the unlabelled training data.
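The two exponents compared above can be reproduced with exact rational arithmetic. In the sketch below, the local exponent $4/(d+4)$ is taken from the text, whereas the global exponent $\rho/(2\rho+d)$ is an assumption: it is what one obtains by balancing a $1/k$ term against a $(k/n)^{\rho/(\rho+d)}$ term, and it is included only because it recovers the value $1/3$ quoted for $d=5$ when $\rho$ approaches $5$. Function names are hypothetical.

```python
from fractions import Fraction


def local_rate_exponent(d: int) -> Fraction:
    # Excess risk of the local classifiers is O(n^{-4/(d+4)}).
    return Fraction(4, d + 4)


def global_rate_exponent_heavy_tail(d: int, rho: int) -> Fraction:
    # Assumed exponent rho / (2 * rho + d) for the standard k-nn classifier
    # in the heavy-tailed regime (not stated in this form in the text).
    return Fraction(rho, 2 * rho + d)


if __name__ == "__main__":
    d = 5
    print(local_rate_exponent(d), global_rate_exponent_heavy_tail(d, 5))
```

The gap between the two exponents is modest in this case, which fits the observation that a larger sample is needed before the local classifiers show an advantage.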

Appendix G An introduction to differential geometry, tubular neighbourhoods and integration on manifolds

The purpose of this section is to give a brief introduction to the ideas from differential geometry, specifically tubular neighbourhoods and integration on manifolds, which play an important role in our analysis of misclassification error rates, but which we expect are unfamiliar to many statisticians. For further details and several of the proofs, we refer the reader to the many excellent texts on these topics, e.g. Guillemin and Pollack (1974), Gray (2004).

G.1 Manifolds and regular values

Recall that if $\mathcal{X}$ is an arbitrary subset of $\mathbb{R}^{M}$ , we say $\phi:\mathcal{X}\rightarrow\mathbb{R}^{N}$ is differentiable if for each $x\in\mathcal{X}$ , there exists an open subset $U\subseteq\mathbb{R}^{M}$ containing $x$ and a differentiable function $F:U\rightarrow\mathbb{R}^{N}$ such that $F(z)=\phi(z)$ for $z\in U\cap\mathcal{X}$ . If $\mathcal{Y}$ is also a subset of $\mathbb{R}^{M}$ , we say $\phi:\mathcal{X}\rightarrow\mathcal{Y}$ is a diffeomorphism if $\phi$ is bijective and differentiable and if its inverse $\phi^{-1}$ is also differentiable. We then say $\mathcal{S}\subseteq\mathbb{R}^{d}$ is an $m$ -dimensional manifold if for each $x\in\mathcal{S}$ , there exist an open subset $U_{x}\subseteq\mathbb{R}^{m}$ , a neighbourhood $V_{x}$ of $x$ in $\mathcal{S}$ and a diffeomorphism $\phi_{x}:U_{x}\rightarrow V_{x}$ . Such a diffeomorphism $\phi_{x}$ is called a local parametrisation of $\mathcal{S}$ around $x$ , and we sometimes suppress the dependence of $\phi_{x},U_{x}$ and $V_{x}$ on $x$ . It turns out that the specific choice of local parametrisation is usually not important, and properties of the manifold are well-defined regardless of the choice made.

Let $\mathcal{S}\subseteq\mathbb{R}^{d}$ be an $m$ -dimensional manifold and let $\phi:U\rightarrow\mathcal{S}$ be a local parametrisation of $\mathcal{S}$ around $x\in\mathcal{S}$ , where $U$ is an open subset of $\mathbb{R}^{m}$ . Assume that $\phi(0)=x$ for convenience. The tangent space $T_{x}(\mathcal{S})$ to $\mathcal{S}$ at $x$ is defined to be the image of the derivative $D\phi_{0}:\mathbb{R}^{m}\rightarrow\mathbb{R}^{d}$ of $\phi$ at $0$ . Thus $T_{x}(\mathcal{S})$ is the $m$ -dimensional subspace of $\mathbb{R}^{d}$ whose parallel translate $x+T_{x}(\mathcal{S})$ is the best affine approximation to $\mathcal{S}$ through $x$ , and $(D\phi_{0})^{-1}$ is well-defined as a map from $T_{x}(\mathcal{S})$ to $\mathbb{R}^{m}$ . If $f:\mathcal{S}\rightarrow\mathbb{R}$ is differentiable, we define the derivative $Df_{x}:T_{x}(\mathcal{S})\rightarrow\mathbb{R}$ of $f$ at $x$ by $Df_{x}:=Dh_{0}\circ(D\phi_{0})^{-1}$ , where $h:=f\circ\phi$ .
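As a concrete illustration of these definitions, the following sketch takes $\mathcal{S}$ to be the unit circle in $\mathbb{R}^{2}$ with local parametrisation $\phi(u)=(\cos u,\sin u)$ around $x=(1,0)$ , and $f(x)=x_{1}+3x_{2}$ . These choices, the finite-difference step and all names in the code are assumptions made for the example; the derivatives are approximated numerically rather than computed exactly.

```python
import numpy as np


def phi(u):
    """Local parametrisation of the unit circle around x = (1, 0), with phi(0) = x."""
    return np.array([np.cos(u), np.sin(u)])


def f(x):
    """An illustrative differentiable function on the circle."""
    return x[0] + 3.0 * x[1]


step = 1e-6
# D phi_0 as a d x m matrix (here 2 x 1), by central differences.
Dphi0 = ((phi(step) - phi(-step)) / (2 * step)).reshape(2, 1)
# D h_0 for h = f o phi, again by central differences.
Dh0 = (f(phi(step)) - f(phi(-step))) / (2 * step)


def Df_x(v):
    """Df_x(v) = Dh_0((D phi_0)^{-1} v) for v in the tangent space T_x(S)."""
    u = np.linalg.pinv(Dphi0) @ v   # inverts D phi_0 on its image
    return Dh0 * u[0]


print(Dphi0.ravel())                # approximately (0, 1): the tangent direction at x
print(Df_x(np.array([0.0, 2.0])))   # approximately 6
```

The tangent space at $(1,0)$ is spanned by $(0,1)$ , and $Df_{x}$ agrees there with the directional derivative of the ambient function $x\mapsto x_{1}+3x_{2}$ , as one would expect.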

In practice, it is usually rather inefficient to define manifolds through explicit diffeomorphisms. Instead, we can often obtain them as level sets of differentiable functions. Suppose that $\mathcal{R}\subseteq\mathbb{R}^{d}$ is a manifold and $\eta:\mathcal{R}\rightarrow\mathbb{R}$ is differentiable. We say $y\in\mathbb{R}$ is a regular value for $\eta$ if $\mathrm{image}(D\eta_{x})=\mathbb{R}$ for every $x\in\mathcal{R}$ for which $\eta(x)=y$ . If $y\in\mathbb{R}$ is a regular value of $\eta$ , then $\eta^{-1}(y)$ is a $(d-1)$ -dimensional submanifold of $\mathcal{R}$ (Guillemin and Pollack, 1974, p. 21).

G.2 Tubular neighbourhoods of level sets

For any set $\mathcal{S}\subseteq\mathbb{R}^{d}$ and $\epsilon>0$ , we call $\mathcal{S}^{\epsilon}:=\mathcal{S}+\epsilon B_{1}(0)$ the $\epsilon$ -neighbourhood of $\mathcal{S}$ . In circumstances where $\mathcal{S}$ is a $(d-1)$ -dimensional manifold defined by the level set of a continuously differentiable function $\eta:\mathbb{R}^{d}\rightarrow\mathbb{R}$ with non-vanishing derivative on $\mathcal{S}$ , the set $\mathcal{S}^{\epsilon}$ is often called a tubular neighbourhood, and $\dot{\eta}(x)^{T}v=0$ for all $x\in\mathcal{S}$ and $v\in T_{x}(\mathcal{S})$ . We therefore have the following useful representation of the $\epsilon$ -neighbourhood of $\mathcal{S}$ in terms of points on $\mathcal{S}$ and a perturbation in a normal direction.

Proposition G.17.

Let $\eta:\mathbb{R}^{d}\rightarrow[0,1]$ , suppose that $\mathcal{S}:=\{x\in\mathbb{R}^{d}:\eta(x)=1/2\}$ is non-empty, and suppose further that $\eta$ is continuously differentiable on $\mathcal{S}+\epsilon B_{1}(0)$ for some $\epsilon>0$ , with $\dot{\eta}(x)\neq 0$ for all $x\in\mathcal{S}$ , so that $\mathcal{S}$ is a $(d-1)$ -dimensional manifold. Then

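$$\mathcal{S}+\epsilon B_{1}(0)=\biggl\{x_{0}+\frac{t\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}:x_{0}\in\mathcal{S},\ |t|<\epsilon\biggr\}.$$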

Proof G.18.

For any $x_{0}\in\mathcal{S}$ and $|t|<\epsilon$ , we have $x_{0}+t\dot{\eta}(x_{0})/\|\dot{\eta}(x_{0})\|\in\mathcal{S}+\epsilon B_{1}(0)$ . On the other hand, suppose that $x\in\mathcal{S}+\epsilon B_{1}(0)$ . Since $\mathcal{S}$ is closed, there exists $x_{0}\in\mathcal{S}$ such that $\|x-x_{0}\|\leq\|x-y\|$ for all $y\in\mathcal{S}$ . Rearranging this inequality yields that, for $y\neq x_{0}$ ,

$$(x-x_{0})^{T}\frac{(y-x_{0})}{\|y-x_{0}\|}\leq\frac{1}{2}\|y-x_{0}\|. \tag{54}$$

Let $U$ be an open subset of $\mathbb{R}^{d-1}$ and $\phi:U\rightarrow\mathcal{S}$ be a local parametrisation of $\mathcal{S}$ around $x_{0}$ , where without loss of generality we assume $\phi(0)=x_{0}$ . Let $v\in T_{x_{0}}(\mathcal{S})\setminus\{0\}$ be given and let $h\in\mathbb{R}^{d-1}\setminus\{0\}$ be such that $D\phi_{0}(h)=v$ . Then for $t>0$ sufficiently small we have $th\in U$ , so by (54),

$$(x-x_{0})^{T}\frac{\phi(th)-\phi(0)}{\|\phi(th)-\phi(0)\|}\leq\frac{1}{2}\|\phi(th)-\phi(0)\|.$$

Letting $t\searrow 0$ we see that $(x-x_{0})^{T}v\leq 0$ . Since $v\in T_{x_{0}}(\mathcal{S})\setminus\{0\}$ was arbitrary and $-v\in T_{x_{0}}(\mathcal{S})\setminus\{0\}$ , we therefore have that $(x-x_{0})^{T}v=0$ for all $v\in T_{x_{0}}(\mathcal{S})$ . Moreover, $\dot{\eta}(x_{0})^{T}v=0$ for all $v\in T_{x_{0}}(\mathcal{S})$ , so $x-x_{0}\propto\dot{\eta}(x_{0})$ , which yields the result.
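The key step of this proof, that the vector from the nearest point $x_{0}\in\mathcal{S}$ to $x$ is proportional to $\dot{\eta}(x_{0})$ , can be checked numerically. The sketch below assumes a particular $\eta$ whose $1/2$ -level set is an ellipse, and finds the nearest point by a brute-force search over a fine grid; the function, the test point and all names are illustrative choices, and the agreement is only up to the grid resolution.

```python
import numpy as np

A_AXIS, B_AXIS = 2.0, 1.0   # semi-axes of the illustrative ellipse


def eta_grad(x):
    """Gradient of eta(x) = 1/2 + (x1^2/a^2 + x2^2/b^2 - 1)/4 (an assumed example)."""
    return np.array([x[0] / (2 * A_AXIS**2), x[1] / (2 * B_AXIS**2)])


def nearest_point_on_S(x, n_grid=200_000):
    """Brute-force nearest point on the ellipse S = {eta = 1/2}, up to grid resolution."""
    theta = np.linspace(0.0, 2 * np.pi, n_grid, endpoint=False)
    pts = np.column_stack([A_AXIS * np.cos(theta), B_AXIS * np.sin(theta)])
    return pts[np.argmin(np.linalg.norm(pts - x, axis=1))]


x = np.array([1.2, 1.0])                       # a point close to S
x0 = nearest_point_on_S(x)
normal = eta_grad(x0) / np.linalg.norm(eta_grad(x0))
residual = x - x0

# In two dimensions the cross product vanishes iff residual is parallel to normal.
misalignment = abs(residual[0] * normal[1] - residual[1] * normal[0]) / np.linalg.norm(residual)
t = residual @ normal                          # signed distance along the normal

print(misalignment)                            # close to zero
print(x0 + t * normal, x)                      # the two points nearly coincide
```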

In fact, under a slightly stronger condition on $\eta$ , we have the following useful result:

Proposition G.19.

Let $\mathcal{R}$ be a $d$ -dimensional manifold in $\mathbb{R}^{d}$ , suppose that $\eta:\mathcal{R}\rightarrow[0,1]$ satisfies the condition that $\mathcal{S}:=\{x\in\mathcal{R}:\eta(x)=1/2\}$ is non-empty. Suppose further that there exists $\epsilon>0$ such that $\eta$ is twice continuously differentiable on $\mathcal{S}^{\epsilon}$ . Assume that $\dot{\eta}(x_{0})\neq 0$ for all $x_{0}\in\mathcal{S}$ . Define $g:\mathcal{S}\times(-\epsilon,\epsilon)\rightarrow\mathcal{S}^{\epsilon}$ by

$$g(x_{0},t):=x_{0}+\frac{t\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}.$$

If

[TABLE]

then $g$ is injective. In fact $g$ is a diffeomorphism, with

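$$Dg_{(x_{0},t)}(v_{1},v_{2})=(I+tB)v_{1}+\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}v_{2} \tag{56}$$

for $v_{1}\in T_{x_{0}}(\mathcal{S})$ and $v_{2}\in\mathbb{R}$ , where

$$B:=\frac{1}{\|\dot{\eta}(x_{0})\|}\Bigl(I-\frac{\dot{\eta}(x_{0})\dot{\eta}(x_{0})^{T}}{\|\dot{\eta}(x_{0})\|^{2}}\Bigr)\ddot{\eta}(x_{0}).$$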

Proof G.20.

Assume for a contradiction that there exist distinct points $x_{1},x_{2}\in\mathcal{S}$ and $t_{1},t_{2}\in(-\epsilon,\epsilon)$ with $|t_{1}|\geq|t_{2}|$ such that

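$$x_{1}+\frac{t_{1}\dot{\eta}(x_{1})}{\|\dot{\eta}(x_{1})\|}=x_{2}+\frac{t_{2}\dot{\eta}(x_{2})}{\|\dot{\eta}(x_{2})\|}.$$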

Then

[TABLE]

By Taylor’s theorem and (58),

[TABLE]

contradicting the hypothesis (55).

To show that $g$ is a diffeomorphism, let $x_{0}\in\mathcal{S}$ be given and let $\phi:U\rightarrow\mathcal{S}$ be a local parametrisation around $x_{0}$ with $\phi(0)=x_{0}$ . Define $\Phi:U\times(-\epsilon,\epsilon)\rightarrow\mathcal{S}\times(-\epsilon,\epsilon)$ by $\Phi(u,t):=(\phi(u),t)$ , and $H:U\times(-\epsilon,\epsilon)\rightarrow\mathcal{S}^{\epsilon}$ by $H:=g\circ\Phi$ . Finally, define the Gauss map $n:\mathcal{S}\rightarrow\mathbb{R}^{d}$ by $n(x_{0}):=\dot{\eta}(x_{0})/\|\dot{\eta}(x_{0})\|$ . Then, for $h=(h_{1}^{T},h_{2})^{T}\in\mathbb{R}^{d-1}\times\mathbb{R}$ and $s\in\mathbb{R}\setminus\{0\}$ ,

[TABLE]

where $Dg_{(x_{0},t)}:T_{x_{0}}(\mathcal{S})\times\mathbb{R}\rightarrow\mathbb{R}^{d}$ is given in (56).

To show that $Dg_{(x_{0},t)}$ is invertible, note that for $v_{1}\in T_{x_{0}}(\mathcal{S})$ and $|t|<\epsilon$ ,

[TABLE]

where the final inequality follows from (55). Then, since $v_{1}+\frac{t}{\|\dot{\eta}(x_{0})\|}\Bigl{(}I-\frac{\dot{\eta}(x_{0})\dot{\eta}(x_{0})^{T}}{\|\dot{\eta}(x_{0})\|^{2}}\Bigr{)}\ddot{\eta}(x_{0})v_{1}$ and $n(x_{0})v_{2}$ are orthogonal, it follows that $Dg_{(x_{0},t)}$ is indeed invertible. The inverse function theorem (e.g. Guillemin and Pollack, 1974, p. 13) then gives that $g$ is a local diffeomorphism, and moreover, by Guillemin and Pollack (1974, Exercise 5, p. 18) and the fact that $g$ is bijective, we can conclude that $g$ is in fact a diffeomorphism.
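The role of the restriction on $\epsilon$ can be seen in a simple case. The sketch below assumes $\eta(x)=1/2+(\|x\|^{2}-R^{2})/2$ near the circle $\mathcal{S}=\{x:\|x\|=R\}$ in $\mathbb{R}^{2}$ , so that $\dot{\eta}(x_{0})=x_{0}$ and $\ddot{\eta}(x_{0})=I$ , and takes $B$ to be the matrix $\frac{1}{\|\dot{\eta}(x_{0})\|}\bigl(I-\frac{\dot{\eta}(x_{0})\dot{\eta}(x_{0})^{T}}{\|\dot{\eta}(x_{0})\|^{2}}\bigr)\ddot{\eta}(x_{0})$ appearing in the proof above. It illustrates why some upper bound on $\epsilon$ is needed; it is not a restatement of condition (55), and all names are hypothetical.

```python
import numpy as np

R = 1.0   # radius of the illustrative circle S = {x : ||x|| = R}


def g(x0, t):
    """g(x0, t) = x0 + t * eta_dot(x0) / ||eta_dot(x0)||, with eta_dot(x0) = x0 here."""
    return x0 + t * x0 / np.linalg.norm(x0)


def B_matrix(x0):
    """Tangential part of the derivative of the unit normal, for the assumed eta."""
    grad, hess = x0, np.eye(2)        # first and second derivatives of eta at x0
    norm = np.linalg.norm(grad)
    n = grad / norm
    return (np.eye(2) - np.outer(n, n)) @ hess / norm


x1 = np.array([R, 0.0])
x2 = -x1

# If eps were allowed to exceed R, two distinct pairs would share an image.
collision = np.allclose(g(x1, -1.2 * R), g(x2, -0.8 * R))

# For the circle, det(I + tB) = 1 + t/R, which degenerates at t = -R.
dets = {t: np.linalg.det(np.eye(2) + t * B_matrix(x1)) for t in (-1.0, -0.5, 0.0, 0.5)}

print(collision)
print(dets)
```

In this example $g$ is injective on $\mathcal{S}\times(-\epsilon,\epsilon)$ whenever $\epsilon\leq R$ , and $\det(I+tB)$ stays positive on that range, in line with the invertibility of $Dg_{(x_{0},t)}$ .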

G.3 Forms, pullbacks and integration on manifolds

Let $V$ be a (real) vector space of dimension $m$ . We say $T:V^{p}\rightarrow\mathbb{R}$ is a $p$ -tensor on $V$ if it is $p$ -linear, and write $\mathcal{F}^{p}(V^{*})$ for the set of $p$ -tensors on $V$ . If $T\in\mathcal{F}^{p}(V^{*})$ and $S\in\mathcal{F}^{q}(V^{*})$ , we define their tensor product $T\otimes S\in\mathcal{F}^{p+q}(V^{*})$ by

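$$(T\otimes S)(v_{1},\ldots,v_{p},v_{p+1},\ldots,v_{p+q}):=T(v_{1},\ldots,v_{p})\,S(v_{p+1},\ldots,v_{p+q}).$$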

Let $S_{p}$ denote the set of permutations of $\{1,\ldots,p\}$ . If $\pi\in S_{p}$ and $T\in\mathcal{F}^{p}(V^{*})$ , we can define $T^{\pi}\in\mathcal{F}^{p}(V^{*})$ by $T^{\pi}(v):=T(v_{\pi(1)},\ldots,v_{\pi(p)})$ for $v=(v_{1},\ldots,v_{p})\in V^{p}$ . We say $T$ is alternating if $T^{\sigma}=-T$ for all transpositions $\sigma:\{1,\ldots,p\}\rightarrow\{1,\ldots,p\}$ . The set of alternating $p$ -tensors on $V$ , denoted $\Lambda^{p}(V^{*})$ , is a vector space of dimension $\binom{m}{p}$ . The function $\mathrm{Alt}:\mathcal{F}^{p}(V^{*})\rightarrow\Lambda^{p}(V^{*})$ is defined by

$$\mathrm{Alt}(T):=\frac{1}{p!}\sum_{\pi\in S_{p}}\mathrm{sgn}(\pi)\,T^{\pi},$$

where $\mathrm{sgn}(\pi)$ denotes the sign of the permutation $\pi$ . If $T\in\Lambda^{p}(V^{*})$ and $S\in\Lambda^{q}(V^{*})$ , we define their wedge product $T\wedge S\in\Lambda^{p+q}(V^{*})$ by

$$T\wedge S:=\mathrm{Alt}(T\otimes S).$$

If $W$ is another (real) vector space and $A:V\rightarrow W$ is a linear map, we define the transpose $A^{\ast}:\Lambda^{p}(W^{*})\rightarrow\Lambda^{p}(V^{*})$ of $A$ by

$$A^{\ast}T(v_{1},\ldots,v_{p}):=T(Av_{1},\ldots,Av_{p}).$$

Let $\mathcal{S}$ be a manifold. A $p$ -form $\omega$ on $\mathcal{S}$ is a function which assigns to each $x\in\mathcal{S}$ an element $\omega(x)\in\Lambda^{p}(T_{x}(\mathcal{S})^{*})$ . If $\omega$ is a $p$ -form on $\mathcal{S}$ and $\theta$ is a $q$ -form on $\mathcal{S}$ , we can define their wedge product $\omega\wedge\theta$ by $(\omega\wedge\theta)(x):=\omega(x)\wedge\theta(x)$ . For $j=1,\ldots,m$ , let $x_{j}:\mathbb{R}^{m}\rightarrow\mathbb{R}$ denote the coordinate function $x_{j}(y_{1},\ldots,y_{m}):=y_{j}$ . These functions induce $1$ -forms $dx_{j}$ , given by $dx_{j}(x)(y_{1},\ldots,y_{m})=y_{j}$ (so $dx_{j}(x)=D(x_{j})_{x}$ in our previous notation). Letting $\mathcal{I}:=\{(i_{1},\ldots,i_{p}):1\leq i_{1}<\ldots<i_{p}\leq m\}$ , for $I=(i_{1},\ldots,i_{p})\in\mathcal{I}$ , we write

$$dx_{I}:=dx_{i_{1}}\wedge\ldots\wedge dx_{i_{p}}.$$

It turns out (Guillemin and Pollack, 1974, p. 163) that any $p$ -form on an open subset $U$ of $\mathbb{R}^{m}$ can be uniquely expressed as

$$\omega=\sum_{I\in\mathcal{I}}f_{I}\,dx_{I}, \tag{59}$$

where each $f_{I}$ is a real-valued function on $U$ .

Recall that the set of all ordered bases of a vector space $V$ is partitioned into two equivalence classes, and an orientation of $V$ is simply an assignment of a positive sign to one equivalence class and a negative sign to the other. If $V$ and $W$ are oriented vector spaces in the sense that an orientation has been specified for each of them, then an isomorphism $A:V\rightarrow W$ always either preserves orientation in the sense that for any ordered basis $\beta$ of $V$ , the ordered basis $A\beta$ has the same sign as $\beta$ , or it reverses it. We say an $m$ -dimensional manifold $\mathcal{X}$ is orientable if for every $x\in\mathcal{X}$ , there exist an open subset $U$ of $\mathbb{R}^{m}$ , a neighbourhood $V$ of $x$ in $\mathcal{X}$ and a diffeomorphism $\phi:U\rightarrow V$ such that $D\phi_{u}:\mathbb{R}^{m}\rightarrow T_{\phi(u)}(\mathcal{X})$ preserves orientation for every $u\in U$ . A map like $\phi$ above whose derivative at every point preserves orientation is called an orientation-preserving map.

If $\mathcal{X}$ and $\mathcal{Y}$ are manifolds, $\omega$ is a $p$ -form on $\mathcal{Y}$ and $\psi:\mathcal{X}\rightarrow\mathcal{Y}$ is differentiable, we define the pullback $\psi^{\ast}\omega$ of $\omega$ by $\psi$ to be the $p$ -form on $\mathcal{X}$ given by

$$\psi^{\ast}\omega(x):=(D\psi_{x})^{\ast}\omega\bigl(\psi(x)\bigr).$$

If $V$ is a $p$ -dimensional vector space and $A:V\rightarrow V$ is linear, then $A^{\ast}T=(\det A)T$ for all $T\in\Lambda^{p}(V^{*})$ (Guillemin and Pollack, 1974, p. 160).
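The determinant identity can be verified numerically by representing a $p$ -tensor on $\mathbb{R}^{m}$ as an array with $p$ axes of length $m$ . The sketch below assumes the conventions of Guillemin and Pollack (1974), namely $\mathrm{Alt}(T)=\frac{1}{p!}\sum_{\pi}\mathrm{sgn}(\pi)T^{\pi}$ and $T\wedge S=\mathrm{Alt}(T\otimes S)$ , under which $dx_{1}\wedge\ldots\wedge dx_{m}$ takes the value $1/m!$ on the standard basis. The array representation, the function names and the matrix $A$ are illustrative choices.

```python
import itertools
import math

import numpy as np


def sign(perm):
    """Sign of a permutation of (0, ..., p-1), computed from its inversion count."""
    inversions = sum(
        1 for i in range(len(perm)) for j in range(i + 1, len(perm)) if perm[i] > perm[j]
    )
    return -1 if inversions % 2 else 1


def alt(T):
    """Alt(T) = (1/p!) sum_pi sgn(pi) T^pi; summing over all pi covers inverses too."""
    p = T.ndim
    total = sum(sign(perm) * np.transpose(T, perm) for perm in itertools.permutations(range(p)))
    return total / math.factorial(p)


def wedge(T, S):
    """Wedge product as Alt of the tensor (outer) product."""
    return alt(np.tensordot(T, S, axes=0))


def pullback(A, T):
    """(A^* T)(v_1, ..., v_p) = T(A v_1, ..., A v_p): contract every axis of T with A."""
    out = T
    for axis in range(T.ndim):
        out = np.moveaxis(np.tensordot(out, A, axes=([axis], [0])), -1, axis)
    return out


e = np.eye(3)                                   # rows are the 1-tensors dx_1, dx_2, dx_3
vol = wedge(wedge(e[0], e[1]), e[2])            # dx_1 ^ dx_2 ^ dx_3 on R^3
A = np.array([[2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [1.0, 0.0, 1.0]])

print(vol[0, 1, 2])                             # 1/3! under this convention
print(np.allclose(pullback(A, vol), np.linalg.det(A) * vol))
```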

If $\omega$ is an $m$ -form on an open subset $U$ of $\mathbb{R}^{m}$ , then by (59), we can write $\omega=f\,dx_{1}\wedge\ldots\wedge dx_{m}$ . If $\omega$ is an integrable form on $U$ (i.e. $f$ is an integrable function on $U$ ), we can define the integral of $\omega$ over $U$ by

$$\int_{U}\omega:=\int_{U}f\,dx_{1}\ldots dx_{m},$$

where the integral on the right-hand side is a usual Lebesgue integral. Now let $\mathcal{S}$ be an $m$ -dimensional orientable manifold that can be parametrised with a single chart, in the sense that there exists an open subset $U$ of $\mathbb{R}^{m}$ and an orientation-preserving diffeomorphism $\phi:U\rightarrow\mathcal{S}$ . Define the support of an $m$ -form $\omega$ on $\mathcal{S}$ to be the closure of $\{x\in\mathcal{S}:\omega(x)\neq 0\}$ . If $\omega$ is compactly supported, then its pullback $\phi^{\ast}\omega$ is a compactly supported $m$ -form on $U$ ; moreover $\phi^{\ast}\omega$ is integrable, and we can define the integral over $\mathcal{S}$ of $\omega$ by

$$\int_{\mathcal{S}}\omega:=\int_{U}\phi^{\ast}\omega. \tag{60}$$

Alternatively, we can suppose that $\omega$ is non-negative and measurable in the sense that $\phi^{\ast}\omega=f\,dx_{1}\wedge\ldots\wedge dx_{m}$ , say, with $f$ non-negative and measurable on $U$ . In this case, we can also define the integral of $\omega$ over $\mathcal{S}$ via (60).

More generally, integrals of forms over more complicated manifolds can be defined via partitions of unity. Recall (Guillemin and Pollack, 1974, p. 52) that if $\mathcal{X}$ is an arbitrary subset of $\mathbb{R}^{M}$ , and $\{V_{\alpha}:\alpha\in A\}$ is a (relatively) open cover of $\mathcal{X}$ , then there exists a sequence of real-valued, differentiable functions $(\rho_{n})$ on $\mathcal{X}$ , called a partition of unity with respect to $\{V_{\alpha}:\alpha\in A\}$ , with the following properties:

1. $\rho_{n}(x)\in[0,1]$ for all $n\in\mathbb{N}$ ;

2. Each $x\in\mathcal{X}$ has a neighbourhood on which all but finitely many functions $\rho_{n}$ are identically zero;

3. Each $\rho_{n}$ is identically zero except on some closed set contained in some $V_{\alpha}$ ;

4. $\sum_{n=1}^{\infty}\rho_{n}(x)=1$ for all $x\in\mathcal{X}$ .
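A simple instance of such a partition can be built on $\mathcal{X}=\mathbb{R}$ from translates of a smooth bump function. The sketch below assumes the open cover $V_{c}=(c-3/2,c+3/2)$ for integer $c$ and normalises the bumps so that they sum to one; the cover, the bump function and all names are choices made for illustration, and the properties are only checked on a finite grid.

```python
import numpy as np


def bump(x):
    """Smooth function that is positive on (-1, 1) and zero elsewhere."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    inside = np.abs(x) < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - x[inside] ** 2))
    return out


def partition_of_unity(x, centres):
    """rho_c(x) = bump(x - c) / sum_j bump(x - j), one row per centre c."""
    raw = np.array([bump(x - c) for c in centres])
    return raw / raw.sum(axis=0)


x = np.linspace(-3.0, 3.0, 601)     # evaluation grid
centres = np.arange(-5, 6)          # enough centres to cover the grid
rho = partition_of_unity(x, centres)

print(rho.min(), rho.max())                  # values lie in [0, 1]
print(np.allclose(rho.sum(axis=0), 1.0))     # the functions sum to one
print((rho > 0).sum(axis=0).max())           # at most two are non-zero at any point
```

Each $\rho_{c}$ vanishes outside the closed interval $[c-1,c+1]\subseteq V_{c}$ , and at most two of the functions are non-zero at any given point.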

Now let $\mathcal{S}\subseteq\mathbb{R}^{d}$ be an $m$ -dimensional, orientable manifold, so for each $x\in\mathcal{S}$ , there exist an open subset $U_{x}$ of $\mathbb{R}^{m}$ , a neighbourhood $V_{x}$ of $x$ in $\mathcal{S}$ and an orientation-preserving diffeomorphism $\phi_{x}:U_{x}\rightarrow V_{x}$ . If $\omega$ is a compactly supported $m$ -form on $\mathcal{S}$ and $(\rho_{n})$ denotes a partition of unity on $\mathcal{S}$ with respect to $\{V_{x}:x\in\mathcal{S}\}$ , we can define the integral of $\omega$ over $\mathcal{S}$ by

$$\int_{\mathcal{S}}\omega:=\sum_{n=1}^{\infty}\int_{\mathcal{S}}\rho_{n}\omega. \tag{61}$$

In fact, writing $\Omega$ for the compact support of $\omega$ , we can find a neighbourhood $W_{x}$ of $x\in\Omega$ , $x_{1},\ldots,x_{N}\in\Omega$ and a finite subset $N_{j}$ of $\mathbb{N}$ such that $\{\rho_{n}:n\notin N_{j}\}$ are identically zero on $W_{x_{j}}$ , and such that

[TABLE]

Thus the integral can be written as a finite sum. Similarly, if $\omega$ is a non-negative $m$ -form on $\mathcal{S}$ , we can again define the integral of $\omega$ over $\mathcal{S}$ via (61). Finally, if $\omega$ is an integrable $m$ -form on $\mathcal{S}$ , the integral can be defined by taking positive and negative parts in the usual way.

In our work, we are especially interested in integrals of a particular type of form. Given an $m$ -dimensional, orientable manifold $\mathcal{S}$ in $\mathbb{R}^{d}$ , the volume form $d\mathrm{Vol}^{m}$ is the unique $m$ -form on $\mathcal{S}$ such that at each $x\in\mathcal{S}$ , the alternating $m$ -tensor $d\mathrm{Vol}^{m}(x)$ on $T_{x}(\mathcal{S})$ gives value $1/m!$ to each positively oriented orthonormal basis for $T_{x}(\mathcal{S})$ . For example, when $\mathcal{S}=\mathbb{R}^{m}$ , we have $d\mathrm{Vol}^{m}=dx_{1}\wedge\ldots\wedge dx_{m}$ , provided we consider the standard basis to be positively oriented. As another example, if $\mathcal{R}\subseteq\mathbb{R}^{d}$ is a $d$ -dimensional manifold and $\eta:\mathcal{R}\rightarrow\mathbb{R}$ is continuously differentiable with $\mathcal{S}=\{x\in\mathcal{R}:\eta(x)=1/2\}$ non-empty and $\dot{\eta}(x)\neq 0$ for $x\in\mathcal{S}$ , then $\mathcal{S}$ is a $(d-1)$ -dimensional, orientable manifold (Guillemin and Pollack, 1974, Exercise 18, p. 106). If we say that an ordered, orthonormal basis $e_{1},\ldots,e_{d-1}$ for $T_{x_{0}}(\mathcal{S})$ is positively oriented whenever $\det(e_{1},\ldots,e_{d-1},\dot{\eta}(x_{0}))>0$ , we have that

[TABLE]

where $x_{j}$ denotes the $j$ th coordinate function. We now define an ordered, orthonormal basis $(e_{1},0),\ldots,(e_{d-1},0),(0,1)$ for $T_{x_{0}}(\mathcal{S})\times\mathbb{R}$ to be positively oriented. Further, we define a $(d-1)$ -form $\omega_{1}$ and a $1$ -form $\omega_{2}$ on $\mathcal{S}\times(-\epsilon,\epsilon)$ by

[TABLE]

Then, with $g$ defined as in Proposition G.19, and under the conditions of that proposition,

[TABLE]

so $g^{*}(dx_{1}\wedge\ldots\wedge dx_{d})(x_{0},t)=\det(I+tB)\ (\omega_{1}\wedge\omega_{2})(x_{0},t)$ . It follows that if $h:\mathcal{S}\times(-\epsilon,\epsilon)\rightarrow\mathbb{R}$ is either compactly supported and integrable, or non-negative and measurable, then

[TABLE]

We also require the change of variables formula: if $\mathcal{X}$ and $\mathcal{Y}$ are orientable manifolds and are of dimension $m$ , and if $\psi:\mathcal{X}\rightarrow\mathcal{Y}$ is an orientation-preserving diffeomorphism, then

$$\int_{\mathcal{X}}\psi^{\ast}\omega=\int_{\mathcal{Y}}\omega$$

for every compactly supported, integrable $m$ -form on $\mathcal{Y}$ (Guillemin and Pollack, 1974, p. 168). In particular, if $f:\mathcal{S}^{\epsilon}\rightarrow\mathbb{R}$ is either compactly supported and integrable, or non-negative and measurable, then writing $x_{0}^{t}:=x_{0}+\frac{t\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}$ , we have from (62) and (63) that

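$$\int_{\mathcal{S}^{\epsilon}}f(x)\,dx=\int_{\mathcal{S}}\int_{-\epsilon}^{\epsilon}\det(I+tB)\,f(x_{0}^{t})\,dt\,d\mathrm{Vol}^{d-1}(x_{0}).$$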
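As a sanity check on this change of variables, suppose that the resulting identity expresses $\int_{\mathcal{S}^{\epsilon}}f(x)\,dx$ as an integral over $x_{0}\in\mathcal{S}$ and $t\in(-\epsilon,\epsilon)$ of $\det(I+tB)\,f(x_{0}^{t})$ . The sketch below tests that reading on the circle of radius $r$ in $\mathbb{R}^{2}$ , assuming $\eta(x)=1/2+(\|x\|^{2}-r^{2})/2$ near $\mathcal{S}$ and $B=\frac{1}{\|\dot{\eta}(x_{0})\|}\bigl(I-\frac{\dot{\eta}(x_{0})\dot{\eta}(x_{0})^{T}}{\|\dot{\eta}(x_{0})\|^{2}}\bigr)\ddot{\eta}(x_{0})$ . The choice of $\eta$ , the test function $f(x)=\|x\|^{2}$ , the quadrature rules and all names are assumptions made for the example.

```python
import numpy as np


def tube_integral(f, r, eps, n_theta=64, n_t=8):
    """Approximate the double integral over S = {||x|| = r} and t in (-eps, eps)."""
    nodes, weights = np.polynomial.legendre.leggauss(n_t)
    t_nodes, t_weights = eps * nodes, eps * weights       # rescale [-1, 1] to [-eps, eps]
    d_arc = r * 2.0 * np.pi / n_theta                      # arc-length element on the circle
    total = 0.0
    for theta in np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False):
        x0 = r * np.array([np.cos(theta), np.sin(theta)])
        grad, hess = x0, np.eye(2)                         # derivatives of the assumed eta at x0
        norm = np.linalg.norm(grad)
        n = grad / norm
        B = (np.eye(2) - np.outer(n, n)) @ hess / norm
        inner = sum(
            w * np.linalg.det(np.eye(2) + t * B) * f(x0 + t * n)
            for t, w in zip(t_nodes, t_weights)
        )
        total += inner * d_arc
    return total


R, EPS = 1.0, 0.3
approx = tube_integral(lambda x: x @ x, R, EPS)
# Direct computation in polar coordinates over the annulus R - EPS < ||x|| < R + EPS.
exact = 0.5 * np.pi * ((R + EPS) ** 4 - (R - EPS) ** 4)

print(approx, exact)
```

Here $\det(I+tB)=1+t/r$ , which is exactly the factor needed to convert arc length on $\mathcal{S}$ into arc length on the circle of radius $r+t$ .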

Bibliography

Abramson, I. S. (1982). On bandwidth estimation in kernel estimates – a square root law. Ann. Statist., 10, 1217–1223.
Audibert, J.-Y. and Tsybakov, A. B. (2007). Fast learning rates for plug-in classifiers. Ann. Statist., 35, 608–633.
Berrett, T. B. and Samworth, R. J. (2019a). Efficient two-sample functional estimation and the super-oracle phenomenon. https://arxiv.org/abs/1904.09347.
Berrett, T. B. and Samworth, R. J. (2019b). Nonparametric independence testing via mutual information. Biometrika, to appear.
Berrett, T. B., Samworth, R. J. and Yuan, M. (2019). Efficient multivariate entropy estimation via $k$-nearest neighbour distances. Ann. Statist., 47, 288–318.
Biau, G., Cérou, F. and Guyader, A. (2010). On the rate of convergence of the bagged nearest neighbor estimate. J. Mach. Learn. Res., 11, 687–712.
Biau, G. and Devroye, L. (2015). Lectures on the Nearest Neighbor Method. Springer, New York.
Boucheron, S., Bousquet, O. and Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM: PS, 9, 323–375.
