Nonlinear generalization of the monotone single index model

Zeljko Kereta; Timo Klock; Valeriya Naumova

arXiv:1902.09024·math.ST·December 8, 2020


TL;DR

This paper introduces a nonlinear extension of the single index model that employs multiple index vectors and local estimation techniques, enhancing flexibility and adaptability in modeling complex relationships.

Contribution

It proposes a novel nonlinear generalization of the single index model using local linear regression and geodesic metrics, with theoretical guarantees and empirical validation.

Findings

1. The method accurately estimates local index vectors.
2. It outperforms state-of-the-art methods on synthetic data.
3. It demonstrates strong predictive performance on real-world datasets.

Abstract

The single index model is a powerful yet simple model, widely used in statistics, machine learning, and other scientific fields. It models the regression function as $g(\langle a, x \rangle)$, where $a$ is an unknown index vector and $x$ are the features. This paper deals with a nonlinear generalization of this framework to allow for a regressor that uses multiple index vectors, adapting to local changes in the responses. To do so we exploit the conditional distribution over function-driven partitions, and use linear regression to locally estimate index vectors. We then regress by applying a kNN type estimator that uses a localized proxy of the geodesic metric. We present theoretical guarantees for estimation of local index vectors and out-of-sample prediction, and demonstrate the performance of our method with experiments on synthetic and real-world data sets, comparing it with state-of-the-art methods.


Tables

Table 1: Summary of the notation used in the paper

symbol	definition
geometry
$γ, Im (γ)$	$γ : I \subset ℝ \to ℝ^{D}$ is the parametrization of $Im (γ) = γ (I)$
$π_{γ}$	the orthogonal projection onto $Im (γ)$ , see (3)
$τ_{γ}$	$\sup \{ r > 0 : \forall x \in ℝ^{D} ∖ Im (γ) \text{ s.t. } dist (x; Im (γ)) < r \ \exists! z \in Im (γ) \text{ s.t. } dist (x; z) = dist (x; Im (γ)) \}$
$d_{γ} (v, v^{'})$	geodesic distance for $v, v^{'} \in Im (γ)$ , extended by $d_{γ} (x, x^{'}) := d_{γ} (π_{γ} (x), π_{γ} (x^{'}))$
$ℬ_{m} (x, R)$	ball of radius $R$ around a point $x$ , with respect to a metric $m$
$κ$	bound for the curvature of $γ$ , i.e. $κ = {‖ γ^{''} ‖}_{\infty}$
$Q_{ℛ}, P_{ℛ}$	projections onto the normal space ($Q_{ℛ}$) and the tangent space ($P_{ℛ}$) at ${\bar{t}}_{ℛ} = 𝔼 [t \mid Y \in ℛ]$; here $P_{ℛ} = γ^{'} ({\bar{t}}_{ℛ}) γ^{'} {({\bar{t}}_{ℛ})}^{⊤}$ and $Q_{ℛ} = 𝖨𝖽 - P_{ℛ}$
probability
$(X, Y)$	random vector in $ℝ^{D} \times ℝ$ with a distribution $ρ$ , and the marginal of $X$ is $ρ_{X}$
$V, W$	random vectors such that $X = V + W$ , where $V = π_{γ} (X) \in Im (γ)$
$𝔼 X, Cov (X)$	the expectation and the covariance of a random variable $X$
$\hat{𝔼} X, \hat{Σ}$	empirical mean and sample covariance over all samples
$𝔼 [V \mid ℛ], Cov (X \mid ℛ)$	shorthand for conditional mean $𝔼 [V \mid Y \in ℛ]$ and conditional covariance $Cov (X \mid Y \in ℛ)$
${\hat{𝔼}}_{𝒰} X, {\hat{Σ}}_{𝒰}$	mean and covariance over samples that belong to $𝒰$; ${\hat{𝔼}}_{𝒰} X = \frac{1}{| 𝒰 |} \sum_{X \in 𝒰} X$
constants
$L_{f}$	the bi-Lipschitz constant of the function $g$, see (10)
$J$	number of level sets, i.e. the size of the partitioning of the data; $𝒳 = \cup_{j = 1}^{J} {𝒳_{j}}$, see (12)
$σ_{ε}$	bound on the noise term $ε$, i.e., $| ε | \leq σ_{ε}$, where $Y = f (X) + ε$, see (A1)
$C_{W}$	constant in bounding influence of cross-covariance, see (A3)
$C_{⟂}$	lower-bound for non-zero eigenvalues in directions normal to $γ$ , see (A4)
$B$	bound for $dist (X; Im (γ))$ , see (A5)
$c_{V}$	uniformity constant for the distribution along $Im (γ)$ , see (A6)

Table 2: Comparison of NSIM assumptions (A1)-(A6) with assumptions in SIM and manifold regression theory. Here $a$ denotes the (unit) index vector in SIM. Assumptions (A1)-(A4) are common in the study of linear sufficient dimension reduction, whereas (A5)-(A6) reflect the constraints imposed by the non-linearity of the setting, and are common in manifold regression problems. We add, though, that (A5) is a significant relaxation of standard assumptions in manifold regression, which require $B=0$ or $B\ll\tau_{\gamma}$.

NSIM	implication on SIM	comparable assumption in the literature
(A1)	$Y = f (a^{⊤} x) + ε$, $ε \perp\!\!\!\perp X \mid a^{⊤} X$	the setting is often studied in SIM literature, e.g. in [16, 38]
(A2)	$𝔼 [X \mid P X] = P X$ for $P = a a^{⊤}$	integral part for inverse regression based techniques, usually implied by ellipticity, e.g. [29, 33, 34]
(A3)	implied by (A1) and (A2)	-
(A4)	$v^{⊤} Cov (X \mid ℛ) v > C_{⟂}$ for all $v ⟂ a$, $‖ v ‖ = 1$	implied by the constant conditional covariance assumption used sometimes for sufficient dimension reduction, e.g. [8, 27, 28]
(A5)	there exists $B > 0$ such that $‖ X ‖ \leq B < \infty$	existing methods require $B = 0$ to prove regression rates that do not depend exponentially on $D$ , e.g. [3, 22]
(A6)	$a^{⊤} X$ is absolutely continuous with respect to the Lebesgue measure on the image of $a^{⊤} X$	this is common to ensure that the manifold is covered well enough, e.g. [32]

Table 3: RMSE, standard deviation, and cross-validated hyper-parameters, over $30$ repetitions for several estimators and real-world data sets. Values for $k$, $J$, and for numbers of iterations and nodes, are averages over different runs of each experiment. The first $5$ rows describe the data sets and their characteristics, and the remaining rows contain the results. For a simplified presentation, we divide the mean and STD of RMSE, and the mean and STD of the data (5th row) by the value in the row Factor.

Characteristics	Yacht	Istanbul	Ames	Concrete	Air Quality	Boston	Skillcraft
$\log$ -TF	Yes	No	Yes	No	No	Yes	Yes
$D, N$	$6, 307$	$7, 536$	$7, 1197$	$8, 1030$	$11, 7393$	$12, 506$	$16, 3338$
Factor	$10^{1}$	$10^{- 2}$	$10^{5}$	$10^{1}$	$10^{- 1}$	$10^{1}$	$10^{2}$
$\bar{Y} \pm STD (Y)$	$1.05 \pm 1.51$	$0.16 \pm 2.11$	$1.74 \pm 0.67$	$3.58 \pm 1.67$	$9.95 \pm 4.03$	$1.27 \pm 0.71$	$1.15 \pm 0.48$
Method
NSIM-dyad	$0.15 \pm 0.04$	$1.52 \pm 0.14$	$0.23 \pm 0.04$	$0.9 \pm 0.06$	$0.82 \pm 0.04$	$0.42 \pm 0.06$	$0.08 \pm 0.01$
$k$	$11.6$	$19.9$	$14.6$	$46.0$	$60.8$	$33.1$	$19.0$
$J$	$2.4$	$1.1$	$2.4$	$3.9$	$5.6$	$1.0$	$4.1$
NSIM-stat	$0.12 \pm 0.03$	$1.39 \pm 0.18$	$0.23 \pm 0.03$	$0.97 \pm 0.06$	$0.80 \pm 0.02$	$0.42 \pm 0.04$	$0.08 \pm 0.01$
$k$	$8.6$	$19.3$	$18.2$	$41.6$	$69.3$	$43.0$	$17.7$
$J$	$5.5$	$1.0$	$3.1$	$2.7$	$5.3$	$1.0$	$5.2$
Lin-Reg	$0.22 \pm 0.07$	$1.38 \pm 0.13$	$0.23 \pm 0.02$	$1.06 \pm 0.06$	$1.22 \pm 0.03$	$0.50 \pm 0.11$	$0.14 \pm 0.03$
kNN	$0.76 \pm 0.11$	$1.52 \pm 0.16$	$0.26 \pm 0.03$	$0.89 \pm 0.08$	$1.03 \pm 0.02$	$0.41 \pm 0.06$	$0.17 \pm 0.01$
$k$	$1.1$	$17.8$	$9.8$	$5.5$	$25.0$	$6.8$	$9.8$
SIR-kNN	$0.26 \pm 0.11$	$1.48 \pm 0.16$	$0.25 \pm 0.03$	$1.05 \pm 0.06$	$1.87 \pm 0.04$	$0.47 \pm 0.05$	$0.17 \pm 0.01$
$k$	$10.4$	$21.7$	$20.0$	$48.4$	$137.5$	$43.5$	$37.1$
$J$	$10.8$	$7.4$	$21.8$	$3.0$	$4.8$	$8.5$	$25.6$
Isotron	$0.15 \pm 0.05$	$1.42 \pm 0.11$	$0.24 \pm 0.03$	$1.03 \pm 0.05$	$0.83 \pm 0.03$	$0.42 \pm 0.05$	$0.08 \pm 0.01$
Iterations	$460.0$	$343.75$	$338.75$	$392.5$	$596.25$	$280.0$	$425.0$
ELM-Sig	$0.44 \pm 0.30$	$1.46 \pm 0.15$	$0.23 \pm 0.04$	$0.72 \pm 0.05$	$0.58 \pm 0.12$	$0.44 \pm 0.06$	$0.20 \pm 0.04$
Nodes	$88.8$	$15.2$	$54.0$	$86.3$	$91.8$	$46.2$	$77.1$
SNN-Tan	$0.48 \pm 0.20$	$1.61 \pm 0.21$	$0.25 \pm 0.04$	$0.80 \pm 0.07$	$0.14 \pm 0.04$	$0.41 \pm 0.05$	$0.04 \pm 0.01$
Nodes	$9.4$	$3.0$	$18.95$	$15.1$	$15.95$	$13.0$	$14.0$
SNN-Sig	$0.30 \pm 0.11$	$1.65 \pm 0.27$	$0.23 \pm 0.03$	$0.63 \pm 0.05$	$0.18 \pm 0.02$	$0.41 \pm 0.05$	$0.04 \pm 0.00$
Nodes	$13.0$	$3.9$	$8.1$	$16.9$	$21.5$	$7.5$	$10.4$

Table 4: Bounds for the perturbation terms based on Lemma 17 and (44). $C$ is a universal constant. Here we used that $\eta<\infty$ implies ${\kappa}C_{W}\left|{{\cal S}}\right|^{\alpha}({\lambda}C_{\perp})^{-1/2}\leq 1$ to simplify the bounds.

Term	Bound multiplied with $C (\log (D) + u) N^{- 1 / 2}$	Shorthand notation
$‖ P Σ^{†} Δ P ‖$	$η (\frac{1}{λ} + \frac{B_{+}}{\sqrt{λ C_{⟂}}}) \leq 2 η θ$	$T_{1}$
$‖ Q Σ^{†} Δ Q ‖$	$η (\frac{B_{+}^{2}}{C_{⟂}} + \frac{B_{+}}{\sqrt{λ C_{⟂}}}) \leq 2 η θ$	$T_{2}$
$‖ Q Σ^{†} Δ P ‖$	$η | 𝒮 | (\frac{B_{+}}{C_{⟂}} + \frac{1}{\sqrt{λ C_{⟂}}}) \leq 2 η θ | 𝒮 |$	$T_{3}$
$‖ P Σ^{†} Δ Q ‖$	$η {| 𝒮 |}^{- 1} (\frac{B_{+}}{λ} + \frac{B_{+}^{2}}{\sqrt{λ C_{⟂}}}) \leq 2 η θ {| 𝒮 |}^{- 1}$	$T_{4}$
$‖ P Σ^{†} Δ Σ^{†} P ‖$	${(\frac{η}{| 𝒮 |})}^{2} {(\frac{1}{λ} + \frac{B_{+}}{\sqrt{C_{⟂} λ}})}^{2} \leq 4 {(\frac{η}{| 𝒮 |})}^{2} θ^{2}$	$T_{5}$
$‖ Q Σ^{†} Δ Σ^{†} Q ‖$	$η^{2} {(\frac{B_{+}}{C_{⟂}} + \frac{1}{\sqrt{C_{⟂} λ}})}^{2} \leq 4 η^{2} θ^{2}$	$T_{6}$
$‖ P Σ^{†} Δ Σ^{†} Q ‖$	$\frac{η^{2}}{| 𝒮 |} (\frac{1}{λ} + \frac{B_{+}}{\sqrt{C_{⟂} λ}}) (\frac{B_{+}}{C_{⟂}} + \frac{1}{\sqrt{C_{⟂} λ}}) \leq 4 \frac{η^{2} θ^{2}}{| 𝒮 |}$	$T_{7} \leq \sqrt{T_{5} T_{6}}$

Equations

$$Y = f(X) + \varepsilon, \quad \text{for } f(X) = g(\langle a, X \rangle),$$

$$Y = f(X) + \varepsilon, \quad \text{for } f(X) = g(\pi_{\gamma}(X)),$$

$$\pi_{\gamma}(x) \in \operatorname*{argmin}_{z \in \operatorname{Im}(\gamma)} \|x - z\| .$$

$$\mathbb{E}[Y \mid X, f(X) \in {\cal R}_{j}] \approx g_{j}(\langle a_{j}, X \rangle), \quad j = 1, \dots, J,$$

$${\cal Y}_{j} := {\cal Y} \cap {\cal R}_{j}, \quad {\cal X}_{j} := \{X_{i} \in {\cal X} : Y_{i} \in {\cal Y}_{j}\} .$$

$$\hat{b}_{j} := \operatorname*{argmin}_{P_{\ker(\hat{\Sigma}_{j})} \omega = 0} \hat{\mathbb{E}}_{({\cal X}_{j}, {\cal Y}_{j})} \left(Y - \hat{\mathbb{E}}_{{\cal Y}_{j}} Y - \langle \omega, X - \hat{\mathbb{E}}_{{\cal X}_{j}} X \rangle\right)^{2},$$

$$\hat{b}_{j} := \hat{\Sigma}_{j}^{\dagger} \hat{\mathbb{E}}_{({\cal X}_{j}, {\cal Y}_{j})} \left((Y - \hat{\mathbb{E}}_{{\cal Y}_{j}} Y)(X - \hat{\mathbb{E}}_{{\cal X}_{j}} X)\right) .$$

$$\Delta_{\eta}(x, X_{i}) := \begin{cases} |\hat{a}(X_{i})^{\top}(x - X_{i})| & \text{if } \|x - X_{i}\| \leq \eta, \\ \infty & \text{else,} \end{cases}$$

$$\hat{f}_{k}(x) := \frac{1}{k} \sum_{i=1}^{k} Y_{i(x)} .$$

$$\max_{i \in [N]} \left\| \hat{a}(X_{i}) - \gamma'(\gamma^{-1} \circ \pi_{\gamma}(X_{i})) \right\| \lesssim \frac{\kappa}{J} + \frac{\log(J)}{NJ},$$

$$L_{f}^{-1} d_{\gamma}(v, v') \leq |g(v) - g(v')| \leq L_{f} d_{\gamma}(v, v'), \quad \text{for all } v, v' \in \operatorname{Im}(\gamma) .$$

$$\pi_{D}(X) - \pi_{D'}(X) = \mathbb{E}[X \mid \pi_{D}(X)] - \mathbb{E}[X \mid \pi_{D'}(X)], \quad \text{a.s.}$$

$$\mathbb{E}[X \mid f(X)] = \mathbb{E}[X \mid g(\pi_{D'}(X))] = \mathbb{E}[X \mid \pi_{D'}(X)],$$

$$v^{\top} \operatorname{Cov}(X \mid {\cal R}) v > C_{\perp} > 0 .$$

$$\text{let } {\cal R}_{j} := \left[\tfrac{j-1}{J}, \tfrac{j}{J}\right] \text{ and define } {\cal Y}_{j} := {\cal Y} \cap {\cal R}_{j}, \text{ and } {\cal X}_{j} := \{X_{i} \in {\cal X} : Y_{i} \in {\cal Y}_{j}\} .$$

$$\hat{b}_{j} := \hat{\Sigma}_{j}^{\dagger} \hat{\mathbb{E}}_{({\cal X}_{j}, {\cal Y}_{j})} (Y - \hat{\mathbb{E}}_{{\cal Y}_{j}} Y)(X - \hat{\mathbb{E}}_{{\cal X}_{j}} X) .$$

$$\|\hat{a}(X_{i}) - a(X_{i})\| \leq \|\hat{a}_{j(X_{i})} - a_{j(X_{i})}\| + \|a_{j(X_{i})} - a(X_{i})\| \leq \|\hat{a}_{j(X_{i})} - a_{j(X_{i})}\| + \kappa_{j} |{\cal S}_{j}|,$$

$$4 \sigma_{\varepsilon} < J^{-1} < \left(\tfrac{2}{3}\right)^{3/2} \frac{\sigma_{j,Y} \sigma_{\perp}}{L_{f} \kappa_{j} C_{W}^{*}}, \quad \text{and} \quad |{\cal X}_{j}| \geq \max\{C_{N} (\log(D) + u)^{2}, D\},$$

$$\mathbb{P}\left(\|\hat{a}_{j} - a_{j}\| \leq C_{A} \frac{\kappa_{j}}{J^{2}} + C_{E} \frac{\log(D) + u}{|{\cal X}_{j}| J}\right) \geq 1 - \exp(-u) .$$

$$4 \sigma_{\varepsilon} < \frac{1}{J} < \left(\tfrac{2}{3}\right)^{3/2} \frac{\sigma_{J,Y} \sigma_{\perp}}{L_{f} \kappa C_{W}^{*}}, \quad \text{where } \sigma_{J,Y} := \max_{j \in [J]} \frac{\operatorname{Var}(a^{\top} X, Y \mid {\cal R}_{j})}{|{\cal S}_{j}| |{\cal R}_{j}|},$$

$$N \geq C_{N} \max\{(\log(D) + \log(J) u)^{2}, D\} u J,$$

$$\mathbb{P}\left(\max_{i \in [N]} \|\hat{a}(X_{i}) - a(X_{i})\| \leq C_{A} \frac{\kappa}{J} + C_{E} \frac{\log(D) u + \log(J) u^{2}}{NJ}\right) \geq 1 - \exp(-u) .$$

$$d_{\gamma}(x, \bar{x}_{k(x)}) \leq \frac{2 \vee (|{\cal I}| + 2B)}{\theta - \kappa B} \left(d_{\gamma}(x, \bar{x}_{k^{*}(x)}) + \max_{i \in [N]} \|\hat{a}(\bar{x}_{i}) - a(\bar{x}_{i})\|\right) .$$

$$\begin{aligned} \|\hat{a}(X_{\ell}') - a(X_{\ell}')\| &\leq \max_{i \in [N]} \|\hat{a}(X_{i}) - a(X_{i})\| + \kappa\, d_{\gamma}(X_{\ell}', X_{1(X_{\ell}')}) \\ &\leq \frac{1 + \kappa (2 \vee (|{\cal I}| + 2B))}{\theta - \kappa B} \left(\max_{i \in [N]} \|\hat{a}(X_{i}) - a(X_{i})\| + d_{\gamma}(X_{\ell}', X_{1^{*}(X_{\ell}')})\right) \end{aligned}$$

$$|\hat{f}_{k}(x) - f(x)| \leq C \frac{\sigma_{\varepsilon} u}{k} + \frac{C_{B}}{(\theta - \kappa B)^{2}} \left(u \frac{k}{N} + C_{E} \frac{\log(D) u + \log(J) u^{2}}{NJ} + C_{A} \frac{\kappa}{J}\right) .$$

$$|\hat{f}_{k}(x) - f(x)| = \left|\frac{1}{k} \sum_{\ell=1}^{k} Y_{\ell(x)}' - f(x)\right| \leq \left|\frac{1}{k} \sum_{\ell=1}^{k} \varepsilon_{\ell(x)}'\right| + \left|\frac{1}{k} \sum_{\ell=1}^{k} f(X_{\ell(x)}') - f(x)\right| .$$

$$\mathbb{P}\left(\left|\frac{1}{k} \sum_{\ell=1}^{k} \varepsilon_{\ell(x)}'\right| \leq C \frac{\sigma_{\varepsilon} u}{k}\right) \geq 1 - \exp(-u^{2}) \geq 1 - \exp(-u) .$$

$$\left|\frac{1}{k} \sum_{\ell=1}^{k} f(X_{\ell(x)}') - f(x)\right|$$


Code & Models

Repositories

soply/local_sim_experiments (official)


Taxonomy

Methods: Linear Regression

Full text

Nonlinear generalization of the monotone single index model

Željko Kereta, Simula Research Laboratory, Machine Intelligence Department, Oslo, Norway

Timo Klock, Simula Research Laboratory, Machine Intelligence Department, Oslo, Norway

Valeriya Naumova, SimulaMet, Machine Intelligence Department, Oslo, Norway

Abstract

The single index model is a powerful yet simple model, widely used in statistics, machine learning, and other scientific fields. It models the regression function as $g(\left<{a},{x}\right>)$, where $a$ is an unknown index vector and $x$ are the features. This paper deals with a nonlinear generalization of this framework to allow for a regressor that uses multiple index vectors, adapting to local changes in the responses. To do so we exploit the conditional distribution over function-driven partitions, and use linear regression to locally estimate the index vectors. We then regress by applying a kNN type estimator that uses a localized proxy of the geodesic metric. We present theoretical guarantees for estimation of local index vectors and out-of-sample prediction, and demonstrate the performance of our method with experiments on synthetic and real-world data sets, comparing it with state-of-the-art methods.

Keywords: high-dimensional regression, dimension reduction, single index model, nonparametric regression, nonlinear methods

1 Introduction

Many problems in data analysis can be formulated as learning a function from a given data set in a high-dimensional space. Due to the curse of dimensionality, accurate regression on high-dimensional functions typically requires a number of samples that scales exponentially with the ambient dimension [41]. A common approach to mitigating these effects is to impose structural assumptions on the data. Indeed, a number of recent advances in data analysis and numerical simulation are based on the observation that high-dimensional, real-world data is inherently structured, and that the relationship between the features and the responses is of a lower dimensional nature [1].

The most direct such model, which has become an important prior for many statistical and machine learning paradigms, considers a $1$-dimensional relationship of the form

$$Y = f(X) + \varepsilon, \quad \text{for } f(X) = g(\left<{a},{X}\right>), \tag{1}$$

where $\varepsilon$ is a random noise term, and the features $X\in\mathbb{R}^{D}$ and responses $Y\in\mathbb{R}$ are related through an unknown index vector $a\in\mathbb{R}^{D}$ and an unknown monotonic function $g$. Model (1) is called the single index model (SIM), and it first appeared in the economics and statistics communities in the early 90s [15, 18]. Moreover, SIM provides a basis for more complex models such as multi-index models [5, 28, 42] and neural networks [24].

An assumption shared by SIM and its generalizations is that there is a single lower dimensional linear space that accounts for the complexity in relating $X$ and $Y$. While simple, this assumption is only a first level approximation and is rarely observed in real-world regression problems. The goal of this paper is to relax the assumption of global linearity in model (1), in order to locally adapt to changes in the relationship between $X$ and $Y$. Specifically, we propose the nonlinear single index model (NSIM), defined by

$$Y = f(X) + \varepsilon, \quad \text{for } f(X) = g(\pi_{\gamma}(X)), \tag{2}$$

where $\varepsilon$ is a random noise term, $g$ is a bi-Lipschitz function, $\gamma:{\cal I}\rightarrow\mathbb{R}^{D}$ is a parametrization of a ${\cal C}^{2}$ curve $\operatorname{Im}(\gamma)$, and $\pi_{\gamma}$ is the corresponding orthogonal projection, defined by

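$$\pi_{\gamma}(x) \in \operatorname*{argmin}_{z \in \operatorname{Im}(\gamma)} \left\|{x - z}\right\| . \tag{3}$$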

The function $g$ can be seen as a univariate scalar function, defined on the parametrization domain ${\cal I}$, provided $\operatorname{Im}(\gamma)$ is a simple curve. This identification is useful for defining examples of the setting, and reveals SIM as a special example of (2), where $\gamma(t)=at$.

Before formally describing the assumptions and details of our approach, let us begin with a couple of comments. Recall that smooth curves can be locally approximated by affine approximations, i.e., $\pi_{\gamma}(x)\approx\left<{a_{j}},{x}\right>+c_{j}$, where $a_{j}$ is a local tangent vector of $\operatorname{Im}(\gamma)$. Problem (2) can therefore be approximated by a family of problems of the type $f(x)\approx g_{j}\left(\left<{a_{j}},{x}\right>\right)$, where $j$ corresponds to pieces of $\operatorname{Im}(\gamma)$ that are approximately affine. Notice now that due to the monotonicity of $g$, the proximity of $f(x)$ and $f(x^{\prime})$ implies the proximity of $\pi_{\gamma}(x)$ and $\pi_{\gamma}(x^{\prime})$, and vice versa. Therefore, instead of looking at approximately affine pieces of $\operatorname{Im}(\gamma)$, we can equivalently consider a partition of $\operatorname{Im}(f)$, consisting of disjoint intervals ${\cal R}_{j}$, and split (2) into a family of localized SIM problems
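As a short worked check of this special case, assume (for illustration only) that $a$ is a unit vector and that ${\cal I}=\mathbb{R}$. The projection onto the line $\gamma(t)=at$ is then obtained by minimizing a quadratic in $t$:

$$\left\|{x-at}\right\|^{2}=\left\|{x}\right\|^{2}-2t\left<{a},{x}\right>+t^{2}\quad\Longrightarrow\quad t^{*}=\left<{a},{x}\right>,\qquad \pi_{\gamma}(x)=\gamma(t^{*})=\left<{a},{x}\right>a .$$

Identifying $\operatorname{Im}(\gamma)$ with ${\cal I}$ through $\gamma$, the regression function in (2) depends on $x$ only through $\left<{a},{x}\right>$, which is the SIM form.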

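$$\mathbb{E}[Y \mid X, f(X) \in {\cal R}_{j}] \approx g_{j}(\left<{a_{j}},{X}\right>), \quad j = 1, \dots, J, \tag{4}$$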

where tangent vectors $a_{j}$ now play the role of index vectors in (1).

In Figure 1 we study the effects of such an approach on several UCI data sets (https://archive.ics.uci.edu/ml/datasets.html). Namely, for each data set we partition the data into $J$ sets, as detailed above, learn a SIM estimator on each of the $J$ sets, and then plot the generalization error of the resulting estimator as a function of the hyperparameter $J$. Given a sufficient amount of data, we can observe that replacing (1) with (4), its localized counterpart, often returns better estimation results. For example, on the Yacht data set the generalization error improves by more than $30$ percent for $J=5$ compared to SIM. Notice though that increasing the number of localized pieces does not always improve the performance. This can mostly be attributed to the fact that splitting the original data set into disjoint subgroups reduces the number of samples within each group, which has a detrimental effect on the variance of the estimator. In other words, we face a typical bias-variance trade-off, implying that the hyperparameter $J$ needs to be carefully selected. Furthermore, sometimes a SIM is indeed the best fit to the data (e.g. the Boston data set). As shown in the experiments in Section 5, this will be detected by our approach when combined with cross-validation to choose $J$.

Related work.

To the best of our knowledge, relaxations of SIM have not yet been considered in this form. However, three research areas are highly relevant: linear single- and multi-index models, nonlinear sufficient dimension reduction, and manifold regression. Below we provide a short overview of the most significant and relevant achievements in each of these fields.

Single- and multi-index models have been extensively researched, and we therefore restrict ourselves to conceptually related work. Most studies focus only on the estimation of index vector(s), which started with the early work on linear regression based methods [4, 14, 31, 40]. The most relevant work is [16], where the authors use iterative local linear regression to estimate the index vector $a$. Locality is enforced by kernel weights, which are initially set to be spherical around the estimation point, and then iteratively reshaped so that the isolines resemble level set boundaries of a strictly monotonic link function $g$. This approach has been extended to the case of multiple index vectors [7], estimating instead the corresponding index space.

Another relevant line of work is methods based on inverse regression, which began with the introduction of sliced inverse regression (SIR) [29]. This was followed by SAVE [8], PHD [30], MAVE [46], Contour regression [28], Directional regression [27], etc. The common thread shared by these methods is the use of inverse moments, such as $\mathbb{E}[X|Y]$ and $\operatorname{Cov}\left({X|Y}\right)$, to estimate the index vector or the index space.

Several methods simultaneously learn the link function and the index vector. We mention Isotron [20] and Slisotron [19], which iteratively update the link function and the index vector; [10] that additionally assumes sparsity of the index vector; [6, 23] that use an iterative procedure and spline estimates; [38] that uses higher dimensional splines.

On the other hand, methods and theory for nonlinear sufficient dimension reduction are still in the early stages and there are many open questions. Most of the existing studies consider kernelized versions of linear estimators (such as SIR or SAVE) to globally linearize the problem in feature space, and then apply well-known regression methods, see [25, 26, 45, 47].

Model (2) can also be considered from the viewpoint of manifold regression, where the goal is to estimate a function $f:{\cal M}\rightarrow\mathbb{R}$ defined on the data. Manifold regression methods, such as [3, 22, 36], generally assume that the marginal distribution of $X$ is either supported on ${\cal M}$ or in its close vicinity. As a consequence, Euclidean distances can be used to locally approximate the geodesic metric. This is a strong assumption which is implicitly or explicitly leveraged by all manifold regression techniques, and presents a breaking point for their effective use. In this work, we instead consider distributions that are spread in all directions of the ambient space around the curve $\gamma$ . Consequently, geodesic proximity cannot be inferred from Euclidean distances and we instead need to locally approximate the geodesic distance.
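The gap between Euclidean and geodesic proximity can be seen in a toy configuration. The sketch below is purely illustrative and not taken from the paper: it assumes the curve is the first coordinate axis, $\gamma(t)=(t,0)$, so that the geodesic distance between two points is the gap between their first coordinates. The helper names `geodesic_dist` and `euclidean_dist` are hypothetical.

```python
import numpy as np

def geodesic_dist(x, x_prime):
    # Toy curve gamma(t) = (t, 0): pi_gamma(x) = (x[0], 0), so the geodesic
    # distance between projections is the gap in the first coordinate.
    return float(abs(x[0] - x_prime[0]))

def euclidean_dist(x, x_prime):
    return float(np.linalg.norm(np.asarray(x) - np.asarray(x_prime)))

x = np.array([0.0, 5.0])                 # query point, far from the curve
same_projection = np.array([0.0, -5.0])  # projects onto the same curve point
shifted = np.array([3.0, 5.0])           # projects 3 units further along

for name, p in [("same_projection", same_projection), ("shifted", shifted)]:
    print(name, euclidean_dist(x, p), geodesic_dist(x, p))
```

In this configuration the Euclidean metric ranks `shifted` as the closer point (distance 3 versus 10), whereas the geodesic distance ranks `same_projection` as closer (distance 0 versus 3). A Euclidean nearest-neighbor rule would therefore pick the sample whose response differs more.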

Main idea and estimation procedure for the NSIM model.

Model (2) increases the flexibility of the ordinary SIM by allowing for varying index vectors, corresponding to different regimes of the response $f(X)$ . Consequently, a natural approach would be to partition the data into several groups, based on $Y$ , and use a SIM-like estimator to approximate the index vector and the regression function. In particular, our approach follows three steps.

In the first step we partition the data set $\{(X_{i},Y_{i}):i\in[N]\}$ into $J$ sets, $\{{\cal X}_{j}:j\in[J]\}$ and $\{{\cal Y}_{j}:j\in[J]\}$. To do so we define a disjoint union of the responses, $\operatorname{Im}(Y)=\cup_{j=1}^{J}{\cal R}_{j}$ for intervals ${\cal R}_{j}$, and then set

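$${\cal Y}_{j} := {\cal Y} \cap {\cal R}_{j}, \quad {\cal X}_{j} := \{X_{i} \in {\cal X} : Y_{i} \in {\cal Y}_{j}\} . \tag{5}$$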

We refer to sets ${\cal X}_{j}$ as level sets, since they can be defined as ${\cal X}_{j}={\cal X}\cap f^{-1}({\cal R}_{j})$ in the noise-free case. The optimal method for partitioning $\operatorname{Im}(Y)$ as $\cup_{j=1}^{J}{\cal R}_{j}$ depends on the marginal distribution of $Y$, and is best chosen after inspecting the empirical density. For example, we suggest using dyadic cells of $[\min Y,\max Y]$ if the density of $Y$ is roughly uniform, and stochastically equivalent blocks if the probability mass is unevenly distributed.

In the second step we compute estimates $\{\hat{a}_{j}:j\in[J]\}$ of local index vectors by using linear regression on ${\cal X}_{j}$ and ${\cal Y}_{j}$. Namely, let $\hat{\Sigma}_{j}:=\hat{\mathbb{E}}_{{\cal X}_{j}}{(X-\hat{\mathbb{E}}_{{\cal X}_{j}}{X})(X-\hat{\mathbb{E}}_{{\cal X}_{j}}{X})^{\top}}$ be the standard finite sample estimate for the conditional covariance $\operatorname{Cov}\left({X|Y\in{\cal R}_{j}}\right)$, where $\hat{\mathbb{E}}$ denotes the empirical expectation. Then, set $\hat{a}_{j}=\hat{b}_{j}/\|\hat{b}_{j}\|$, where $\hat{b}_{j}$ is the solution of linear regression,
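The partitioning step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `partition_responses` and the scheme labels are hypothetical, "dyadic cells" are represented by $J$ equal-width cells of $[\min Y,\max Y]$ (which coincide with dyadic cells when $J$ is a power of two), and "stochastically equivalent blocks" are interpreted here as empirical-quantile blocks with roughly equal sample counts.

```python
import numpy as np

def partition_responses(Y, J, scheme="equal_width"):
    """Return a level-set label in {0, ..., J-1} for every response in Y."""
    Y = np.asarray(Y, dtype=float)
    if scheme == "equal_width":
        # J cells of equal length covering [min Y, max Y]
        edges = np.linspace(Y.min(), Y.max(), J + 1)
    elif scheme == "equal_mass":
        # J blocks holding roughly the same number of samples
        edges = np.quantile(Y, np.linspace(0.0, 1.0, J + 1))
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    # Only the J-1 interior edges are needed; side="right" assigns a value
    # that equals an interior edge to the upper cell.
    return np.searchsorted(edges[1:-1], Y, side="right")

# Skewed toy responses: most of the mass sits near zero.
Y_toy = np.array([0.0, 0.1, 0.2, 0.3, 10.0, 20.0, 30.0, 100.0])
labels_width = partition_responses(Y_toy, J=2, scheme="equal_width")
labels_mass = partition_responses(Y_toy, J=2, scheme="equal_mass")
```

Given the labels, the level sets are recovered as `X[labels == j]` and `Y[labels == j]`. On the skewed toy responses the equal-width scheme leaves a single sample in the upper cell, while the equal-mass scheme splits the samples four and four, which is the situation the text describes for unevenly distributed probability mass.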

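$$\hat{b}_{j} := \operatorname*{argmin}_{P_{\ker(\hat{\Sigma}_{j})} \omega = 0} \hat{\mathbb{E}}_{({\cal X}_{j}, {\cal Y}_{j})} \left(Y - \hat{\mathbb{E}}_{{\cal Y}_{j}} Y - \left<{\omega},{X - \hat{\mathbb{E}}_{{\cal X}_{j}} X}\right>\right)^{2}, \tag{6}$$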

or equivalently,

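$$\hat{b}_{j} := \hat{\Sigma}_{j}^{\dagger} \hat{\mathbb{E}}_{({\cal X}_{j}, {\cal Y}_{j})} \left((Y - \hat{\mathbb{E}}_{{\cal Y}_{j}} Y)(X - \hat{\mathbb{E}}_{{\cal X}_{j}} X)\right) . \tag{7}$$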

Intuitively, vectors $\hat{a}_{j}$ correspond to directions in which the function changes, and therefore approximate local gradient directions of $f$. In the case of an ordinary SIM, it has been shown in [2] that the direction of the global linear regression vector, denoted by $\hat{a}$, is an unbiased estimator of the index vector $a$, if $X$ has an elliptical distribution. Furthermore, $\sqrt{N}(\hat{a}-a)$ is asymptotically normal, hence $\hat{a}$ achieves $N^{-1/2}$-consistency. As we will see in Sections 2 and 3, in our case the analysis of $\hat{a}_{j}$ is more challenging due to the underlying nonlinear geometry.

In the final step we use a kNN-type estimator to predict $f(x)$ for an out-of-sample $x$. The performance of kNN estimators hinges on how distances between $x$ and the training samples are measured, so the critical point of this step is the selection of an appropriate distance function. The issue is that the optimal choice (the geodesic metric on $\operatorname{Im}(\gamma)$) is not available since $\operatorname{Im}(\gamma)$ is not known, and the naive choice (the Euclidean metric) generally leads to estimation rates that depend on the ambient dimension, and thus suffer from the curse of dimensionality.

To develop a proxy metric, consider now the ordinary SIM. Here the geodesic metric is equivalent to the Euclidean distance of projected samples if $\hat{a}$ approximates the true index vector $a$ sufficiently accurately, i.e., $\left|{\left<{\hat{a}},{(x-x^{\prime})}\right>}\right|$ is a good proxy for the geodesic metric provided $\left\|{\hat{a}-a}\right\|$ is small. Moreover, training the kNN estimator on projected samples $(\left<{\hat{a}},{X_{i}}\right>,Y_{i})$ achieves optimal univariate regression rates. The NSIM case is more challenging because first, we have $J$ different index vectors to choose from, and second, $x$ cannot be a priori assigned to any level set since $f(x)$ is unknown. Still, if we assign to each sample $X_{i}$ the index vector $\hat{a}(X_{i}):=\hat{a}_{j(X_{i})}$, where $j(X_{i})$ is the index of the unique level set with $X_{i}\in{\cal X}_{j(X_{i})}$, we can show that
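The second step can be sketched in a few lines of NumPy. This is a minimal illustration of the closed form $\hat{b}_{j}=\hat{\Sigma}_{j}^{\dagger}\,\hat{\mathbb{E}}[(Y-\hat{\mathbb{E}}Y)(X-\hat{\mathbb{E}}X)]$ followed by normalization, not the authors' code. The function name `local_index_vectors`, the integer `labels` array encoding level-set membership, and the choice to leave a zero row for level sets with fewer than two samples are assumptions made here.

```python
import numpy as np

def local_index_vectors(X, Y, labels, J):
    """Estimate one unit index vector per level set via local linear regression."""
    D = X.shape[1]
    A_hat = np.zeros((J, D))
    for j in range(J):
        Xj, Yj = X[labels == j], Y[labels == j]
        if len(Yj) < 2:
            continue  # assumption: keep a zero row for (almost) empty level sets
        Xc = Xj - Xj.mean(axis=0)           # centered features in level set j
        Yc = Yj - Yj.mean()                 # centered responses in level set j
        sigma_j = Xc.T @ Xc / len(Yj)       # empirical conditional covariance
        cross_j = Xc.T @ Yc / len(Yj)       # empirical cross-covariance of X and Y
        b_j = np.linalg.pinv(sigma_j) @ cross_j  # Moore-Penrose solution
        norm = np.linalg.norm(b_j)
        if norm > 0:
            A_hat[j] = b_j / norm           # a_hat_j = b_hat_j / ||b_hat_j||
    return A_hat

# Sanity check on a noise-free linear SIM, where every local vector equals a.
rng = np.random.default_rng(0)
a = np.array([0.6, 0.8, 0.0, 0.0])
X = rng.standard_normal((500, 4))
Y = X @ a
labels = (Y > 0).astype(int)
A_hat = local_index_vectors(X, Y, labels, J=2)
```

In this sanity check the response is exactly linear in $X$, so each local regression recovers $a$ up to floating-point error regardless of the slicing. For a curved $\operatorname{Im}(\gamma)$ the rows of `A_hat` would differ across level sets.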

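$$\Delta_{\eta}(x, X_{i}) := \begin{cases} \left|{\hat{a}(X_{i})^{\top}(x - X_{i})}\right| & \text{if } \left\|{x - X_{i}}\right\| \leq \eta, \\ \infty & \text{else,} \end{cases} \tag{8}$$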

approximates the geodesic metric $d_{\gamma}(\pi_{\gamma}(x),\pi_{\gamma}(X_{i}))$ reasonably well, under a suitable choice of the restricting radius $\eta$, see Section 4.2. In the special case of a perturbed SIM, where $\gamma$ is not too far from an affine space, this is also true for $\eta=\infty$, see Section 4.1.

This motivates the following estimator: let $(X_{i(x)},Y_{i(x)})$ denote the $i$-th closest sample to $x$ when measured in $\Delta_{\eta}({x},{\cdot})$, where ties can be broken arbitrarily. Then set

$$\hat{f}_{k}(x) := \frac{1}{k} \sum_{i=1}^{k} Y_{i(x)} . \tag{9}$$

As we will discuss in Section 4.2, the radius $\eta$ plays a dual role. It needs to be large enough so that there are enough samples to choose neighbors from, but small enough so that (8) is a good proxy for the geodesic metric. The entire estimation approach is summarized in Algorithm 1.
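The prediction step can be sketched as follows, assuming the local index vectors have already been estimated. This is an illustrative sketch rather than the authors' Algorithm 1: the function name `knn_predict` and its argument layout are hypothetical, the proxy distance is taken as $|\hat{a}(X_{i})^{\top}(x-X_{i})|$ within radius $\eta$ and $\infty$ outside, and ties (including ties among samples at infinite distance) are broken by sample order, which is one admissible way of breaking them arbitrarily.

```python
import numpy as np

def knn_predict(x, X, Y, A_hat, labels, k, eta=np.inf):
    """Average the responses of the k training samples closest to x in Delta_eta."""
    x = np.asarray(x, dtype=float)
    tangents = A_hat[labels]                    # a_hat(X_i) = a_hat_{j(X_i)}, shape (N, D)
    diffs = x - X                               # x - X_i for every training sample
    proxy = np.abs(np.sum(tangents * diffs, axis=1))     # |a_hat(X_i)^T (x - X_i)|
    proxy[np.linalg.norm(diffs, axis=1) > eta] = np.inf  # samples outside the radius
    nearest = np.argsort(proxy, kind="stable")[:k]       # ties broken by sample order
    return float(np.mean(Y[nearest]))

# Toy SIM with a single level set and index vector (1, 0).
X_toy = np.array([[0.0, 0.0], [1.0, 5.0], [2.0, -3.0], [3.0, 1.0]])
Y_toy = X_toy[:, 0]
A_toy = np.array([[1.0, 0.0]])
labels_toy = np.zeros(4, dtype=int)

pred_global = knn_predict([1.1, 100.0], X_toy, Y_toy, A_toy, labels_toy, k=2)
pred_local = knn_predict([1.1, 0.0], X_toy, Y_toy, A_toy, labels_toy, k=1, eta=1.5)
```

With $\eta=\infty$ the query `[1.1, 100.0]` is matched to the samples with first coordinates 1 and 2 despite its large offset in the second coordinate, so `pred_global` is 1.5. With $\eta=1.5$ only the sample at the origin lies inside the radius for the query `[1.1, 0.0]`, so `pred_local` is 0.0, which illustrates how a small radius restricts the pool of candidate neighbors.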

Computational complexity.

The first two steps, partitioning and computing tangents, are dominated by ${\cal O}(\min\{JD^{3},JND^{2}\}+ND^{2})$ operations, which is mostly due to forming covariance matrices and computing the generalized inverse. Out-of-sample prediction requires ${\cal O}(N+JD)$ operations per evaluation.

Contributions and organization of the paper.

In this work we introduce a nonlinear generalization of the SIM and study estimation of the model from $N$ given data points $\{(X_{i},Y_{i}):i\in[N]\}$ sampled i.i.d. from an unknown distribution $\rho$. The presented model synthesizes the fields of linear sufficient dimension reduction and manifold regression, thereby attempting to extend both. We first develop a rigorous mathematical framework, in Section 2, through which NSIM can be theoretically analyzed.

We provide a simple and efficient estimator based on output-conditional linear regression and kNN-regression. Theoretical guarantees of the approach are the subject of Sections 3 (local index vectors) and 4 (function estimation). In summary, we achieve optimal estimation rates [13, 21] in the noise-free scenario ($\varepsilon=0$ almost surely), or if the data follows the ordinary SIM. In the general case, the estimator remains biased.

The theoretical analysis of local index vector (or tangent field) estimation requires a careful study of (conditional) ordinary linear regression (6). In particular, two sources of error are present: a bias term, which decays when increasing the number $J$ of subsets in the level set partition, and a variance term, which decays with the number of samples $N$. Our analysis reveals a concentration bound of the form

[TABLE]

where $\kappa$ is a curvature bound for $\operatorname{Im}(\gamma)$ . This is a surprising result because both the bias and the variance decrease with $J$ (as long as the noise $\varepsilon$ is negligible compared to $J^{-1}$ ). This observation is a key component for establishing optimal regression rates in the noise-free case.

For the regression analysis, we show in Section 4 that $\Delta_{\eta}({x},{\cdot})$ is equivalent to the geodesic metric $d_{\gamma}(\pi_{\gamma}(x),\cdot)$ , up to an error made in tangent field estimation. This suffices to establish aforementioned kNN-regression guarantees. These results are relevant from a more general perspective, because they can readily be used with other means of estimating the tangent field, and can be extended to higher dimensional manifolds.

In Section 5 we conclude the paper with extensive numerical tests on synthetic and real data sets that have previously been used as benchmarks for the SIM. The results show that the extended flexibility of NSIM is beneficial for both out-of-sample prediction and model interpretability.

General notation.

We use $[N]=\{1,\ldots,N\}$ for $N\in\mathbb{N}$ . $\left\|{\cdot}\right\|$ denotes the Euclidean norm for vectors, and the spectral norm for matrices. $d_{\gamma}$ denotes the geodesic metric on $\operatorname{Im}(\gamma)$ . Provided that $\gamma$ is an arc-length parametrization, this means $d_{\gamma}(\gamma(t_{1}),\gamma(t_{2}))=\left|{t_{1}-t_{2}}\right|$ . We extend the notation to $x,x^{\prime}\in\mathbb{R}^{D}$ by setting $d_{\gamma}(x,x^{\prime}):=d_{\gamma}(\pi_{\gamma}(x),\pi_{\gamma}(x^{\prime}))$ whenever projections $\pi_{\gamma}(\cdot)$ are uniquely defined. For a discrete set of points $A=\{x_{1},\ldots,x_{k}\}\subset\mathbb{R}^{D}$ we use $\left|{A}\right|$ to denote its number of elements. On the other hand, if $A$ is a connected subsegment of $\operatorname{Im}(\gamma)$ or if $A\subset{\cal R}$ is an interval, then $\left|{A}\right|$ denotes its length. By an interval $A\subset\mathbb{R}$ we always refer to a closed and connected subset of the real line. We use $a\vee b=\max\{a,b\}$ and $a\wedge b=\min\{a,b\}$ . The Moore-Penrose inverse of a matrix $M$ is denoted by $M^{\dagger}$ .

The abbreviation a.s. is used as a shorthand for almost sure events (with respect to implicit random vectors), and iid. refers to independent and identically distributed data sampling. Table 1 contains an overview of notation and constants used in this paper.

2 Theoretical framework for the NSIM model

Due to the broadness of its scope, it is relatively easy to construct examples of NSIM that fit the model but for which estimation from finite samples is not possible. The goal in this section is to define a framework that allows a rigorous analysis, yet is broad enough to encompass both the SIM and its nonlinear generalization NSIM. In the following we describe the assumptions on the function class, the underlying nonlinearity $\operatorname{Im}(\gamma)$ , and on the distribution of the data set.

Regularity assumptions for $f$ and $\operatorname{Im}(\gamma)$ .

Let $\gamma:{\cal I}\rightarrow\mathbb{R}^{D}$ , for an interval ${\cal I}\subset\mathbb{R}$ , be an arc-length parametrization of a simple, connected, and ${\cal C}^{2}$ smooth curve, denoted $\operatorname{Im}(\gamma)=\gamma({\cal I})$ , and set $\kappa={\left\|{\gamma^{\prime\prime}}\right\|}_{\infty}<\infty$ . We consider Lipschitz functions $f:\Omega\subset\mathbb{R}^{D}\rightarrow\mathbb{R}$ that satisfy $f(x)=g(\pi_{\gamma}(x))$ for some ${L_{f}}$ -bi-Lipschitz function $g:\operatorname{Im}(\gamma)\rightarrow\mathbb{R}$ , that is

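$$L_{f}^{-1}\,d_{\gamma}(v,v^{\prime})\leq\left|{g(v)-g(v^{\prime})}\right|\leq L_{f}\,d_{\gamma}(v,v^{\prime})\quad\text{for all }v,v^{\prime}\in\operatorname{Im}(\gamma).$$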

Through rescaling we can always assume $\operatorname{Im}(f)=[0,1]$ . We can, without loss of generality, align $\gamma$ with $\nabla f$ , i.e., choose an orientation such that $\left\langle\nabla f(\gamma(t)),\gamma^{\prime}(t)\right\rangle>0$ , for almost every $t\in{\cal I}$ . An important quantity is the reach $\tau_{\gamma}$ of $\operatorname{Im}(\gamma)$ , defined as the largest $r>0$ such that any point at distance less than $r$ from $\operatorname{Im}(\gamma)$ has a unique nearest point on $\operatorname{Im}(\gamma)$ [9]. This ensures that $\pi_{\gamma}(x)$ , and thus $f(x)$ , is well defined for all $x$ within the reach, i.e., all $x$ such that $\min_{z\in\operatorname{Im}(\gamma)}\left\|{x-z}\right\|<\tau_{\gamma}$ .

Distributional assumptions.

We consider distributions $\rho$ for which the distribution of $X|Y\in{\cal R}$ is absolutely continuous with respect to the Lebesgue measure on $\operatorname{Im}(\operatorname{Cov}\left({X|{\cal R}}\right))$ for any non-empty interval ${\cal R}\subset[0,1]$ , and which satisfy assumptions (A1) - (A6) below.

Assumptions (A1) - (A4) are related to single- and multi-index model literature (or more broadly sufficient dimension reduction literature, see [34] for a review), whereas (A5) - (A6) are related to manifold regression. We begin by describing the behavior of the noise $\varepsilon$ .

(A1)

For $\varepsilon:=Y-\mathbb{E}[Y|X]=Y-g(\pi_{\gamma}(X))$ , we assume $\varepsilon\perp\!\!\!\perp X|\pi_{\gamma}(X)$ , and $\left|{\varepsilon}\right|\leq\sigma_{\varepsilon}$ a.s.

In sufficient dimension reduction problems, $\varepsilon\perp\!\!\!\perp X|\pi_{\gamma}(X)$ is more commonly written $Y\perp\!\!\!\perp X|\pi_{\gamma}(X)$ .

The next assumption states that $\operatorname{Im}(\gamma)$ is centered in the middle of the distribution.

(A2)

$\mathbb{E}[X|\pi_{\gamma}(X)]=\pi_{\gamma}(X)$ holds $\pi_{\gamma}(X)$ -a.s.

This is inspired by the linear conditional mean assumption from the single- and multi-index model literature, and is an integral component of every method based on inverse regression [33, 34]. It is needed to ensure the recovery of a subspace of the index space in the population regime $(N\rightarrow\infty)$ , see e.g. [8, 29], and is often ensured by the stronger condition that $X$ is elliptically distributed [33]. (A2) also implies identifiability of $\operatorname{Im}(\gamma)$ by the distribution of $(X,Y)$ .

Lemma 1.

Let ${\cal D},{\cal D}^{\prime}\subset\mathbb{R}^{D}$ , with orthogonal projections $\pi_{{\cal D}},\pi_{{\cal D}^{\prime}}$ defined according to (3), and let $X$ be a random vector such that $\pi_{{\cal D}}(X)$ and $\pi_{{\cal D}^{\prime}}(X)$ are a.s. unique. Let $g:{\cal D}\rightarrow\mathbb{R}$ and $g^{\prime}:{\cal D}^{\prime}\rightarrow\mathbb{R}$ be measurable and injective. If $f=g\circ\pi_{{\cal D}}=g^{\prime}\circ\pi_{{\cal D}^{\prime}}$ , and $\mathbb{E}[X-\pi_{{\cal D}}(X)|\pi_{{\cal D}}(X)]=\mathbb{E}[X-\pi_{{\cal D}^{\prime}}(X)|\pi_{{\cal D}^{\prime}}(X)]$ a.s., then $\pi_{{\cal D}}(X)=\pi_{{\cal D}^{\prime}}(X)$ a.s.

Proof.

Due to the assumption we have

[TABLE]

Since conditioning on an injective function of a random variable is equivalent to conditioning on the random variable itself, we get

[TABLE]

and similarly for $\pi_{{\cal D}}(X)$ . Plugging into (11) the claim follows. ∎

In the linear case (A1) and (A2) imply $\operatorname{Cov}\left({PX,QX|{\cal R}}\right)=0$ for any interval ${\cal R}\subset[0,1]$ , where $P$ is the orthoprojector onto the index space, and $Q=\mathsf{Id}-P$ . For some single- or multi-index model estimators this suffices to ensure the recovery of the index space in the population regime, see e.g. [29]. In the nonlinear case, however, due to curvature we require an additional assumption. Let $t:=\gamma^{-1}\circ\pi_{\gamma}(X)\in{\cal I}$ be the induced random variable and define the mean $\bar{t}_{{\cal R}}:=\mathbb{E}[t|{\cal R}]$ , the tangential projection $P_{{\cal R}}:=\gamma^{\prime}(\bar{t}_{{\cal R}})\gamma^{\prime}(\bar{t}_{{\cal R}})^{\top}$ , and the orthogonal projection $Q_{{\cal R}}:=\mathsf{Id}-P_{{\cal R}}$ , see Figure 2. Furthermore, let ${\cal S}\subset\operatorname{Im}(\gamma)$ be the shortest connected segment with $\mathbb{P}(V\in{\cal S}|{\cal R})=1$ , where $V:=\pi_{\gamma}(X)$ .

(A3)

There exists an absolute constant $C_{W}>0$ such that $\left\|{\operatorname{Cov}\left({Q_{{\cal R}}X,P_{{\cal R}}X\lvert{\cal R}}\right)}\right\|\leq\kappa C_{W}\left|{{\cal S}}\right|^{2}.$

Due to other assumptions, (A3) trivially holds if $\left|{{\cal S}}\right|^{2}$ is replaced by $\left|{{\cal S}}\right|$ , though we need more regularity. Namely, our analysis shows that if $\left|{{\cal S}}\right|^{2}$ is replaced with $\left|{{\cal S}}\right|^{1+\alpha}$ , for some $\alpha\geq 0$ , then approximations of the tangent field are valid only if $\kappa\left|{{\cal S}}\right|^{\alpha}$ falls below a certain threshold. Thus, for $\alpha=0$ this restricts the analysis to only SIMs and curves with small curvature. We select $\alpha=1$ for the sake of notational simplicity, though the results are valid for any $\alpha>0$ .

Our fourth assumption describes the behavior of $X$ orthogonal to the curve.

(A4)

For all $v\in\operatorname{Im}(\operatorname{Cov}\left({X|{\cal R}}\right))\cap\operatorname{Im}(Q_{{\cal R}})$ , with $\left\|{v}\right\|=1$ , we have

[TABLE]

In the nonlinear case an assumption of this form is necessary in order to ensure that the solution of local linear regression aligns with the local tangent vector instead of the local curvature vector. This is also observed numerically: if the variance vanishes, as a function of ${\cal R}$ , a vector close to a local curvature vector can minimize (6). Such an assumption has also been used for multi-index models, see [8, 27, 28], though assuming (A1) and (A2) would suffice in our case to ensure that the linear regression vector (for any conditioning on ${\cal R}\subset[0,1]$ ) is contained in the index space.

The last two assumptions deal with properties of the distribution along the curve $\gamma$ , denoted by $V$ , and with components orthogonal to it, denoted by $W$ .

(A5)

$W:=X-\pi_{\gamma}(X)$ , the component of $X$ orthogonal to $\operatorname{Im}(\gamma)$ , satisfies $\left\|{W}\right\|\leq B<\tau_{\gamma}$ , $W$ -a.s.

An assumption of this type is needed due to the fact that the projection $\pi_{\gamma}(X)$ , and consequently the function $f$ , is not always well defined for $\left\|{W}\right\|\geq\tau_{\gamma}$ . In case of a straight line we have $\tau_{\gamma}=\infty$ , and thus there are no restrictions on $W$ (which reflects standard SIM assumptions). On the other hand, (A5) is a relaxation of standard assumptions in manifold regression, which require samples $X$ to lie on, or very near the manifold, i.e., $\left\|{W}\right\|=0$ or $\left\|{W}\right\|\ll\tau_{\gamma}$ .

Lastly, we assume that the data distribution along the curve does not deviate too much from a uniform distribution. This is used in manifold regression approaches that approximate the manifold by localization and linearization, as it ensures that local pieces are sufficiently well covered, see e.g. [32].

(A6)

For random vectors $V:=\pi_{\gamma}(X)\in\operatorname{Im}(\gamma)$ there exists ${c_{V}}\!>\!0$ such that $c_{V}^{-1}\left|{{\cal S}}\right|\left|{{\cal I}}\right|^{-1}\!<\!\mathbb{P}(V\!\in\!{\cal S})\!<\!{c_{V}}\left|{{\cal S}}\right|\left|{{\cal I}}\right|^{-1}$ holds for any ${\cal S}\!\subset\!\operatorname{Im}(\gamma)$ .

A comparison of assumptions (A1)-(A6) with standard assumptions in the literature, and their implication in case of the SIM, is provided in Table 2.
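To make the assumptions concrete, the following sketch draws samples around an arc of the unit circle (curvature $\kappa=1$, reach $\tau_{\gamma}=1$). The construction is a hypothetical example, not one of the paper's experiments. It is designed so that (A1), (A2), (A5) and (A6) hold by construction, and no claim is made here about (A3) or (A4). The function name and default parameters are assumptions.

```python
import numpy as np

def sample_nsim_arc(N, B=0.3, sigma_eps=0.0, arc_length=np.pi / 2, rng=None):
    """Sample (X, Y, t) around a unit-circle arc gamma(t) = (cos t, sin t)."""
    rng = np.random.default_rng() if rng is None else rng
    t = rng.uniform(0.0, arc_length, size=N)          # (A6): uniform along the curve
    V = np.stack([np.cos(t), np.sin(t)], axis=1)      # V = pi_gamma(X) on the curve
    r = rng.uniform(-B, B, size=N)                    # (A5): |W| <= B < tau = 1
    X = V + r[:, None] * V                            # W = r * normal; E[W | V] = 0 gives (A2)
    eps = rng.uniform(-sigma_eps, sigma_eps, size=N)  # (A1): bounded, independent of X
    Y = t / arc_length + eps                          # g is linear in arc length, hence bi-Lipschitz
    return X, Y, t
```

For this curve the projection $\pi_{\gamma}(X)$ is recovered from the polar angle of $X$, which gives a quick sanity check of the sampler.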

3 Learning localized index vectors

We now begin with the analysis of our estimator by providing guarantees for the estimation of local index vectors in terms of $N$ , the number of samples, and $J$ , the number of level sets. The estimation of local index vectors follows three steps:

Step 1

Partition the $X$ ’s according to a dyadic partitioning of the range $\operatorname{Im}(f)=[0,1]$ (technically, we ought to use ${\cal R}_{1}=[-\sigma_{\varepsilon},J^{-1}]$ , and ${\cal R}_{J}=[(J-1)/J,1+\sigma_{\varepsilon}]$ to account for noise at the boundaries, but for the sake of simplicity we assume $Y$ is thresholded to $[0,1]$ , such that $\left|{{\cal R}_{j}}\right|=J^{-1}$ for all $j$ ),

[TABLE]

Step 2

Estimate local index vectors with $\hat{a}_{j}:=\hat{b}_{j}/\|\hat{b}_{j}\|$ , where $\hat{b}_{j}$ is the solution of (local) linear regression for samples ${\cal X}_{j},\,{\cal Y}_{j}$ ,

[TABLE]

Step 3

Assign index vectors to samples $\{X_{i}:i\in[N]\}$ by setting $\hat{a}(X_{i}):=\hat{a}_{j(X_{i})}$ .
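The three steps can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes equal-width level sets of length $J^{-1}$, responses thresholded to $[0,1]$, and a local regression vector of the form $\hat{b}_{j}=\hat{\Sigma}_{j}^{\dagger}\,\widehat{\operatorname{Cov}}(X,Y|{\cal R}_{j})$ computed with the Moore-Penrose inverse. The function name `local_index_vectors` is hypothetical.

```python
import numpy as np

def local_index_vectors(X, Y, J):
    """Steps 1-3: level-set partition, local least squares, assignment."""
    N, D = X.shape
    Y = np.clip(Y, 0.0, 1.0)                       # threshold responses to [0, 1]
    # Step 1: index of the level set of width 1/J that contains Y_i
    labels = np.minimum((Y * J).astype(int), J - 1)
    A_hat = np.zeros((J, D))
    for j in range(J):
        idx = labels == j
        n_j = int(idx.sum())
        if n_j < 2:
            continue                               # too few samples: keep the zero vector
        Xc = X[idx] - X[idx].mean(axis=0)
        Yc = Y[idx] - Y[idx].mean()
        # Step 2: b_j = Cov(X | R_j)^dagger Cov(X, Y | R_j), then normalize
        cov_xx = Xc.T @ Xc / n_j
        cov_xy = Xc.T @ Yc / n_j
        b = np.linalg.pinv(cov_xx) @ cov_xy
        norm = np.linalg.norm(b)
        if norm > 0:
            A_hat[j] = b / norm
    # Step 3: every sample inherits the vector of its level set
    return A_hat[labels], labels
```

On data from an ordinary SIM with a monotone link, all returned vectors should be close to the single index vector, which is a convenient check before moving to curved examples.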

Denote now the tangent vectors by $a(X):=\gamma^{\prime}(t)$ and $a_{j}:=\gamma^{\prime}(\mathbb{E}[t|{\cal R}_{j}])$ , where $t=\gamma^{-1}\circ\pi_{\gamma}(X)$ . Because of the quantization in Step 3, index vector estimation error can be decomposed as

[TABLE]

where ${\cal S}_{j}$ is the shortest connected piece of $\operatorname{Im}(\gamma)$ such that $\mathbb{P}(V\in{\cal S}_{j}|{\cal R}_{j})=1$ , and $\kappa_{j}$ its curvature bound. Since $\left|{{\cal S}_{j}}\right|\lesssim L_{f}\left|{{\cal R}_{j}}\right|$ as long as $J^{-1}=\left|{{\cal R}_{j}}\right|\gg\sigma_{\varepsilon}$ (by Lemma 11), the second term can be improved by increasing the number of level sets $J$ . On the other hand, for the first term we can prove the following concentration bound.

Theorem 2.

Let $J\in\mathbb{N}$ , $j\in[J]$ , and $u>1$ . Define ${\sigma_{{j},Y}}:=\operatorname{Var}\left({a_{j}^{\top}X,Y|{\cal R}_{j}}\right)(\left|{{\cal S}_{j}}\right|\left|{{\cal R}_{j}}\right|)^{-1}$ . Provided Assumptions (A1) - (A5) hold, there exist constants $C_{N},C_{A},C_{E}>0$ , depending polynomially on $L_{f}$ , ${\kappa}_{j}$ , $B$ , $\left|{{\cal I}}\right|$ , $C_{W}^{*}=(C_{W}\vee 3L_{f}{\sigma_{{j},Y}}\left|{{\cal R}_{j}}\right|)$ , $\sigma_{j,Y}^{-1}$ , $\sigma_{\perp}^{-1}$ , such that whenever

[TABLE]

we have

[TABLE]

The first condition in (15) deals with linearization, and effectively bounds the influence of the cross-covariance term $\left\|{\operatorname{Cov}\left({Q_{{\cal R}}X,P_{{\cal R}}X\lvert{\cal R}}\right)}\right\|$ from (A3). The condition gets easier to satisfy for shorter ${\cal S}_{j}$ , or shorter ${\cal R}_{j}$ . This is in line with the discussion in Section 2, since by isolating shorter segments of $\operatorname{Im}(\gamma)$ , NSIM approaches the SIM, where the condition in (15), and assumption (A4), are trivially satisfied. For a weaker form of (A3), namely $\left\|{\operatorname{Cov}\left({Q_{{\cal R}}X,P_{{\cal R}}X\lvert{\cal R}}\right)}\right\|\leq{\kappa}_{j}C_{W}\left|{{\cal S}_{j}}\right|^{1+\alpha}$ , we obtain the same result with $J^{-\alpha}$ replacing $J^{-1}$ , and $J^{-(1+\alpha)}$ replacing $J^{-2}$ , see Theorem 16.

The second condition in (15) implies that, locally, there is a minimal number of samples needed to ensure that the norm of the linear regression solution $\|\hat{b}_{j}\|$ is bounded from below.

Lastly, we note that $C_{N},C_{A},C_{E}$ are proportional to powers of $\sigma_{j,Y}^{-1}$ , which implies that they are uniformly upper bounded (independent of $j$ ) if ${\sigma_{{j},Y}}$ is uniformly bounded from below. We show in Lemma 14 in the Appendix that this is indeed the case whenever $\left|{{\cal R}_{j}}\right|\gg\sigma_{\varepsilon}$ and $\textrm{Var}(a_{j}^{\top}X,f(X)|{\cal R}_{j})(\left|{{\cal S}_{j}}\right|\left|{{\cal R}_{j}}\right|)^{-1}$ is bounded from below. The latter is satisfied if for example $f\in{\cal C}^{2}$ , see Lemma 14. Due to the bi-Lipschitz property of $g$ , it seems reasonable however that $\textrm{Var}(a_{j}^{\top}X,f(X)|{\cal R}_{j})(\left|{{\cal S}_{j}}\right|\left|{{\cal R}_{j}}\right|)^{-1}$ is bounded from below in more general scenarios. The requirement $\left|{{\cal R}_{j}}\right|\gg\sigma_{\varepsilon}$ , on the other hand, is also observed numerically, precisely because $\textrm{Var}(a_{j}^{\top}X,Y|{\cal R}_{j})$ vanishes as soon as $\left|{{\cal R}_{j}}\right|-\sigma_{\varepsilon}$ is small. This suggests that our analysis correctly identifies the dependency on ${\sigma_{{j},Y}}$ .

Remark 3 (Special cases of Theorem 2).

$\sigma_{\varepsilon}=0$ :

In the noise-free case the lower bound for $J^{-1}$ is removed. Thus, provided $\left|{{\cal X}_{j}}\right|$ is kept constant and $J\asymp N$ , we achieve $\|\hat{a}_{j}-a_{j}\|\asymp N^{-1}$ . This proves an $N^{-1}$ rate for the estimation of the (local) index vector with the ordinary least squares estimator for strictly monotonic link functions.

${\kappa}_{j}=0$ :

If ${\cal R}_{j}$ corresponds to a flat piece of the curve, the first term in (16) vanishes. Thus, $\hat{a}_{j}$ is an unbiased estimator of $a_{j}$ , with convergence rate $N^{-1/2}$ , provided $J$ is kept constant and $N\asymp\left|{{\cal X}_{j}}\right|$ . This result covers the SIM, and our estimation rate matches other results [2, 4, 16].

Recalling decomposition (14), Theorem 2 can now be used to bound $\left\|{\hat{a}(X_{i})-a(X_{i})}\right\|$ for all $i\in[N]$ by invoking a union bound argument over all level sets ${\cal R}_{j}$ , $j\in[J]$ .

Corollary 4.

Let Assumptions (A1) - (A6) hold. Let $u>1$ and assume we have $N$ iid. copies of $(X,Y)$ . Assume we partition the data set into $J$ partitions according to (12), so that

[TABLE]

and $C_{W}^{*}:=(C_{W}\vee 3L_{f}{\sigma_{{J},Y}}J^{-1})$ , and compute local index vectors $\{\hat{a}_{j}:j\in[J]\}$ . There exist constants $C_{N},C_{A},C_{E}>0$ , depending polynomially on $L_{f}$ , ${\kappa}$ , $B$ , $\left|{{\cal I}}\right|$ , $C_{W}$ , $\sigma_{J,Y}^{-1}$ , $\sigma_{\perp}^{-1}$ , such that if

[TABLE]

we have

[TABLE]

Let us make two remarks. First, terms in the bound on the right hand side of (19) can also be written in a local form, i.e., a global curvature bound can be replaced with a curvature bound for a segment around the sample $\pi_{\gamma}(X_{i})$ . Thus, the learning of local index vectors is consistent on locally linear pieces.

Second, (17) and (18) suggest that to optimally balance bias and variance we ought to use $J=C\min\{N/\log^{2}(N),\sigma_{\varepsilon}^{-1}\}$ level sets, where $C>0$ is small enough so that (18) is satisfied. Looking at (19), this implies that there are two regimes.

In the first regime, in order to decrease the error we ought to increase $J$ as long as $J^{-1}\gg\sigma_{\varepsilon}$ , i.e., subdivide the data set into an increasing number of subsets, while keeping the number of samples within each subset roughly constant. The rationale behind this is that further subdividing the data set not only reduces the approximation error (which is caused by the curvature), but it also reduces the variance in the linear regression part of the problem, i.e., when estimating $a_{j(X_{i})}$ by $\hat{a}_{j(X_{i})}$ . In the second regime function noise precludes further decreasing $\left|{{\cal R}_{j}}\right|$ , since we cannot further decrease $\left|{{\cal S}_{j}}\right|$ . In other words, the noise level $\sigma_{\varepsilon}$ imposes a lower bound on $\left|{{\cal R}_{j}}\right|$ , and the bias does not completely vanish.
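The two regimes can be made explicit with a small helper that evaluates the rule $J=C\min\{N/\log^{2}(N),\sigma_{\varepsilon}^{-1}\}$. The helper is a sketch: the function name, the default $C=1$, and the rounding to an integer are assumptions, and in practice $C$ must be small enough for (18) to hold.

```python
import math

def num_level_sets(N, sigma_eps, C=1.0):
    """J = C * min{N / log^2(N), 1 / sigma_eps}, rounded down, at least 1 (N >= 2)."""
    sample_limit = N / math.log(N) ** 2            # first regime: limited by sample size
    noise_limit = math.inf if sigma_eps == 0 else 1.0 / sigma_eps  # second regime
    return max(1, int(C * min(sample_limit, noise_limit)))
```

For instance, with $N=10^{4}$ the noise-free rule is governed by the sample size, whereas for $\sigma_{\varepsilon}=0.25$ the noise level caps the number of level sets.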

Note also that in (19), compared to (16), we lose an order in $J^{-1}$ , i.e., in the interval length. This is due to the use of quantization to approximate the entire tangent field over the respective level set. This could be improved by learning a separate tangent for each sample $X_{i}$ from a level set centred around $X_{i}$ , but the second term in (19) prohibits achieving $J^{-2}$ overall.

4 Function estimation

In this section we use the guarantees on local index vector estimation to establish function estimation guarantees. We recall that the estimator (9) predicts an output by averaging the responses of $\{(X_{i(x)},Y_{i(x)}):i\in[k]\}$ , the $k$ closest samples with respect to the metric $\Delta_{\eta}({x},{\cdot})$ . This makes the analysis challenging because the same data is used twice: first for estimating the geometry and then for predicting the function. As a result, random variables $\left\{\varepsilon_{i(x)}:i\in[k]\right\}$ become statistically dependent and their finite sample average could be biased.

To avoid this technical issue, we split the given data set (consisting of $2N$ samples) in two halves (reducing the effective sample size only by a factor of $1/2$ ) and use the first half, $\{(X_{i},Y_{i}):i\in[N]\}$ , for approximating the geometry, and the second half, $\{(X^{\prime}_{\ell},Y_{\ell}^{\prime}):\ell\in[N]\}$ , for function prediction. We then extend the tangent field approximation through nearest neighbors, defining $\hat{a}(X_{\ell}^{\prime}):=\hat{a}(X_{i^{*}})$ , where $i^{*}:=\operatorname*{argmin}_{i\in[N]}\Delta_{\eta}({X_{\ell}^{\prime}},{X_{i}})$ . The prediction of $f(x)$ is then given by averaging the responses, $Y^{\prime}_{\ell(x)},\,\ell\in[k]$ , of the $k$ closest samples with respect to $\Delta_{\eta}({x},{\cdot})$ from $\{X_{\ell}^{\prime}:\ell\in[N]\}$ , see Algorithm 2. Thus, random variables $\varepsilon_{\ell}^{\prime}$ are not used in the selection of the $k$ closest neighbors of $x$ and we preserve unbiased finite sample averages, i.e., $\mathbb{E}\varepsilon_{\ell(x)}^{\prime}=\mathbb{E}\varepsilon=0$ .
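The splitting and the nearest-neighbor extension of the tangent field can be sketched as below. Algorithm 2 itself is not reproduced here. The sketch assumes the proxy metric has the form $|\langle\hat{a}(X_{i}),X_{\ell}^{\prime}-X_{i}\rangle|$ restricted to Euclidean radius $\eta$, and the function names are hypothetical.

```python
import numpy as np

def split_halves(X, Y, rng):
    """Randomly split 2N samples into a geometry half and a prediction half."""
    perm = rng.permutation(len(X))
    half = len(X) // 2
    geo, pred = perm[:half], perm[half:2 * half]
    return (X[geo], Y[geo]), (X[pred], Y[pred])

def extend_tangents(X_new, X, A_hat, eta=np.inf):
    """Set a_hat(X'_l) := a_hat(X_{i*}), where i* minimizes the (assumed) proxy
    metric |<a_hat(X_i), X'_l - X_i>| over first-half samples within radius eta.
    If no sample lies within eta, argmin falls back to index 0."""
    diff = X_new[:, None, :] - X[None, :, :]       # shape (N', N, D)
    dist = np.abs(np.einsum("lnd,nd->ln", diff, A_hat))
    dist[np.linalg.norm(diff, axis=2) > eta] = np.inf
    return A_hat[np.argmin(dist, axis=1)]
```

The pairwise array `diff` needs memory of order $N^{2}D$, so for large $N$ the second-half samples would have to be processed in batches.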

We split our analysis in two parts. The first concerns the case when $\gamma$ is close to an affine space (see Definition 5), and we call it a perturbed single index model. The second part extends the analysis to general curves $\gamma$ . The reason for treating the first case separately is that we can achieve theoretical guarantees even without restricting the search space of nearest neighbors, i.e., by setting $\eta=\infty$ . Furthermore, numerical experiments in Section 5 suggest that perturbed SIMs fit well to several data sets that were previously used as benchmarks for the SIM.

4.1 Function estimation for perturbed single index models

We begin by defining the notion of almost linearity that is used to quantify the deviation of the true model from an ordinary SIM, respectively, of the curve $\gamma$ from a straight line.

Definition 5.

Let $\mathfrak{I}$ be an interval and $\gamma:\mathfrak{I}\rightarrow\mathbb{R}^{D}$ an arc-length parametrized ${\cal C}^{1}(\mathfrak{I})$ curve. Let $0<\theta\leq 1$ . We say $\gamma$ is $\theta$ -almost linear if $\left<{\gamma^{\prime}(t)},{\gamma^{\prime}(s)}\right>>\theta$ for all $t,s\in\mathfrak{I}$ .
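For a curve that is only available through sampled tangents, the definition suggests a simple numerical check: the smallest pairwise inner product of unit tangents is the supremum of all admissible $\theta$, up to discretization. The snippet below is an illustrative sketch with a hypothetical function name, evaluated on a circular arc.

```python
import numpy as np

def almost_linearity(tangents):
    """Smallest pairwise inner product min_{t,s} <gamma'(t), gamma'(s)> of the
    (normalized) sampled tangents; gamma is theta-almost linear for theta below it."""
    T = tangents / np.linalg.norm(tangents, axis=1, keepdims=True)
    return float((T @ T.T).min())

# Example: arc gamma(t) = (cos t, sin t), t in [0, pi/3], unit tangent (-sin t, cos t)
t = np.linspace(0.0, np.pi / 3, 200)
arc = np.stack([-np.sin(t), np.cos(t)], axis=1)
theta_max = almost_linearity(arc)   # cos(pi/3) = 0.5 up to rounding
```

Since $\langle\gamma^{\prime}(t),\gamma^{\prime}(s)\rangle=\cos(t-s)$ on the unit circle, the arc of length $\pi/3$ is $\theta$-almost linear for every $\theta<1/2$, while a straight line gives the value $1$.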

Definition 5 implies that if $\theta$ is close to $1$ then $\gamma$ is close to a straight line. Furthermore, the Euclidean distance approximates the geodesic distance well, i.e. $\left\|{v-v^{\prime}}\right\|\asymp d_{\gamma}(v,v^{\prime})$ for any $v,v^{\prime}\in\operatorname{Im}(\gamma)$ , which allows us to prove an equivalence between the (unrestricted) proxy metric $\Delta_{\infty}({x},{\cdot})$ and $d_{\gamma}(x,\cdot)$ .

Proposition 6.

Assume $\gamma$ is $\theta$ -almost linear for some $\theta>\kappa B$ . Let $\{\bar{x}_{i}:i\in[N]\}\subset\textrm{supp}(\rho_{X})$ , and $\{\hat{a}(\bar{x}_{i}):i\in[N]\}\subset\mathbb{S}^{D-1}$ be arbitrary sets. Let $x\in\textrm{supp}(\rho_{X})$ . If $\bar{x}_{k(x)}$ is the $k$ -th closest sample, based on $\Delta_{\infty}({x},{\cdot})$ , and $\bar{x}_{k^{*}(x)}$ the $k$ -th closest sample, based on $d_{\gamma}(x,\cdot)$ , we have

[TABLE]

Note that curvature and reach of a curve $\gamma$ always satisfy $\kappa\tau_{\gamma}\leq 1$ . This means that $\kappa B<1$ is trivially satisfied, since $B<\tau_{\gamma}$ by (A5). Thus, the requirement $\theta>\kappa B$ is driven by linearization, namely, by the fact that we are approximating the geodesic geometry of samples projected onto a curved space with a linear geometry of samples projected onto its linearization.

To show guarantees for function estimation we first need to derive bounds on the tangent field $\max_{\ell\in[N]}\left\|{\hat{a}(X_{\ell}^{\prime})-a(X_{\ell}^{\prime})}\right\|$ from bounds on $\max_{i\in[N]}\left\|{\hat{a}(X_{i})-a(X_{i})}\right\|$ , given by Corollary 4. Using Proposition 6 with sets $\{X_{i}:i\in[N]\}$ and $\{\hat{a}(X_{i}):i\in[N]\}$ , for all $\ell\in[N]$ we have

[TABLE]

where $X_{1^{*}(X_{\ell}^{\prime})}$ is the sample closest to $X_{\ell}^{\prime}$ with respect to the geodesic distance. We can now state the main result for function estimation.

Theorem 7.

Assume (A1) - (A6). Let $\eta=\infty$ and assume that $\gamma$ is $\theta$ -almost linear for some $\theta>\kappa B$ . Whenever $N,J$ satisfy the conditions of Corollary 4, we have for arbitrary $x\in\textrm{supp}(\rho_{X})$ and $1<u<N$

[TABLE]

with probability at least $1-\exp(-u)$ , where $C_{A},C_{E}>0$ are constants from Corollary 4, $C>0$ is an absolute constant and $C_{B}=2L_{f}(2\vee(\left|{{\cal I}}\right|+2B))\left(1+{\kappa}(2\vee(\left|{{\cal I}}\right|+2B))\right)$ .

Proof.

We first decompose the left-hand side of (22) as

[TABLE]

The first term is a sum of independent copies of $\varepsilon$ . Since $\left|{\varepsilon}\right|\leq\sigma_{\varepsilon}$ almost surely, and $\mathbb{E}\varepsilon=0$ , Hoeffding’s inequality for bounded random variables gives, for an absolute constant $C>0$

[TABLE]

Assume now $\{\pi_{\gamma}(X_{\ell}^{\prime}):\ell\in[N]\}$ and $\{\pi_{\gamma}(X_{i}):i\in[N]\}$ are $\delta$ -nets for $\operatorname{Im}(\gamma)$ with respect to $d_{\gamma}$ . We can use the Lipschitz property of $g$ and apply Proposition 6 to bound the second term as

[TABLE]

Using (21) and $d_{\gamma}(X_{\ell}^{\prime},X_{1^{*}(X_{\ell}^{\prime})})\leq\delta\leq\delta k$ we get

[TABLE]

Lemma 20 gives that $\{\pi_{\gamma}(X_{\ell}^{\prime}):\ell\in[N]\}$ and $\{\pi_{\gamma}(X_{i}):i\in[N]\}$ are $\delta$ -nets for $\delta=\left|{{\cal I}}\right|u\left({c_{V}}N\right)^{-1}$ with probability at least $1-2\exp(-u)$ . The claim then follows by Corollary 4. ∎

Theorem 7 reveals that the error in function estimation originates from three sources. The first term accounts for the averaging of the noise, which is incurred by responses $Y_{\ell}^{\prime}$ . Using $k={\cal O}(N^{2/3})$ , as is standard for Lipschitz-smooth functions, it decays at a rate $N^{-1/3}$ . The second term bounds the geodesic distance to the nearest neighbor, and comes from the covering of the curve by the projected samples. The last two terms are from the approximation of the geodesic metric with the proxy metric $\Delta_{\infty}({x},{\cdot})$ through tangent approximations $\{\hat{a}(X_{i}):i\in[N]\}$ , and behave according to Corollary 4. Setting $k={\cal O}(N^{2/3})$ and $J=C\min\{N/\log^{2}(N),\sigma_{\varepsilon}^{-1}\}$ , as in Section 3, yields
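The choice $k={\cal O}(N^{2/3})$ follows from a short balancing argument. The display below is a sketch under the assumption, suggested by the proof of Theorem 7, that the noise term scales like $\sigma_{\varepsilon}\sqrt{u/k}$ (Hoeffding) and the covering term like $L_{f}k\delta$ with $\delta=\left|{{\cal I}}\right|u(c_{V}N)^{-1}$. All quantities other than $k$ and $N$ are treated as constants, and the exact form of (22) is not reproduced.

```latex
\sigma_{\varepsilon}\sqrt{\frac{u}{k}}
\;\asymp\; L_{f}\,k\,\frac{\left|\mathcal{I}\right|u}{c_{V}N}
\quad\Longrightarrow\quad k^{3/2}\asymp N
\quad\Longrightarrow\quad k\asymp N^{2/3},
\qquad\text{so that}\qquad
k^{-1/2}\;\asymp\;\frac{k}{N}\;\asymp\;N^{-1/3}.
```

Both terms then decay at the same $N^{-1/3}$ rate, which is the rate quoted for Lipschitz-smooth functions.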

[TABLE]

We see that the estimator is generally biased since the error tends to ${\kappa}/(\theta-{\kappa}B)^{2}\sigma_{\varepsilon}$ for $N\rightarrow\infty$ .

Remark 8 (Special cases of Theorem 7).

$\sigma_{\varepsilon}=0$ :

In the noise-free case the first term in (22) vanishes, and thus choosing $k=1$ , and $J=CN/\log(N)^{2}$ with $C$ small enough so that (18) holds, we have

[TABLE]

Up to logarithmic factors, this matches the optimal rate for noise-free estimation of Lipschitz functions, see [21].

${\kappa}=0$ :

If the model follows the ordinary SIM, the second term in (23) vanishes. Thus we achieve an $N^{-1/3}$ rate, which is optimal for Lipschitz smooth functions [13].

Before moving to general curves, let us remark why achieving consistent estimation is a challenging task in the noisy, nonlinear case. The presented estimator is based on localization and linearization, where localization hinges on the fact that conditional distributions $(X,Y)|{\cal R}_{j}$ are increasingly SIM-like when reducing the level set width $\left|{{\cal R}_{j}}\right|=J^{-1}$ . This reduces the effects of curvature and linearization becomes increasingly accurate. On the other hand, relating the width of ${\cal R}_{j}$ with the length of the corresponding segment ${\cal S}_{j}$ of the curve, as $\left|{{\cal R}_{j}}\right|\asymp\left|{{\cal S}_{j}}\right|$ , is by Lemma 11 valid only if $\left|{{\cal R}_{j}}\right|>2\sigma_{\varepsilon}$ . Namely, reducing ${\cal R}_{j}$ beyond that threshold does not reduce $\left|{{\cal S}_{j}}\right|$ , i.e., $X\lvert{\cal R}_{j}$ does not become more SIM-like. This predicament cannot be further improved under our noise model.

Results in this section imply that having a consistent estimator of the tangent field of $\operatorname{Im}(\gamma)$ , whose sample complexity does not depend exponentially on $D$ , is sufficient to construct a consistent estimator for $f$ , with a similar sample complexity. At the same time, a consistent, low-complexity estimator of $f$ can be used to estimate the tangent field, by approximating $\nabla f$ through finite sample differences. This suggests a certain equivalence between estimating $f$ and estimating the tangent field of $\operatorname{Im}(\gamma)$ , and to some extent the manifold $\operatorname{Im}(\gamma)$ itself.

Minimax rates for estimating a manifold from $N$ samples $\{X_{i}:i\in[N]\}$ that are spread around it have been extensively studied in [11, 12]. Moreover, in [12] the authors provide a theoretical estimator that converges at a $(\log(N)/N)^{2/(2+d)}$ rate (measured in the Hausdorff distance), where $d$ is the dimensionality of the manifold. However, they emphasize that the estimator is not practical and pose the development of a practical alternative as an important open problem. To the best of our knowledge, this problem still has not been solved.

4.2 Extension to general curves

In the general case the unrestricted proxy metric $\Delta_{\infty}({x},{\cdot})$ is not equivalent to the geodesic metric $d_{\gamma}(x,\cdot)$ , and thus cannot be used to reliably select nearest neighbors. To better illustrate this point, let $\gamma$ be a segment of the unit circle that contains two antipodal points $\pi_{\gamma}(x)$ and $\pi_{\gamma}(x^{\prime})$ , and assume we have access to the true tangents $a(x)$ , $a(x^{\prime})$ , so that $a(x)=-a(x^{\prime})$ . Thus, on one hand we have $d_{\gamma}(x,x^{\prime})=\pi$ , and on the other $\Delta_{\infty}({x},{x^{\prime}})=\Delta_{\infty}({x^{\prime}},{x})=0$ since $a(x)\perp x^{\prime}-x$ .
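The unit circle example can be verified numerically. The snippet below is an illustration that assumes the unrestricted proxy metric has the form $\Delta_{\infty}(x,x^{\prime})=|\langle a(x^{\prime}),x-x^{\prime}\rangle|$ and places the two samples directly on the curve. The helper name is hypothetical.

```python
import numpy as np

def circle_point_and_tangent(t):
    """Arc-length parametrization of the unit circle and its unit tangent."""
    return np.array([np.cos(t), np.sin(t)]), np.array([-np.sin(t), np.cos(t)])

x, a_x = circle_point_and_tangent(0.0)        # x  = (1, 0),  a(x)  = (0, 1)
xp, a_xp = circle_point_and_tangent(np.pi)    # x' = (-1, 0), a(x') = (0, -1)

geodesic = np.pi                              # |t - t'| for an arc containing both points
proxy_x_to_xp = abs(a_xp @ (x - xp))          # assumed form |<a(x'), x - x'>|
proxy_xp_to_x = abs(a_x @ (xp - x))
euclidean = np.linalg.norm(x - xp)            # the antipodal points are at distance 2
```

Both proxy distances vanish up to floating point error although the geodesic distance is $\pi$. The Euclidean distance of $2$ also shows why a radius $\eta<2$ already removes this spurious neighbor.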

To avoid this and establish an equivalence between $d_{\gamma}(x,\cdot)$ and $\Delta_{\eta}({x},{\cdot})$ , similar to Proposition 6, we thus have to restrict the search space. Considering the unit circle example, we ought to choose $\eta>0$ that ensures there are no two points $x,x^{\prime}$ , such that $d_{\gamma}(\pi_{\gamma}(x),\pi_{\gamma}(x^{\prime}))\gg 0$ , but $\left\|{x-x^{\prime}}\right\|\leq\eta$ and $\left|{a(x)^{\top}(x-x^{\prime})}\right|=0=\left|{a(x^{\prime})^{\top}(x^{\prime}-x)}\right|$ . It can be shown that this is satisfied for $\eta<2(\tau_{\gamma}-B)$ , provided assumption (A5) holds, see Figure 3 and Lemma 22. On the other hand, $\eta$ needs to be large enough to ensure there are enough samples within ${\cal B}_{\left\|{\cdot}\right\|}(x,\eta)$ , with respect to $N$ , to achieve optimal function prediction rates. Under the uniformity assumption (A6), this is ensured whenever $\eta>2B$ .

Balancing these two demands we get $2B<\eta<2(\tau_{\gamma}-B)$ and thus $B<\tau_{\gamma}/2$ . Therefore, we require a more restrictive version of (A5). To compensate for errors in tangent approximations we further impose $\eta<\tau_{\gamma}$ . This allows us to prove a guarantee for metric equivalence.
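The admissible range for $\eta$ can be written as a small check. This is a sketch with a hypothetical function name. It combines the constraints $2B<\eta<2(\tau_{\gamma}-B)$ and $\eta<\tau_{\gamma}$ stated above, and in practice the reach $\tau_{\gamma}$ is usually unknown and would have to be estimated or guessed.

```python
def eta_interval(B, tau):
    """Return the open interval (lower, upper) of admissible radii eta,
    or None if B >= tau / 2, in which case the interval is empty."""
    if not 0 <= B < tau / 2:
        return None                      # stricter version of (A5) is violated
    upper = min(2 * (tau - B), tau)      # equals tau whenever B < tau / 2
    return (2 * B, upper)
```

For example, `eta_interval(0.25, 1.0)` gives `(0.5, 1.0)`, whereas `B = 0.5` with `tau = 1.0` leaves no admissible radius.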

Proposition 9.

Assume (A5) for $B=(1/2-q)\tau_{\gamma}$ for some $q>0$ , and choose any $\eta\in(2B,\tau_{\gamma})$ . Let $\{\bar{x}_{i}:i\in[N]\}\subset\textrm{supp}(\rho_{X})$ , and $\{\hat{a}(\bar{x}_{i}):i\in[N]\}\subset\mathbb{S}^{D-1}$ be arbitrary sets. For an arbitrary $x\in\textrm{supp}(\rho_{X})$ let $\bar{x}_{k(x)}$ be its $k$ -th closest sample based on $\Delta_{\eta}({x},{\cdot})$ , and $\bar{x}_{k^{*}(x)}$ be its $k$ -th closest sample based on $d_{\gamma}(x,\cdot)$ . Whenever $\{\pi_{\gamma}(X_{i}):i\in[N]\}$ forms a $\delta$ -net on $\operatorname{Im}(\gamma)$ , and

[TABLE]

we have

[TABLE]

Covering the manifold $\operatorname{Im}(\gamma)$ with a sufficiently fine $\delta$ -net $\{\pi_{\gamma}(X_{i}):i\in[N]\}$ , and condition (25), are satisfied with high probability as soon as $N$ is sufficiently large, due to Corollary 4 and Lemma 20, respectively. In that case, Theorem 7 holds also for general curves, by simply replacing Proposition 6 with Proposition 9 in the proof. Since for $N\rightarrow\infty$ the term $\max_{i\in[N]}\left\|{\hat{a}(\bar{X}_{i})-a(\bar{X}_{i})}\right\|$ converges to ${\cal O}({\kappa}\sigma_{\varepsilon})$ by Corollary 4, we are ensured to enter the valid regime whenever the noise $\sigma_{\varepsilon}$ is small enough compared to $q$ (in particular in the noise-free case).

Proposition 10.

Assume (A1) - (A6), and the conditions of Proposition 9 hold. Let $\eta\in(2B,\tau_{\gamma})$ . Whenever $N$ , $J$ satisfy the conditions of Corollary 4, we have for arbitrary $x\in\textrm{supp}(\rho_{X})$ and $1<u<N$

[TABLE]

with probability at least $1-\exp(-u)$ , where $C_{A},C_{E}>0$ are constants from Corollary 4, $C>0$ is an absolute constant and $C_{B}=32L_{f}(2\vee(\left|{{\cal I}}\right|+2B)\vee\tau_{\gamma})\left(1+{\kappa}(2\vee(\left|{{\cal I}}\right|+2B)\vee\tau_{\gamma})\right)$ .

5 Numerical Experiments

In this section, we present experimental results of the proposed estimator in two settings. First, we conduct synthetic experiments to validate theoretical results of Sections 3 and 4. Second, we benchmark the estimator against commonly used methods on a selection of real-world data sets. The source code for Algorithm 1 and synthetic experiments is available at https://github.com/soply/nsim_algorithm. Moreover, real-world data sets, code for their preprocessing, and implementations of competing estimators (or references, if publicly available source code is used) are readily available at https://github.com/soply/local_sim_experiments.

5.1 Experiments with synthetic data

General setup.

We consider the following three curves

[TABLE]

and embed them into $\mathbb{R}^{D}$ for $D\in\{4,8,12\}$ . We set $X=V+F(V)U$ , where $V$ is sampled uniformly on $\operatorname{Im}(\gamma)$ , $U$ is sampled uniformly on ${\cal B}_{\left\|{\cdot}\right\|}(0,0.25)$ , and the columns of $F(V)\in\mathbb{R}^{D\times(D-1)}$ form an orthonormal basis for the normal space of $\operatorname{Im}(\gamma)$ at $V$ . Examples of such marginal distributions are illustrated in the top row of Figure 4. The target function $g\circ\gamma^{-1}$ is a strictly monotonic, piecewise quadratic polynomial. We set $Y=g\circ\gamma^{-1}(\pi_{\gamma}(X))+\varepsilon$ with $\varepsilon\sim\operatorname{Uni}({[-\sigma_{\varepsilon},\sigma_{\varepsilon}]})$ . Different noise levels are used: $\sigma_{\varepsilon}=c\Delta f$ with $c\in\{0\}\cup\{10^{-\ell}:\ell=1,\ldots,4\}$ and $\Delta f:=(\max_{i}f(X_{i})-\min_{i}f(X_{i}))\left|{{\cal I}}\right|^{-1}$ .
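The sampling scheme $X=V+F(V)U$ can be sketched for the simplest geometry, a straight segment along the first coordinate axis, where the normal space is spanned by $e_{2},\ldots,e_{D}$. The sketch below makes several assumptions: the segment has unit length, the link function `link` is a hypothetical strictly increasing piecewise quadratic (not the one used in the experiments), and all function names are illustrative.

```python
import math
import random

def sample_uniform_ball(dim, radius, rng):
    """Uniform sample from the Euclidean ball of the given radius in R^dim."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in direction))
    r = radius * rng.random() ** (1.0 / dim)   # radial law of the uniform ball
    return [r * v / norm for v in direction]

def link(t):
    """Hypothetical strictly increasing, piecewise quadratic link on [0, 1]."""
    if t < 0.5:
        return t + t * t
    s = t - 0.5
    return 0.75 + 2.0 * s - s * s

def sample_line_problem(n, D, sigma_eps, rng, radius=0.25):
    """Draw n pairs (X, Y) with X = V + F(V) U around the segment t -> t * e_1."""
    samples = []
    for _ in range(n):
        t = rng.uniform(0.0, 1.0)                     # V = t * e_1, uniform on the curve
        u = sample_uniform_ball(D - 1, radius, rng)   # coordinates in the normal space
        x = [t] + u                                   # F(V) = [e_2, ..., e_D] for a line
        y = link(t) + rng.uniform(-sigma_eps, sigma_eps)
        samples.append((x, y))
    return samples

rng = random.Random(0)
data = sample_line_problem(n=200, D=4, sigma_eps=0.0, rng=rng)
```

For curved geometries the only change is that $F(V)$ depends on $V$, so an orthonormal basis of the normal space has to be computed at each sampled point.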

Parameter selection for the NSIM estimator is guided by Section 4. Namely, we use $k=1$ and $J=(15D)^{-1}N$ if $\sigma_{\varepsilon}=0$ , and $k=\frac{1}{2}N^{2/3}$ with cross-validation over $J\in\{2^{\ell}:\ell\in[13]\}$ in the noisy case. Furthermore, the restricting radius for the nearest neighbor search is $\eta=0.5$ . We also train an ordinary kNN-regressor with $k=1$ in the noise-free case, and $k=\frac{1}{2}N^{2/3}$ in the noisy case, to demonstrate that in these problems ordinary kNN-regression indeed suffers from the curse of dimensionality.
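The parameter choices above can be written as a small helper. This is a sketch under assumptions: the function name `nsim_parameters` is hypothetical, and rounding to the nearest integer (with a minimum of one) is our choice, since the text does not specify how non-integer values are handled.

```python
def nsim_parameters(N, D, noise_free):
    """Return (k, candidate values of J) following the stated choices."""
    if noise_free:
        k = 1
        # J = N / (15 D), rounded (rounding is an assumption)
        J_candidates = [max(1, round(N / (15 * D)))]
    else:
        # k = N^{2/3} / 2, rounded (rounding is an assumption)
        k = max(1, round(0.5 * N ** (2.0 / 3.0)))
        # J is cross-validated over 2^l for l = 1, ..., 13
        J_candidates = [2 ** l for l in range(1, 14)]
    return k, J_candidates

k_clean, J_clean = nsim_parameters(6000, 4, noise_free=True)
k_noisy, J_noisy = nsim_parameters(1000, 4, noise_free=False)
```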

For evaluating the NSIM estimator, we report the root mean squared errors (RMSE)

[TABLE]

where $\{Z_{m}:m\in[1000]\}$ are test samples drawn i.i.d. from $\rho_{X}$ , and $J=J(N)$ is chosen as described above. The results are averaged over 20 repetitions of the same experiment. The standard deviation is indicated by vertical bars.

Discussion.

The results of our studies are presented in Figure 5. In red plots, which correspond to cases with $\sigma_{\varepsilon}=0$ , we observe an $N^{-1}$ decay of the function error (Figures 5a - 5c), and similarly an $N^{-1}$ decay of the tangent field error (Figures 5d - 5f). In particular, the ambient dimension $D$ affects the error only in terms of a multiplicative constant but not in the rate of decay. Therefore, the NSIM estimator does not suffer from the curse of dimensionality, which is not the case for ordinary kNN-regression as shown in Figures 5g - 5i.

The remaining plots in Figure 5 represent noisy cases, where the highest noise level corresponds to blue lines. Considering the first column, where $\operatorname{Im}(\gamma)$ is a straight line, and therefore the data follows an ordinary SIM, we see that the error for function and index vector estimation steadily decreases at an $N^{-1/3}$ rate. This confirms our theoretical result, i.e., the NSIM estimator is consistent, and achieves the optimal rate, in the case of an ordinary SIM. If, on the other hand, we have a curved geometry and function noise, errors for function prediction and tangent field estimation stall after reaching a certain quality. This can be seen e.g. in the blue plots in Figures 5b and 5c.

We remark that the estimators used for comparison on real data sets in the next section have been tested on these synthetic problems as well. We omit the corresponding results because none of them show any improvement as the sample size $N$ increases (apart from SIM estimators on the Line problem). This is expected for SIM estimators because they cannot resolve the underlying nonlinear geometry during training.

5.2 Real data

We will now test the NSIM algorithm and compare it to other commonly used algorithms on a variety of real-world data sets. We report the mean RMSE and its standard deviation over 30 repetitions of each experiment. In each run, we use $15\%$ of the data as the test set, and we tune hyper-parameters for each estimator using 5-fold cross-validation on exhaustive parameter grids.
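The evaluation protocol can be sketched as a repeated random hold-out. The code below is an illustrative outline, not the published experiment code: `repeated_holdout` and `mean_predictor` are hypothetical names, the splits are plain random shuffles (an assumption), and the hyper-parameter search by 5-fold cross-validation is assumed to happen inside the user-supplied `fit_predict` callable.

```python
import math
import random

def rmse(predictions, targets):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets))

def repeated_holdout(X, y, fit_predict, repetitions=30, test_fraction=0.15, seed=0):
    """Mean and standard deviation of the test RMSE over random splits.

    fit_predict(X_train, y_train, X_test) is a user-supplied callable; any
    hyper-parameter tuning (e.g. 5-fold cross-validation) would live inside it.
    """
    rng = random.Random(seed)
    n = len(y)
    n_test = max(1, round(test_fraction * n))
    errors = []
    for _ in range(repetitions):
        indices = list(range(n))
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        predictions = fit_predict([X[i] for i in train_idx],
                                  [y[i] for i in train_idx],
                                  [X[i] for i in test_idx])
        errors.append(rmse(predictions, [y[i] for i in test_idx]))
    mean = sum(errors) / len(errors)
    std = math.sqrt(sum((e - mean) ** 2 for e in errors) / len(errors))
    return mean, std

def mean_predictor(X_train, y_train, X_test):
    """Trivial baseline used only to exercise the protocol."""
    average = sum(y_train) / len(y_train)
    return [average for _ in X_test]

X_demo = [[float(i)] for i in range(100)]
y_demo = [2.0 * row[0] for row in X_demo]
mean_rmse, std_rmse = repeated_holdout(X_demo, y_demo, mean_predictor)
```

Any estimator from the list below can be plugged in through `fit_predict`, which keeps the splitting logic identical across methods.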

Data sets.

We use 6 UCI data sets (Air Quality, Boston Housing, Concrete, Istanbul Stock Exchange, Skillcraft1, Yacht) and the Ames Housing data set in our study. For each data set, the components of $X$ are standardized and we exclude clearly irrelevant features. Moreover, if the marginal of $\tilde{Y}=\log(Y)$ resembles the uniform distribution better (compared to $Y$ ), we use $\tilde{Y}$ instead of $Y$ . The preprocessed data sets are readily available at https://github.com/soply/db_hand.

Estimators.

• NSIM-dyad and NSIM-stat refer to Algorithm 1 using a dyadic partition and statistically equivalent blocks, respectively. The parameters $k$ and $J$ are chosen via cross-validation. The radius of the intersecting Euclidean ball is set to $\eta=\infty$ .

• Lin-Reg and kNN are standard linear regression and kNN-regression.

• SIR-kNN uses sliced inverse regression [29] to find an index vector $a$ , and then kNN on projected samples $(a^{\top}X,Y)$ . Replacing SIR by SAVE [8] uniformly worsens the results.

• Isotron [20] iteratively fits the link function $g$ using isotonic regression [39] on projected samples $(a^{\top}X,Y)$ , and then updates the index vector $a$ . The iteration is initialized with $a=0$ and stopped when the validation error stalls on a hold-out set.

• ELM-Sig [17] is a shallow neural network with sigmoid activation where inner biases and weights are randomly sampled, and only the outer layer is trained on data. This can be done by solving a simple linear system, which makes the algorithm very efficient. We also tested the hyperbolic tangent activation function, but the results were uniformly worse.

• SNN-Tan and SNN-Sig are standard shallow neural networks with hyperbolic tangent and sigmoid activation functions, respectively. We train them using stochastic gradient descent (learning rate $0.01$ ), and stop the iteration when the validation error stalls on an inner validation set. As for ELM, we use 5-fold cross-validation for the number of hidden nodes.

Discussion.

The results are presented in Table 3. It is helpful to divide these estimators into two groups. The first group consists of simple estimators (kNN and linear regression) and of estimators that use a reduced ( $1D$ ) representation of the data (NSIM, SIR and Isotron). The second group are shallow neural networks which search for an estimator in a considerably richer class of functions. Among the first group, NSIM variants achieve very convincing results as they always belong to the best performing group of estimators. Moreover, experiments suggest that our approach adapts well to the complexity of a given data set. For example, on a data set where linear regression performs best (Istanbul), NSIM achieves roughly the same performance, and automatically chooses (most of the time) $J=1$ . On the other hand, for the Concrete data set, where all models that use a single index vector perform rather poorly, the added model flexibility of the NSIM approach proves beneficial, and we achieve the same performance as kNN, despite reducing the dimensionality. This is not the case for SIR-kNN and Isotron, both of which use a linear $1D$ projection. Finally, on Air Quality and Yacht, NSIM-stat achieves superior performance while leveraging the enhanced model flexibility with $J\approx 5$ level sets.

Estimators in the second group enjoy greater model flexibility, but are at the same time more prone to overfitting. For data sets with a lot of samples (Air Quality, Concrete, and Skillcraft), these methods are better than the estimators in the first group. On the other hand, for data sets with smaller sample sizes (Istanbul and Yacht), the model cannot be fitted easily, and we observe exactly the opposite effect. Considering the results for the Ames data set, all estimators perform roughly the same.

Interpretability.

An important feature of the SIM is its interpretability, because the recovered index vector describes the relationship between each feature and the response $Y$ . Namely, the $i$ -th entry of the index vector $\hat{a}$ should have a large magnitude if the corresponding feature has a strong influence on $Y$ (relative to other features), and its sign indicates if the feature increases or decreases $Y$ (when keeping other entries fixed). NSIM retains these properties and allows for a more refined analysis, since it considers conditional distributions, $X|{\cal R}_{j}$ , for different ranges of the response. By inspecting and comparing local index vectors we can thus analyze whether the influence of features changes across different regimes.

To that end, we propose to study off-diagonal entries of Gram matrices, $G\in\mathbb{R}^{J\times J}$ , where $G_{ij}=\hat{a}_{i}^{\top}\hat{a}_{j}$ , after fitting the model for a range of $J$ ’s. If $G_{ij}\approx 1$ everywhere, and for all $J$ , then the model most likely follows the traditional, monotone SIM. On the other hand, if roughly $G_{ij}\asymp\left|{i-j}\right|^{-1}$ , then local index vectors indeed vary, with certain regularity, as a function of $Y$ .
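The diagnostic can be sketched in a few lines. The example below is illustrative only: the local index vectors are hypothetical (tangents of a quarter circle for $J=5$ level sets), and summarizing $G$ by the average entry at each lag $\left|{i-j}\right|$ is our own choice of summary, not a step prescribed by the method.

```python
import math

def gram_matrix(index_vectors):
    """G[i][j] = <a_i, a_j> for unit-norm local index vectors a_1, ..., a_J."""
    return [[sum(p * q for p, q in zip(a_i, a_j)) for a_j in index_vectors]
            for a_i in index_vectors]

def mean_offdiagonal_by_lag(G):
    """Average of G[i][j] over all pairs with |i - j| = lag, for lag = 1..J-1."""
    J = len(G)
    return [sum(G[i][i + lag] for i in range(J - lag)) / (J - lag)
            for lag in range(1, J)]

# Hypothetical local index vectors: tangents of a quarter circle, J = 5 level sets.
J = 5
angles = [0.5 * math.pi * j / (J - 1) for j in range(J)]
vectors = [(-math.sin(t), math.cos(t)) for t in angles]

G = gram_matrix(vectors)
profile = mean_offdiagonal_by_lag(G)   # decays as the lag |i - j| grows
```

For a traditional SIM all vectors coincide and the profile stays at one, whereas a decaying profile indicates index vectors that rotate gradually across level sets.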

In Figure 6 we plot the results for this method on Air Quality, Concrete, and Skillcraft data sets. We see that the pair-wise similarity $G_{ij}$ is indeed roughly proportional to $\left|{i-j}\right|^{-1}$ , suggesting that NSIM fits the data better than SIM. Results in Table 3 confirm this, by showing that NSIM outperforms SIM-based estimators (Lin-Reg, SIR-kNN, and Isotron).

6 Conclusions

In this paper we propose a nonlinear relaxation of the single index model for data sets with inherent monotonicity between features and outputs. We propose to estimate the model by combining localization through level set partitioning, local linear regression and a kNN-regressor for out-of-sample prediction. Our theoretical results provide guarantees on the error of the quantization of the tangent field of $\operatorname{Im}(\gamma)$ , and yield guarantees for out-of-sample prediction. In the noise free case we provide optimal learning rates, while in the noisy case we generally have a biased estimator. If the NSIM reduces to a SIM, i.e. if $\operatorname{Im}(\gamma)$ is a straight line, we recover the optimal learning rates for estimating the SIM also in the noisy case.

Our numerical experiments show that the NSIM estimator yields superior results when compared to estimators of similar model complexity. Moreover, the estimator outperforms shallow neural network models on data sets with rather few samples. On the other hand, if the data sets are sufficiently rich to properly fit shallow network models, their additional flexibility pays off and NSIM does not achieve similar predictive accuracy. Consequently, our future research direction aims at further enhancing the model space of our estimator, by replacing kNN with more sophisticated regressors and learning multiple index vectors, i.e. multi index models, in each level set.

Supplementary Materials

Code to replicate the experiments in the article is available at IMAIAI online.

Funding

This work was supported by the Research Council of Norway [251149/O70 to V.N.].

Acknowledgements

T.K. thanks Prof. Mauro Maggioni, Stefano Vigogna and Alessandro Lanteri for helpful discussions about the project.

Appendix A Appendix

A.1 Proofs for Section 3

This section is split into two parts. The first concerns a local analysis and establishes Theorem 2. The second part deals with the global analysis and proves Corollary 4.

A.1.1 Local analysis

Before we begin with the proof of Theorem 2 we collect some required auxiliary results. All these results describe local phenomena, which means we can consider a fixed, arbitrary closed interval ${\cal R}\subset[0,1]$ with corresponding minimal ${\cal S}:={\cal S}({\cal R})\subset\operatorname{Im}(\gamma)$ such that $\mathbb{P}(V\in{\cal S}({\cal R})|{\cal R})=1$ . We denote $\bar{t}:=\mathbb{E}[t|{\cal R}]$ , $a:=\gamma^{\prime}(\bar{t})$ , $P:=aa^{\top}$ , $Q:=\mathsf{Id}-P$ . For notational simplicity, we do not use a subscript ${\cal R}$ for e.g. $\Sigma,\kappa,N$ and so on, but keep in mind that all quantities are understood locally. We use $\lesssim$ to absorb universal numeric constants.

Auxiliary results

The following result shows that the lengths of ${\cal R}$ and ${\cal S}$ are equivalent up to the Lipschitz constant $L_{f}$ , provided $\left|{{\cal R}}\right|\gg\sigma_{\varepsilon}$ .

Lemma 11.

Take an interval ${\cal R}\subset\operatorname{Im}(f)$ and let ${\cal S}\subset\operatorname{Im}(\gamma)$ be the shortest segment such that $\mathbb{P}(V\in{\cal S}|Y\in{\cal R})=1$ . Then $L_{f}^{-1}(\left|{{\cal R}}\right|-2\sigma_{\varepsilon})\leq\left|{{\cal S}}\right|\leq L_{f}(\left|{{\cal R}}\right|+2\sigma_{\varepsilon})$ .

Proof.

For any $V,V^{\prime}\in{\cal S}$ we have $\left|{Y-Y^{\prime}}\right|-2\sigma_{\epsilon}\leq\left|{f(V)-f(V^{\prime})}\right|\leq\left|{Y-Y^{\prime}}\right|+2\sigma_{\epsilon}$ , almost surely, where $Y,Y^{\prime}$ are such that $Y=f(V)+\epsilon$ , $Y^{\prime}=f(V^{\prime})+\epsilon^{\prime}$ . Using (10) we have

[TABLE]

and the upper bound follows after taking the supremum over $V,V^{\prime}$ . For the converse, taking $(X,Y),(X^{\prime},Y^{\prime})$ such that $\left|{Y-Y^{\prime}}\right|=\left|{{\cal R}}\right|$ , we have

[TABLE]

∎

Next we provide some basic bounds on spectral properties of the conditional covariance matrix. We use in the proof that a random vector $Z$ satisfying $\left\|{Z-Z^{\prime}}\right\|\leq M$ almost surely, where $Z^{\prime}$ is an independent copy of $Z$ , satisfies $\left\|{\operatorname{Cov}\left({Z}\right)}\right\|\leq\mathbb{E}\left\|{Z-\mathbb{E}Z}\right\|^{2}=1/2\mathbb{E}\left\|{Z-Z^{\prime}}\right\|^{2}\leq 1/2M^{2}$ .
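For completeness, the chain of (in)equalities quoted above can be verified by expanding the square. The derivation below is a standard argument that we spell out as a sketch; it uses only that $Z^{\prime}$ is an independent copy of $Z$, so the cross term vanishes, and that the spectral norm of a positive semidefinite matrix is bounded by its trace.

```latex
\begin{aligned}
\mathbb{E}\|Z-Z'\|^{2}
  &= \mathbb{E}\|(Z-\mathbb{E}Z)-(Z'-\mathbb{E}Z')\|^{2} \\
  &= \mathbb{E}\|Z-\mathbb{E}Z\|^{2}+\mathbb{E}\|Z'-\mathbb{E}Z'\|^{2}
     -2\,\mathbb{E}\langle Z-\mathbb{E}Z,\;Z'-\mathbb{E}Z'\rangle \\
  &= 2\,\mathbb{E}\|Z-\mathbb{E}Z\|^{2},
\end{aligned}
\qquad
\|\operatorname{Cov}(Z)\|\;\le\;\operatorname{tr}\operatorname{Cov}(Z)
  \;=\;\mathbb{E}\|Z-\mathbb{E}Z\|^{2}.
```

Combining both displays with $\|Z-Z^{\prime}\|\leq M$ almost surely gives $\|\operatorname{Cov}(Z)\|\leq\frac{1}{2}M^{2}$.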

Lemma 12.

Let (A1), (A2) and (A5) hold. Take an interval ${\cal R}\subset\operatorname{Im}(f)$ and let ${\cal S}\subset\operatorname{Im}(\gamma)$ be the shortest segment such that $\mathbb{P}(V\in{\cal S}|{\cal R})=1$ . Then the following holds:

[TABLE]

Proof.

For (27) we use $a(V)\perp W$ and (A5) to get $\left|{a^{\top}W}\right|=\left|{(a-a(V))^{\top}W}\right|\leq\kappa\left|{{\cal S}}\right|B$ . Since $\mathbb{E}[W|Y]=0$ by (A1) and (A2) it follows that $\operatorname{Var}\left({a^{\top}W|{\cal R}}\right)=\mathbb{E}[(a^{\top}W)^{2}|{\cal R}]\leq(B\kappa\left|{{\cal S}}\right|)^{2}$ . For (28), we have by the fundamental theorem of calculus and $Q\perp\gamma^{\prime}(\bar{t})$

[TABLE]

(29) follows by the Cauchy-Schwarz inequality $\left\|{\operatorname{Cov}\left({V,QV|{\cal R}}\right)}\right\|\leq\sqrt{\left\|{\operatorname{Cov}\left({V|{\cal R}}\right)}\right\|\left\|{\operatorname{Cov}\left({QV|{\cal R}}\right)}\right\|}$ , $\left\|{\operatorname{Cov}\left({V|{\cal R}}\right)}\right\|\leq 1/2\left|{{\cal S}}\right|^{2}$ since $\left\|{V-V^{\prime}}\right\|\leq\left|{{\cal S}}\right|$ , and using (28). ∎

While upper bounds for spectral norms of covariance matrices are easily obtained in the previous Lemma, lower bounds for variances are generally more challenging to establish. In particular they have to rely on an assumption such as (A6), which asserts that the marginal distribution of $V$ is (measure-theoretically) equivalent to the uniform distribution. Our analysis in Section 3 hinges on the relation $\operatorname{Var}\left({a^{\top}X,Y|{\cal R}}\right)\asymp\left|{{\cal S}}\right|\left|{{\cal R}}\right|$ . The following two results show that this is true for example if $f\in{\cal C}^{2}$ and $\left|{{\cal R}}\right|\gg\sigma_{\varepsilon}$ . However we believe that more general conditions just relying on the monotonicity/bi-Lipschitz properties of $f$ could be established.

Lemma 13.

Let Assumptions (A1), (A2) and (A6) hold. For any interval ${\cal R}\subset\operatorname{Im}(f)$ with $\left|{{\cal R}}\right|>2\sigma_{\varepsilon}$ and ${\cal S}\subset\operatorname{Im}(\gamma)$ as the shortest segment such that $P(V\in{\cal S}|{\cal R})=1$ we have

[TABLE]

Proof.

Note that (A1) and (A2) imply $\operatorname{Cov}\left({V,W|{\cal R}}\right)=0$ and therefore $\operatorname{Var}\left({\left<{a},{X}\right>|{\cal R}}\right)=\operatorname{Var}\left({\left<{a},{V}\right>|{\cal R}}\right)+\operatorname{Var}\left({\left<{a},{W}\right>|{\cal R}}\right)$ . The upper bound follows from (27) and the fact that $\left\|{V-V^{\prime}}\right\|\leq\left|{{\cal S}}\right|$ almost surely, for an independent copy $V^{\prime}$ of $V$ , implies $\operatorname{Var}\left({a^{\top}V|{\cal R}}\right)\leq 1/2\left|{{\cal S}}\right|^{2}$ .

For the lower bound it suffices to concentrate on $\operatorname{Var}\left({\left<{a},{V}\right>|{\cal R}}\right)$ . We first use the identity $\mathbb{E}\left|{Z-\mathbb{E}[Z]}\right|^{2}=1/2\mathbb{E}\left|{Z-Z^{\prime}}\right|^{2}$ ( $Z^{\prime}$ is an independent copy of $Z$ ) to get

[TABLE]

The first term is bounded from below by $\left\langle a,\gamma^{\prime}(s)\right\rangle\geq 1-{\kappa}\left|{{\cal S}}\right|$ . For the second term, we fix $c>0$ (to be optimized later) and use Chebyshev’s inequality to get $\mathbb{E}[(t-\bar{t})^{2}|{\cal R}]\geq c^{2}\mathbb{P}(\left|{t-\bar{t}}\right|>c|{\cal R})$ . Let now ${\cal I}^{-}$ be any interval satisfying $\mathbb{P}(Y\in{\cal R}|V\in\gamma({\cal I}^{-}))=1$ . Then by using (A6) it follows that

[TABLE]

Optimizing now over $c$ we find $c=1/3{c_{V}}^{-2}\left|{{\cal I}^{-}}\right|$ gives the bound $\mathbb{E}[(t-\bar{t})^{2}|{\cal R}]\geq 1/27{c_{V}}^{4}\left|{{\cal I}^{-}}\right|^{2}$ which implies that we ought to make ${\cal I}^{-}$ as large as possible. Clearly, this is the case when setting ${\cal I}^{-}:=\gamma^{-1}\circ f^{-1}([\inf{\cal R}+\sigma_{\epsilon},\sup{\cal R}-\sigma_{\epsilon}])$ with $\left|{{\cal I}^{-}}\right|>L_{f}^{-1}(\left|{{\cal R}}\right|-2\sigma_{\epsilon})$ . ∎

Lemma 14.

Let (A1), (A2), and (A6) hold. If $f\in{\cal C}^{2}(\Omega)$ for $\Omega:=\{tv+(1-t)\bar{\gamma}:t\in[0,1],v\in\textrm{supp}(\rho_{X})\}$ and the Hessian satisfies $\sup_{x\in\Omega}\left\|{\nabla^{2}f(x)}\right\|\leq L_{H}$ we have

[TABLE]

Proof.

Assumptions (A1) and (A2) imply $\mathbb{E}[W|Y]=0$ and by the law of total covariance

[TABLE]

Therefore we have $\operatorname{Var}\left({a^{\top}X,Y|{\cal R}}\right)=\operatorname{Var}\left({a^{\top}V,Y|{\cal R}}\right)$ . Furthermore, if $f\in{\cal C}^{2}$ we can use the Taylor expansion of $f$ to rewrite for some $\zeta\in\mathbb{R}^{D}$

[TABLE]

Using that $\nabla f$ is aligned with the tangent field of $\gamma$ (by choice of the parametrization) we have $\nabla f(\bar{\gamma})=\left\|{\nabla f(\bar{\gamma})}\right\|a$ and we get

[TABLE]

The result follows by Lemma 13, and $\operatorname{Cov}\left({a^{\top}V,\varepsilon|{\cal R}}\right)\leq\frac{1}{2}\left|{{\cal S}}\right|\sigma_{\varepsilon}$ which implies

[TABLE]

∎

The last tool required for proving Theorem 2 are the following concentration results for mean and covariance estimation of bounded random variables.

Lemma 15.

Let $A\in\mathbb{R}^{d_{A}\times D}$ and $B\in\mathbb{R}^{d_{B}\times D}$ , and assume $\left\|{A(X-\mathbb{E}X)}\right\|\leq C_{A}$ , $\left\|{B(X-\mathbb{E}X)}\right\|\leq C_{B}$ almost surely. Let $\hat{\mathbb{E}}_{X}$ be the sample mean, and $\hat{\Sigma}$ the sample covariance from $N$ i.i.d. copies of $X$ . For any $u>0$ , we have

[TABLE]

Proof.

The first bound is a standard result that follows from the bounded differences inequality [35]. For (33) denote $\tilde{\Sigma}=\hat{\mathbb{E}}{(X-\mathbb{E}X)(X-\mathbb{E}X)^{\top}}$ and decompose the error into

[TABLE]

By the first result in (33) the second term is of order ${\cal O}(C_{A}C_{B}N^{-1})$ with probability $1-2\exp(-u)$ , and can thus be neglected. For the first term, denote $S_{k}:=\frac{1}{N}A\tilde{X}_{k}\tilde{X}_{k}^{\top}B-\frac{1}{N}A\Sigma B$ and $S:=\sum_{k=1}^{N}S_{k}$ , where $\tilde{X}_{k}=X_{k}-\mathbb{E}X$ . Since $\mathbb{E}[\tilde{X}_{k}\tilde{X}_{k}^{\top}]=\Sigma$ we have $\mathbb{E}[S_{k}]=0$ , and since $\tilde{X}_{k}$ and $\tilde{X}_{j}$ are independent for $k\neq j$ we get $\mathbb{E}[S_{k}S_{j}^{\top}]=\mathbb{E}[S_{k}]\mathbb{E}[S_{j}^{\top}]=0$ . Thus,

[TABLE]

Since $\left\|{S_{k}}\right\|\leq 2N^{-1}C_{A}C_{B}$ holds almost surely we have $\left\|{\mathbb{E}SS^{\top}}\right\|\leq 4N^{-1}C_{A}^{2}C_{B}^{2}$ and by an analogous argument we have the same bound for $\left\|{\mathbb{E}S^{\top}S}\right\|$ . Thus, the variance statistic (cf. Remark 25) satisfies $\sqrt{m(S)}\leq 2N^{-1/2}C_{A}C_{B}$ and Theorem 24 yields the desired result. ∎

Proof of Theorem 2

We prove a more detailed version of Theorem 2 given as follows.

Theorem 16.

Let (A1), (A2), (A4) and (A5) hold. Let $u>1$ , ${\cal R}\subset[0,1]$ be a closed interval with $\left|{{\cal R}}\right|>4\sigma_{\varepsilon}$ , and ${\cal S}\subset\operatorname{Im}(\gamma)$ the smallest segment such that $P(V\in{\cal S}|Y\in{\cal R})=1$ . Denote ${\sigma_{Y}}:=\operatorname{Var}\left({a^{\top}X,Y|{\cal R}}\right)(\left|{{\cal R}}\right|\left|{{\cal S}}\right|)^{-1}>0$ , and assume that for some $\alpha\geq 0$ , $C_{W}\geq 2{\sigma_{Y}}\left|{{\cal S}}\right|^{2-\alpha}$

[TABLE]

Furthermore denote the scalars $B_{+}:=B+\left|{{\cal I}}\right|$ ,

[TABLE]

There exists a universal constant $C$ such that whenever $\eta<3$ and

[TABLE]

we have with probability $1-\exp(-u)$

[TABLE]

Proof of Theorem 2 from Theorem 16.

We apply Theorem 16 with $C_{W}^{*}=C_{W}\vee 3L_{f}{\sigma_{{j},Y}}\left|{{\cal R}_{j}}\right|>2{\sigma_{{j},Y}}\left|{{\cal S}_{j}}\right|$ , where the second inequality follows from $\left|{{\cal S}_{j}}\right|\leq 3/2L_{f}\left|{{\cal R}_{j}}\right|$ (Lemma 11) and $\left|{{\cal R}_{j}}\right|>4\sigma_{\varepsilon}$ . Algebraic manipulation reveals that $\eta<3$ is implied by the first condition in (15), and (36) is implied by the second condition in (15). The result follows by $\left|{{\cal S}_{j}}\right|\leq 3/2L_{f}\left|{{\cal R}_{j}}\right|\leq 3/2L_{f}J^{-1}$ . ∎

The proof of Theorem 16 is given at the end of this section because it requires a few tools that we develop first. Bringing forward a step of the proof already now, we obtain the estimate

[TABLE]

Thus, it suffices to bound $\|Qb\|$ , $\|P(\hat{b}-b)\|$ and $\|Q(\hat{b}-b)\|$ from above. In order to achieve optimal dependencies of the bounds with respect to both $\left|{{\cal S}}\right|$ (or $\left|{{\cal R}}\right|$ ) and $N$ , we have to decompose $Qb$ , $P(\hat{b}-b)$ and $Q(\hat{b}-b)$ into separate terms that reflect how $\Sigma,\hat{\Sigma},\Sigma^{\dagger}$ and $\hat{\Sigma}^{\dagger}$ act on $b$ and $\hat{b}$ . This requires three tools: first we analyze spectral norms of $\Sigma^{\dagger}$ when paired with directions $P$ , $Q$ (Lemma 17). Then we need to bound perturbations $A(\hat{\Sigma}^{\dagger}-\Sigma^{\dagger})B$ for $A,B\in\{P,Q\}$ to control the deviation of $\hat{\Sigma}^{\dagger}$ to $\Sigma^{\dagger}$ (Lemma 18). Finally, we need to analyze $r=\operatorname{Cov}\left({X,Y|{\cal R}}\right)$ since $b=\Sigma^{\dagger}r$ , and similarly we require concentration bounds of the finite sample counterpart $\hat{r}$ around $r$ (Lemma 19). These results are then combined to prove Theorem 16.

We begin by analyzing spectral bounds for $\Sigma$ . It will be convenient to use ${\lambda}:=4{\sigma_{Y}}^{2}$ instead of ${\sigma_{Y}}$ since ${\lambda}$ satisfies the relation $\operatorname{Var}\left({a^{\top}X|{\cal R}}\right)\geq{\lambda}\left|{{\cal S}}\right|^{2}$ as we will see below.

Lemma 17.

If (35), (A4), and $\eta<\infty$ hold we have

[TABLE]

Proof.

Establishing (39) is challenging because the eigenspace of $\Sigma$ does not separate into eigenspaces related to $P$ and $Q$ . Instead, we have to relate $\Sigma$ to the auxiliary matrix $\Sigma_{P}:=\operatorname{Cov}\left({PX|{\cal R}}\right)+\operatorname{Cov}\left({QX|{\cal R}}\right)$ . Since we have

[TABLE]

Eqn. (35) implies $\left\|{\Sigma-\Sigma_{P}}\right\|\leq 2{\kappa}C_{W}\left|{{\cal S}}\right|^{1+\alpha}$ , which becomes small when $\left|{{\cal S}}\right|$ tends to $0$ . Based on this observation we use the following proof strategy: In the first step, we show that $\Sigma$ and $\Sigma_{P}$ share the same range under the assumptions in the statement. We can then derive the spectral decomposition of $\Sigma_{P}$ in the second step, and use ${\sigma_{Y}}$ and $C_{\perp}$ from (A4) to bound spectral norms of $\Sigma_{P}$ . In the third step we translate these bounds via perturbation theory to $\Sigma$ .

We show $\operatorname{Im}(\Sigma_{P})=\operatorname{Im}(\Sigma)$ . First note that $\operatorname{Im}(\Sigma_{P})=\operatorname{Im}(P\Sigma P)\oplus\operatorname{Im}(Q\Sigma Q)\subset\operatorname{Im}(P\Sigma)\oplus\operatorname{Im}(Q\Sigma)=\operatorname{Im}(\Sigma)$ , which implies that it suffices to show $\operatorname{rank}(\Sigma_{P})=\operatorname{rank}(\Sigma)$ . Since $\eta<\infty$ implies ${\sigma_{Y}}>0$ and therefore $\operatorname{Cov}\left({\left<{a},{V}\right>|{\cal R}}\right)>0$ , we have $\operatorname{rank}(\Sigma_{P})=\operatorname{rank}(P\Sigma_{P}P)+\operatorname{rank}(Q\Sigma_{P}Q)=1+\operatorname{rank}(Q\Sigma_{P}Q)$ . To find a lower bound for $\operatorname{rank}(Q\Sigma_{P}Q)$ , we note that, by (A4), any unit norm $v\in\operatorname{Im}(\Sigma)\cap\operatorname{Im}(Q)$ obeys

[TABLE]

Therefore, $\operatorname{rank}(Q\Sigma_{P}Q)\geq\dim(\operatorname{Im}(\Sigma)\cap\operatorname{Im}(Q))$ . The result now follows by $\dim(\operatorname{Im}(\Sigma)\cap\operatorname{Im}(Q))=\operatorname{rank}(\Sigma)-\dim(\operatorname{Im}(\Sigma)\cap\operatorname{Im}(P))\geq\operatorname{rank}(\Sigma)-1$ .

Denote $d=\operatorname{rank}(\Sigma)=\operatorname{rank}(\Sigma_{P})$ . By construction, the eigendecomposition of $\Sigma_{P}$ is

[TABLE]

where $\{u_{2},\ldots,u_{d}\}$ is an eigensystem for $Q\Sigma Q$ . As $\Sigma_{P}^{\dagger}$ has the same eigen-decomposition with eigenvalues inverted, we have $P\Sigma_{P}^{\dagger}Q=0$ . Furthermore, $\|Q\Sigma_{P}^{\dagger}Q\|\leq 1/C_{\perp}$ follows by (A4). For $P\Sigma_{P}^{\dagger}P$ using Popoviciu’s inequality for the variance of the random variable $Y|{\cal R}$ we get

[TABLE]

which implies $\|P\Sigma_{P}^{\dagger}P\|\leq(4{\sigma_{Y}}^{2}\left|{{\cal S}}\right|^{2})^{-1}={\lambda}^{-1}\left|{{\cal S}}\right|^{-2}$ .
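Popoviciu's inequality, used above, states that a random variable supported on an interval $[m,M]$ has variance at most $(M-m)^{2}/4$; here it is applied to $Y|{\cal R}$ , which is supported on an interval of length $\left|{{\cal R}}\right|$ . The following sketch checks the inequality on two hypothetical samples; the function names are illustrative.

```python
def population_variance(values):
    """Variance of the empirical distribution (normalized by n, not n - 1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def popoviciu_bound(lower, upper):
    """Upper bound (upper - lower)^2 / 4 on the variance of a variable in [lower, upper]."""
    return (upper - lower) ** 2 / 4.0

# Equal mass on both endpoints attains the bound; interior mass stays below it.
endpoint_sample = [0.0, 1.0, 0.0, 1.0]
interior_sample = [0.2, 0.4, 0.5, 0.9]

endpoint_variance = population_variance(endpoint_sample)
interior_variance = population_variance(interior_sample)
bound = popoviciu_bound(0.0, 1.0)
```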

Finally we transfer the bounds on $\Sigma_{P}$ to the true covariance matrix $\Sigma$ . We use the shorthand $\Delta:=\Sigma-\Sigma_{P}$ . We first note that $\operatorname{Im}(\Sigma)=\operatorname{Im}(\Sigma_{P})$ implies the identity $\Sigma^{\dagger}=\Sigma_{P}^{\dagger}-\Sigma^{\dagger}\Delta\Sigma_{P}^{\dagger}$ by [44]. Multiplying with $P,Q$ in different combinations from left and right, and using $P+Q=\mathsf{Id}$ , $P\Delta P=Q\Delta Q=0$ , and $P\Sigma_{P}^{\dagger}Q=0$ we obtain a system of equations given by

[TABLE]

Consider now first $P\Sigma P$ . By plugging (41) into (42) and rearranging the terms, we get

[TABLE]

The matrix $H$ satisfies $\left\|{H}\right\|\leq 4{\kappa}^{2}C_{W}^{2}\left|{{\cal S}}\right|^{2+2\alpha}/(4{\sigma_{Y}}^{2}\left|{{\cal S}}\right|^{2}C_{\perp})=({\kappa}C_{W}\left|{{\cal S}}\right|^{\alpha})^{2}/({\sigma_{Y}}^{2}C_{\perp})<1$ under the condition $\eta<\infty$ . Therefore the inverse of $\mathsf{Id}-H$ is explicitly given by $\sum_{k=0}^{\infty}H^{k}$ by a von Neumann series argument. Using this and submultiplicativity of the spectral norm we get

[TABLE]
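The series argument used in this step can be illustrated numerically: whenever $\left\|{H}\right\|<1$ , the partial sums of $\sum_{k}H^{k}$ converge to $(\mathsf{Id}-H)^{-1}$ . The sketch below uses a hypothetical $2\times 2$ matrix $H$ and plain Python lists; the helper names are illustrative.

```python
def matmul(A, B):
    """Product of two matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def identity(n):
    """n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def neumann_inverse(H, terms=200):
    """Approximate (Id - H)^{-1} by the partial sum Id + H + ... + H^(terms-1)."""
    n = len(H)
    total = identity(n)
    power = identity(n)
    for _ in range(terms - 1):
        power = matmul(power, H)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total

# Hypothetical matrix with spectral norm well below one.
H = [[0.2, 0.1], [0.0, 0.3]]
approx_inverse = neumann_inverse(H)
id_minus_H = [[(1.0 if i == j else 0.0) - H[i][j] for j in range(2)] for i in range(2)]
residual = matmul(id_minus_H, approx_inverse)   # should be close to the identity
```

The same expansion yields the norm bound $\left\|{(\mathsf{Id}-H)^{-1}}\right\|\leq(1-\left\|{H}\right\|)^{-1}$ by summing the geometric series of norms.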

By a symmetry argument, we could have followed the same steps with $Q$ instead, which immediately implies the bound on $Q\Sigma^{\dagger}Q$ . Finally, the bound on the cross term follows from (41) and using the bounds on $P\Sigma^{\dagger}P$ , $\Delta$ and $Q\Sigma_{P}^{\dagger}Q$ . ∎

We shall next bound $P(\hat{\Sigma}^{\dagger}-\Sigma^{\dagger})P$ , $P(\hat{\Sigma}^{\dagger}-\Sigma^{\dagger})Q$ , and $Q(\hat{\Sigma}^{\dagger}-\Sigma^{\dagger})Q$ . This step is the most technical one because we need to keep close track of the dependencies of $\hat{\Sigma}-\Sigma$ on directions they are evaluated in to achieve optimal bounds with respect to both $N$ and $\left|{{\cal S}}\right|$ . In particular, applying Lemma 15 in conjunction with (30) in Lemma 12, we have with probability $1-3\exp(-u)$

[TABLE]

Lemma 18.

Assume (35), (A4), (A5) and $\eta<\infty$ . Fix a confidence level $u>0$ . There exists a universal constant $C$ such that whenever $N\geq\max\{C\eta^{2}\theta^{2}(\log(D)+u)^{2},D\}$ we have with probability $1-\exp(-u)$ simultaneously

[TABLE]

Proof.

We first note that we have $\operatorname{Im}(\hat{\Sigma})=\operatorname{Im}(\Sigma)$ since $N\geq D$ and we assume that $X|Y\in{\cal R}$ is absolutely continuous with respect to $\operatorname{Im}(\Sigma)$ , see Section 2. Now denote the shorthand $\Delta:=\hat{\Sigma}-\Sigma$ . From [44] we obtain the identity

[TABLE]

and by using $P+Q=\mathsf{Id}$ and rearranging the terms, this implies

[TABLE]

Considering only the first two equations, they contain two unknowns $P(\hat{\Sigma}^{\dagger}-\Sigma^{\dagger})P$ and $P(\hat{\Sigma}^{\dagger}-\Sigma^{\dagger})Q$ . Hence we can solve for these unknowns by solving a linear system $SU=R$ with

[TABLE]

It is well-known that, provided $S_{11}$ and $S_{22}-S_{21}S_{11}^{-1}S_{12}$ are invertible, the inverse of $S$ is precisely

[TABLE]

This allows us to express $Q(\hat{\Sigma}^{\dagger}-\Sigma^{\dagger})P$ in terms of known quantities once we have computed the related entries of the inverse $S^{-1}$ . This will be our first goal in the following.

Whenever $\left\|{P\Sigma^{\dagger}\Delta P}\right\|<1$ , we have $S_{11}^{-1}=\sum_{k=0}^{\infty}(-P\Sigma^{\dagger}\Delta P)^{k}$ using a von Neumann series argument. Following the same argument, the matrix $S_{22}-S_{21}S_{11}^{-1}S_{12}=\mathsf{Id}+Q\Sigma^{\dagger}\Delta Q-S_{21}S_{11}^{-1}S_{12}$ is invertible whenever, for $H:=Q\Sigma^{\dagger}\Delta Q-S_{21}S_{11}^{-1}S_{12}$ , we have $\left\|{H}\right\|<1$ . In that case

[TABLE]

Taking the supremum norm and using norm submultiplicativity it follows that

[TABLE]
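The von Neumann series argument used here can be illustrated with a small numerical sketch. It assumes a generic matrix `A` with spectral norm below one (a stand-in for $P\Sigma^{\dagger}\Delta P$ or $H$, not the actual quantities) and checks both the series representation of $(\mathsf{Id}+A)^{-1}$ and the resulting norm bound $\|(\mathsf{Id}+A)^{-1}\|\leq(1-\|A\|)^{-1}$.

```python
import numpy as np

def neumann_inverse(A, num_terms=200):
    """Approximate (Id + A)^{-1} by the truncated series sum_k (-A)^k."""
    d = A.shape[0]
    total = np.zeros((d, d))
    term = np.eye(d)                 # (-A)^0
    for _ in range(num_terms):
        total += term
        term = term @ (-A)           # next power of -A
    return total

# Hypothetical matrix, rescaled so that its spectral norm equals 0.5 < 1.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
A *= 0.5 / np.linalg.norm(A, 2)

approx = neumann_inverse(A)
exact = np.linalg.inv(np.eye(4) + A)
norm_A = np.linalg.norm(A, 2)
```

The geometric decay of the terms is what makes the series converge; once $\|A\|$ approaches one, the bound $(1-\|A\|)^{-1}$ blows up, which is why the proof needs a lower bound on $N$.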

Moreover, we can simplify leading factors in (51) by estimating $\left\|{H}\right\|$ . Specifically we find

[TABLE]

and therefore after algebraic manipulations we get

[TABLE]

Having (51) and (52) established, we now need to bound terms like $\|A\Sigma^{\dagger}\Delta B\|_{2}$ and $\|A\Sigma^{\dagger}\Delta\Sigma^{\dagger}B\|$ where $A,B\in\{P,Q\}$ . This ensures on one hand the invertibility of $P\Sigma^{\dagger}\Delta P$ and $H$ , and on the other hand bounds remaining terms in (51). All bounds are achieved similarly by decomposing them further and using the triangle inequality, e.g. to get

[TABLE]

Then application of Lemma 17 and (44) yields a concentration bound. For simplicity, we list the resulting bounds in Table 4 below. They hold with probability at least $1-3\exp(-u)$ .

Now, let us first ensure the invertibility of $P\Sigma^{\dagger}\Delta P$ and $H$ , which was needed to derive (51). Since $T_{3}T_{4}\leq T_{1}T_{2}$ , Eqn. (52) becomes $(1-\left\|{H}\right\|)^{-1}\leq((1-T_{1})(1-T_{2})-T_{1}T_{2})^{-1}$ which is less than $1$ e.g. if $\max\{T_{1},T_{2}\}<1/2$ . Thus it suffices to require

[TABLE]

This is ensured by the assumption $N\geq C\eta^{2}\theta^{2}(\log(D)+u)^{2}$ and therefore $(1-\left\|{H}\right\|)^{-1}\lesssim 1$ , $\left\|{P\Sigma^{\dagger}\Delta P}\right\|\lesssim 1$ . Combining this with (51) we then obtain

[TABLE]

where we used $N\geq C\eta^{2}\theta^{2}(\log(D)+u)^{2}$ again to simplify the higher order term. This proves (47).

The remaining two bounds are easier since we can use (47). For (45) we recall (48) and $\left\|{P\Sigma^{\dagger}\Delta P}\right\|<1$ (whenever $N\geq C\eta^{2}\theta^{2}(\log(D)+u)^{2}$ ) to get

[TABLE]

Then, expressing the inverse by a von Neumann series and using $(1-\left\|{P\Sigma^{\dagger}\Delta P}\right\|)^{-1}\lesssim 1$ we get

[TABLE]

where we used again $N\geq C\eta^{2}\theta^{2}(\log(D)+u)^{2}$ to simplify the higher order term. (46) follows similarly by starting from (50). ∎

It remains to analyze the cross-covariance term $r=\operatorname{Cov}\left({X,Y|Y\in{\cal R}}\right)$ and to bound its concentration when estimated from a finite data set.

Lemma 19.

Assume (A1), (A2). For $r=\operatorname{Cov}\left({X,Y|{\cal R}}\right)$ we have $\left\|{Pr}\right\|={\sigma_{Y}}\left|{{\cal S}}\right|\left|{{\cal R}}\right|$ and $\left\|{Qr}\right\|\leq 1/2\kappa\left|{{\cal S}}\right|^{2}\left|{{\cal R}}\right|$ . Furthermore, let $\{(X_{i},Y_{i}):i\in[N]\}$ denote $N$ i.i.d. copies of $(X,Y)$ , and denote $\hat{r}=N^{-1}\sum_{i=1}^{N}(X_{i}-\hat{\mathbb{E}}{X_{i}})(Y_{i}-\hat{\mathbb{E}}{Y_{i}})$ . Then for $u>1$ we have the concentration results

[TABLE]

Proof.

$\left\|{Pr}\right\|={\sigma_{Y}}\left|{{\cal S}}\right|\left|{{\cal R}}\right|$ is precisely the definition of ${\sigma_{Y}}$ in Theorem 16. For $Qr$ we first recall $\operatorname{Cov}\left({W,Y|{\cal R}}\right)=0$ as in (32). Therefore, we can write $Qr=Q\operatorname{Cov}\left({X,Y|{\cal R}}\right)=Q\operatorname{Cov}\left({V,Y|{\cal R}}\right)$ which satisfies by (28) in Lemma 12

[TABLE]

For the concentration results, we denote $Z_{i}:=(X_{i}-\mathbb{E}X)(Y_{i}-\mathbb{E}Y)-\operatorname{Cov}\left({X,Y}\right)$ , and let $A\in\{P,\mathsf{Id}\}$ . We can decompose the error as

[TABLE]

and notice that, by Lemma 15, the second term is always of higher order. For the first term, we have $\mathbb{E}AZ_{i}=0$ , and

[TABLE]

where $\left\|{A(X-\mathbb{E}X)}\right\|\leq C_{A}$ almost surely. Using (30) in Lemma 12, we can choose $C_{A}=2\left|{{\cal S}}\right|$ if $A=P$ , and $C_{A}=B_{+}$ if $A=\mathsf{Id}$ . The result follows from (33) in Lemma 15. ∎

Proof of Theorem 16.

The proof is divided into three steps. First we use previously established Lemmata 17, 18, and 19 to provide concentration bounds for $\|P(\hat{b}-b)\|$ and $\|Q(\hat{b}-b)\|$ , where we recall $b=\Sigma^{\dagger}r$ and $\hat{b}=\hat{\Sigma}^{\dagger}\hat{r}$ . Then we establish that the bound (38) is indeed true under the conditions of the Theorem. Finally, we use the concentration bounds on $\|P(\hat{b}-b)\|$ and $\|Q(\hat{b}-b)\|$ together with a bound on $\|Qb\|$ to conclude the result.
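As a concrete illustration of the quantities $\hat{b}=\hat{\Sigma}^{\dagger}\hat{r}$ and $\hat{a}=\hat{b}/\|\hat{b}\|$ analyzed in this proof, the following is a minimal sketch of the estimator on a toy data set. The function name `local_index_estimate` and the globally linear toy model are assumptions made for illustration; in the paper the estimator is applied to the samples of a single level set.

```python
import numpy as np

def local_index_estimate(X, Y):
    """Return (a_hat, b_hat) with b_hat = pinv(Sigma_hat) @ r_hat, a_hat = b_hat / ||b_hat||."""
    N = X.shape[0]
    Xc = X - X.mean(axis=0)            # center the features
    Yc = Y - Y.mean()                  # center the responses
    Sigma_hat = Xc.T @ Xc / N          # empirical covariance
    r_hat = Xc.T @ Yc / N              # empirical cross-covariance
    b_hat = np.linalg.pinv(Sigma_hat) @ r_hat
    return b_hat / np.linalg.norm(b_hat), b_hat

# Toy data (assumption): a linear response along a single index vector a.
rng = np.random.default_rng(2)
N, D = 2000, 5
a = np.zeros(D)
a[0] = 1.0
X = rng.standard_normal((N, D))
Y = 2.0 * (X @ a) + 0.01 * rng.standard_normal(N)

a_hat, b_hat = local_index_estimate(X, Y)
```

In this toy setting `a_hat` should be close to `a`; the proof quantifies how the components $P(\hat{b}-b)$ and $Q(\hat{b}-b)$ contribute to the error in the general, nonlinear case.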

Let us begin with $\|P(\hat{b}-b)\|$ . We first decompose the error into

[TABLE]

Now we apply Lemmata 17, 18, and 19 to bound these terms. The second term has higher order and is thus neglected. For the first term we get with probability $1-2\exp(-u)$

[TABLE]

where we used $\left|{{\cal R}}\right|/\left|{{\cal S}}\right|\lesssim L_{f}$ since $\left|{{\cal S}}\right|\geq L_{f}^{-1}(\left|{{\cal R}}\right|-2\sigma_{\varepsilon})$ by Lemma 11, and $\left|{{\cal R}}\right|>4\sigma_{\varepsilon}$ . For the third term in (53) we have with probability $1-2\exp(-u)$

[TABLE]

where we used that $\eta<\infty$ implies ${\kappa}C_{W}\left|{{\cal S}}\right|^{\alpha}/({\lambda}C_{\perp})\leq 1$ . Since $\theta^{2}{\sigma_{Y}}\geq\theta\max\{1,{\sigma_{Y}}^{-2}\}{\sigma_{Y}}\geq\theta$ the bound for the third term is dominated by the bound on $\|P(\hat{\Sigma}^{\dagger}-\Sigma^{\dagger})r\|$ , and thus we get with probability $1-4\exp(-u)$

[TABLE]

The same strategy is used for $Q(\hat{b}-b)$ . First we decompose into three terms

[TABLE]

and notice that the second term is of higher order. The first term is bounded by

[TABLE]

and for the third summand we get

[TABLE]

As before the first term dominates and thus we have with probability $1-4\exp(-u)$

[TABLE]

Next we prove the error decomposition (38). This first requires ensuring $a^{\top}\hat{b}>0$ (Step 2.1).

2.1 We first note that the definition $b=\Sigma^{\dagger}r$ implies $r=\Sigma b$ . Rewriting $a^{\top}r$ we get

[TABLE]

Furthermore, using Lemmata 17 and 19 together with $C_{W}\geq 2{\sigma_{Y}}\left|{{\cal S}}\right|^{2-\alpha}$ and ${\lambda}=4{\sigma_{Y}}^{2}$ , we can bound $\|Qb\|$ by

[TABLE]

Plugging this, $a^{\top}r=\operatorname{Var}\left({a^{\top}X,Y|{\cal R}}\right)={\sigma_{Y}}\left|{{\cal S}}\right|\left|{{\cal R}}\right|$ , $a^{\top}\Sigma a=\operatorname{Var}\left({a^{\top}X|{\cal R}}\right)\leq 2\left|{{\cal S}}\right|^{2}$ (Lemma 12), and $\left\|{P\Sigma Q}\right\|\leq{\kappa}C_{W}\left|{{\cal S}}\right|^{1+\alpha}$ into (56), we obtain

[TABLE]

where $\left|{{\cal R}}\right|/\left|{{\cal S}}\right|\geq 1/(2L_{f})$ by Lemma 11 in the last inequality. By the requirement $\eta<3$ it follows that $a^{\top}b>0$ . We can transfer the lower boundedness to the estimate $a^{\top}\hat{b}$ by

[TABLE]

with probability $1-4\exp(-u)$ , and where $C$ is some universal constant. Using the condition (36), which bounds $N$ from below, we obtain $a^{\top}\hat{b}>0$ with probability $1-4\exp(-u)$ .

2.2 Now we can prove decomposition (38). First notice that Pythagoras gives $\left\|{\hat{a}-a}\right\|^{2}=\left\|{P\hat{a}-a}\right\|^{2}+\left\|{Q\hat{a}}\right\|^{2}$ . Furthermore since $a^{\top}\hat{b}>0$ , we can rewrite $a=\|P\hat{b}\|^{-1}P\hat{b}$ to get

[TABLE]

where we used the triangle inequality in the last step. Therefore, we get $\left\|{\hat{a}-a}\right\|^{2}\leq\left\|{P\hat{a}-a}\right\|^{2}+\left\|{Q\hat{a}}\right\|^{2}\leq 2\|Q\hat{b}\|^{2}\|\hat{b}\|^{-2}$ which implies

[TABLE]
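The elementary inequality $\left\|{\hat{a}-a}\right\|^{2}\leq 2\|Q\hat{b}\|^{2}\|\hat{b}\|^{-2}$ derived in Step 2.2 can be sanity-checked numerically. The sketch below assumes $P=aa^{\top}$ and $Q=\mathsf{Id}-P$ for a unit vector $a$, draws random vectors in place of $\hat{b}$, and enforces $a^{\top}\hat{b}>0$ by a sign flip; the random draws are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 6
a = rng.standard_normal(D)
a /= np.linalg.norm(a)                 # unit index vector
P = np.outer(a, a)                     # projection onto span(a)
Q = np.eye(D) - P                      # projection onto the orthogonal complement

max_violation = -np.inf                # largest value of lhs - rhs over all trials
max_pythagoras_gap = 0.0
for _ in range(1000):
    b_hat = rng.standard_normal(D)
    if a @ b_hat <= 0:
        b_hat = -b_hat                 # enforce a^T b_hat > 0
    a_hat = b_hat / np.linalg.norm(b_hat)
    lhs = np.linalg.norm(a_hat - a) ** 2
    rhs = 2.0 * np.linalg.norm(Q @ b_hat) ** 2 / np.linalg.norm(b_hat) ** 2
    pythagoras = np.linalg.norm(P @ a_hat - a) ** 2 + np.linalg.norm(Q @ a_hat) ** 2
    max_violation = max(max_violation, lhs - rhs)
    max_pythagoras_gap = max(max_pythagoras_gap, abs(lhs - pythagoras))
```

Writing $c=a^{\top}\hat{a}\in[0,1]$, the left-hand side equals $2-2c$ and the right-hand side equals $2(1-c^{2})$, so the inequality reduces to $c^{2}\leq c$, which is exactly where the sign condition $a^{\top}\hat{b}>0$ enters.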

In this final step we combine (58) with the other results of steps 1 and 2. First we notice that the denominator in (58) is bounded from below by $1/16{\sigma_{Y}}(3-\eta)L_{f}^{-1}$ by choosing the universal $C$ in the requirement (36) large enough. $\|Qb\|$ is bounded as in (57), and for $\|Q(\hat{b}-b)\|$ we use the concentration bound (55). ∎

A.1.2 Global analysis

In this part we analyze the global error of approximating the tangent field by proving Corollary 4. The result can be established quickly from Theorem 2 once we ensure that each level set contains sufficiently many samples. Indeed this is the case under (A6) as shown in the following Lemma.

Lemma 20.

Let (A6) hold, and let $\{X_{i}:i\in[N]\}$ be $N$ i.i.d. copies of $X$ . For $0<u<N$ we have

[TABLE]

Furthermore if $\{{\cal X}_{j}:j\in[J]\}$ and $\{{\cal Y}_{j}:j\in[J]\}$ is a partition according to (12) for some $J^{-1}>4\sigma_{\varepsilon}$ and $N>\frac{8L_{f}\left|{{\cal I}}\right|u}{c_{V}}J$ we have

[TABLE]

Proof.

Let $\varepsilon=\frac{\left|{{\cal I}}\right|u}{c_{V}N}$ , and $V\in\operatorname{Im}(\gamma)$ . Since (A6) implies $\mathbb{P}\left(V^{\prime}\in{\cal B}_{d_{\gamma}}(V,\varepsilon)\right)>c_{V}\varepsilon\left|{{\cal I}}\right|^{-1}$ , where $V^{\prime}$ is an independent copy of $V$ , we have

[TABLE]

For the second statement let $j\in[J]$ be arbitrary and denote ${\cal R}_{j}=[a_{j},b_{j}]$ , ${\cal R}_{j}^{-}=[3/4a_{j}+1/4b_{j},1/4a_{j}+3/4b_{j}]$ . Then, since $J^{-1}=\left|{{\cal R}_{j}}\right|>4\sigma_{\varepsilon}$ we have $\mathbb{P}(Y\in{\cal R}_{j}|f(X)\in{\cal R}_{j}^{-})=1$ , and thus there exists a segment ${\cal S}_{j}\subset\operatorname{Im}(\gamma)$ with $\left|{{\cal S}_{j}}\right|\geq 1/2L_{f}^{-1}\left|{{\cal R}_{j}}\right|=1/2L_{f}^{-1}J^{-1}$ such that $\mathbb{P}(Y\in{\cal R}_{j}|V\in{\cal S}_{j})=1$ . The result follows from

[TABLE]

where we used $N>\frac{8L_{f}\left|{{\cal I}}\right|u}{c_{V}}J$ to simplify the bound on $\min_{j\in[J]}\left|{{\cal X}_{j}}\right|$ in the first inequality. ∎
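To make the role of the level set sizes $\left|{{\cal X}_{j}}\right|$ concrete, the following sketch partitions responses into $J$ intervals and counts the samples per interval. It assumes that the responses lie in $[0,1]$ and that the intervals ${\cal R}_{j}$ have equal width $J^{-1}$ (consistent with $J^{-1}=\left|{{\cal R}_{j}}\right|$ in the proof); the function name `level_set_partition` and the uniform toy responses are hypothetical, and the exact construction in (12) may differ.

```python
import numpy as np

def level_set_partition(Y, J):
    """Assign each response in [0, 1] to one of J equal-width intervals R_j (assumed form)."""
    idx = np.floor(Y * J).astype(int)
    return np.clip(idx, 0, J - 1)      # put Y == 1 into the last interval

# Toy responses (assumption): uniformly distributed on [0, 1].
rng = np.random.default_rng(4)
N, J = 5000, 10
Y = rng.uniform(0.0, 1.0, size=N)

labels = level_set_partition(Y, J)
counts = np.bincount(labels, minlength=J)   # counts[j] plays the role of |X_j|
min_count = counts.min()
```

The lemma guarantees, with high probability, that the smallest count grows proportionally to $N/J$, which is what allows Theorem 2 to be applied on every level set simultaneously.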

Proof of Corollary 4.

Let us first check whether the conditions of Theorem 2 are satisfied for each $j\in[J]$ . Clearly, (17) implies (15) for all $j\in[J]$ . Furthermore the number of samples satisfies with probability exceeding $1-\exp(-u)$ by Lemma 20

[TABLE]

Thus, $\left|{{\cal X}_{j}}\right|$ satisfies (15) for $u\log(J)$ instead of $u$ for all $j\in[J]$ as soon as $C_{N}$ is equal to $C_{N}$ in Theorem 2 multiplied by $4L_{f}\left|{{\cal I}}\right|{c_{V}}^{-1}$ . Denote now $e_{j}:=\|\hat{a}_{j}-a_{j}\|$ . Using Theorem 2 and the union bound we obtain

[TABLE]

where $\tilde{C}_{E}$ equals $C_{E}$ in Theorem 2 up to factors depending on $L_{f},{c_{V}},\left|{{\cal I}}\right|$ . The result follows by using (14) and defining $C_{A}$ as the maximum of $C_{A}$ in Theorem 2 and $\left|{{\cal S}_{j}}\right|\leq 2L_{f}$ . ∎

A.2 Proofs for Section 4

A.2.1 Proofs for Section 4.1

Almost linear curves allow us to find an equivalent characterization of the geodesic metric using projections onto the tangent field. This is made precise in the following Lemma and is a key ingredient to establish the metric equivalence in Proposition 6.

Lemma 21.

Let $\gamma:\mathfrak{I}\rightarrow\mathbb{R}^{D}$ be a $\theta$ -almost linear curve. Then for ${t}^{\prime}\geq t$ and $\tilde{t}\textrm{ arbitrary}$

[TABLE]

Proof.

The upper bound follows by Cauchy-Schwarz, $\left\|{\gamma(t)-\gamma({t}^{\prime})}\right\|\leq d_{\gamma}(\gamma(t),\gamma({t}^{\prime}))$ and $\left\|{\gamma^{\prime}(\tilde{t})}\right\|=1$ . For the lower bound the fundamental theorem of calculus gives

[TABLE]

∎

Proof of Proposition 6.

We begin with an intermediate result.

Let $x=v+w,\ x^{\prime}=v^{\prime}+w^{\prime}\in\textrm{supp}(\rho_{X})$ , where $v=\pi_{\gamma}(x)$ and $v^{\prime}=\pi_{\gamma}(x^{\prime})$ , and let $S(v,v^{\prime})\subset\operatorname{Im}(\gamma)$ be the curve segment between $v$ and $v^{\prime}$ . Assume $\gamma|_{S(v,v^{\prime})}$ is $\theta$ -almost linear for $\theta>\kappa(S(v,v^{\prime}))B$ . We will show that for arbitrary $p\in\mathbb{R}^{D}$ we have

[TABLE]

For the first inequality we have $\left|{\left\langle p,x-x^{\prime}\right\rangle}\right|\leq\left\|{x-x^{\prime}}\right\|\left\|{p-a(v^{\prime})}\right\|+\left|{\left\langle a(v^{\prime}),x-x^{\prime}\right\rangle}\right|$ , by Cauchy-Schwarz. The fundamental theorem of calculus and $a(v)\perp w$ , $a(v^{\prime})\perp w^{\prime}$ then yield

[TABLE]

where we used Lemma 21 in the last step. The bound follows after dividing by $1+\kappa\left(S(v,v^{\prime})\right)B$ . For the second inequality of the intermediate result, using Lemma 21 and again the fact that $w^{\prime}\perp a(v^{\prime})$ , we get

[TABLE]

Collecting terms with $d_{\gamma}(v,v^{\prime})$ and dividing through by $\theta-\kappa\left(S(v,v^{\prime})\right)B$ yields the desired bound. Denote now for short $d:=\left|{{\cal I}}\right|+2B$ . In the context of Proposition 6, the intermediate result implies

[TABLE]

since $\kappa(S(v,v^{\prime}))\leq\kappa$ and

[TABLE]

We will now use (62) to establish Proposition 6. Using the left hand side of (62) we get

[TABLE]

where $\max_{i=1,\ldots,k}2d_{\gamma}(x,\bar{x}_{k^{*}(x)})=2d_{\gamma}(x,\bar{x}_{k^{*}(x)})$ by the definition of $k^{*}(x)$ . Then, using the right hand side of (62), the result follows by

[TABLE]

∎

A.2.2 Proofs for Section 4.2

The proof of Proposition 9 is more involved than for Proposition 6 and requires two auxiliary results that will be developed first. The first result states that for any $x\in\textrm{supp}(\rho_{X})$ with $v:=\pi_{\gamma}(x)$ for which there exists another $v^{\prime}\in\operatorname{Im}(\gamma)$ that satisfies the condition $a(v^{\prime})^{\top}(x-v^{\prime})=0$ (i.e. $x$ lies in the normal ray of $\gamma$ at $v^{\prime}$ ), the distance $\left\|{x-v^{\prime}}\right\|$ is necessarily bounded from below. The second result uses this observation to ensure equivalence of $d_{\gamma}(x,\cdot)$ and $\Delta_{\eta}({x},{\cdot})$ under suitable conditions on $\eta$ . We also notice that $\Delta_{\infty}({x},{\bar{x}_{i}})\leq 2d_{\gamma}(x,\bar{x}_{i})+(\left|{{\cal I}}\right|+2B)\left\|{\hat{a}(\bar{x}_{i})-a(\bar{x}_{i})}\right\|$ , which has been proven in (62), remains valid and will also be used here.

Lemma 22.

Assume $x\in\mathbb{R}^{D}$ has a unique projection $v:=\pi_{\gamma}(x)$ , satisfying $\left\|{x-v}\right\|\leq B<\tau_{\gamma}$ . For any $v^{\prime}\neq v\in\operatorname{Im}(\gamma)$ with $\left<{a(v^{\prime})},{(x-v^{\prime})}\right>=0$ we have $\left\|{x-v^{\prime}}\right\|\geq 2\tau_{\gamma}-B$ . Furthermore for any $x^{\prime}$ with $\left\|{x^{\prime}-v^{\prime}}\right\|\leq B<\tau_{\gamma}$ and $\pi_{\gamma}(x^{\prime})=v^{\prime}$ we have $\left\|{x-x^{\prime}}\right\|\geq 2(\tau_{\gamma}-B)$ .

Proof.

First note that by the properties of $\tau_{\gamma}$ we know that for all $z\in\mathbb{R}^{D}$ , such that $\operatorname{dist}({\operatorname{Im}(\gamma)};{z})<\tau_{\gamma}$ , there is only one $v_{z}\in\operatorname{Im}(\gamma)$ such that $\left<{a(v_{z})},{(z-v_{z})}\right>=0$ and $\left\|{z-v_{z}}\right\|<\tau_{\gamma}$ [37, Sec. 4]. Thus, $\left\|{x-v^{\prime}}\right\|\geq\tau_{\gamma}$ . Moreover, for the line $W(t)=v^{\prime}+ts$ , where $s=(x-v^{\prime})/\left\|{x-v^{\prime}}\right\|$ , we have $\operatorname{dist}({\operatorname{Im}(\gamma)};{W(t)})=\left\|{W(t)-v^{\prime}}\right\|=t$ , for all $t\in(0,\tau_{\gamma})$ and $\operatorname{dist}({\operatorname{Im}(\gamma)};{W(t)})=\tau_{\gamma}$ holds for at least one $t^{*}\in[\tau_{\gamma},\left\|{x-v^{\prime}}\right\|)$ .

We now want to show that $\left\|{W(t^{*})-x}\right\|\geq\tau_{\gamma}-B$ . Assume the contrary. Then

[TABLE]

which contradicts $\operatorname{dist}({\operatorname{Im}(\gamma)};{W(t^{*})})=\tau_{\gamma}$ . Since $W(t^{*})$ lies on a line between $v^{\prime}$ and $x$ we have

[TABLE]

The second statement follows from $\left\|{x-x^{\prime}}\right\|\geq\left\|{x-v^{\prime}}\right\|-\left\|{v^{\prime}-x^{\prime}}\right\|\geq 2(\tau_{\gamma}-B)$ . ∎
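A simple worked example shows that the bounds of Lemma 22 cannot be improved in general. The sketch assumes that $\operatorname{Im}(\gamma)$ is a circle of radius $\tau$ in the plane, whose reach equals $\tau$; this closed curve and the specific points are illustrative choices and are not taken from the paper.

```python
import numpy as np

tau = 1.0                                  # radius and reach of the circle (assumption)
B = 0.3                                    # tube radius, B < tau

v = np.array([tau, 0.0])                   # projection of x onto the circle
x = v - B * np.array([1.0, 0.0])           # point at distance B from v, inside the circle
v_prime = np.array([-tau, 0.0])            # antipodal point of v
a_v_prime = np.array([0.0, -1.0])          # unit tangent of the circle at v_prime
x_prime = v_prime + B * np.array([1.0, 0.0])   # point with projection v_prime

# Check numerically that v is the closest point of the circle to x.
angles = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False)
circle = tau * np.column_stack([np.cos(angles), np.sin(angles)])
closest = circle[np.argmin(np.linalg.norm(circle - x, axis=1))]

normal_condition = a_v_prime @ (x - v_prime)   # x lies on the normal ray at v_prime
dist_x_vprime = np.linalg.norm(x - v_prime)
dist_x_xprime = np.linalg.norm(x - x_prime)
```

Here $\left\|{x-v^{\prime}}\right\|=2\tau-B$ and $\left\|{x-x^{\prime}}\right\|=2(\tau-B)$, so both lower bounds of the lemma hold with equality in this configuration.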

Lemma 23.

Assume (A5) for $B=(1/2-q)\tau_{\gamma}$ for some $q>0$ . Let $x\in\textrm{supp}(\rho_{X})$ be arbitrary and let $\bar{x}\in\textrm{supp}(\rho_{X})\cap{\cal B}_{\left\|{\cdot}\right\|}(x,\tau_{\gamma})$ with tangent approximation $\hat{a}(\bar{x})$ . If

[TABLE]

we have $d_{\gamma}(x,\bar{x})\leq 4\Delta_{\tau_{\gamma}}({x},{\bar{x}})+4\tau_{\gamma}\left\|{\hat{a}(\bar{x})-a(\bar{x})}\right\|$ .

Proof.

Let $v:=\pi_{\gamma}(x),\bar{v}:=\pi_{\gamma}(\bar{x})$ , $\omega=\left|{\hat{a}(\bar{x})^{\top}(x-\bar{x})}\right|$ and consider the point $\tilde{x}:=\bar{v}+Q(\bar{x})(x-\bar{v})$ , where $Q(\bar{x}):=\mathsf{Id}-a(\bar{x})a(\bar{x})^{\top}$ . The point $\tilde{x}$ satisfies $a(\bar{x})^{\top}(\tilde{x}-\bar{v})=0$ and, since $a(\bar{x})\perp\bar{x}-\bar{v}$ , it is contained within a small ball around $x$ bounded by

[TABLE]

This also implies that $\tilde{x}$ itself is not too far from $\operatorname{Im}(\gamma)$ because using the triangle inequality we get

[TABLE]

By $\omega+\tau_{\gamma}\varepsilon_{a}+B<q\tau_{\gamma}+B\leq 1/2\tau_{\gamma}$ , it follows that $\tilde{x}$ has a unique projection $\tilde{v}:=\pi_{\gamma}(\tilde{x})$ . From now on, the proof follows two steps. We first show $\pi_{\gamma}(\tilde{x})=\bar{v}$ by contradiction, which is then used for bounding $d_{\gamma}(x,\bar{x})$ .

Assume $\pi_{\gamma}(\tilde{x})\neq\bar{v}$ . We have constructed $\tilde{x}$ with $a(\bar{x})^{\top}(\tilde{x}-\bar{v})=0$ and $\left\|{\tilde{x}-\tilde{v}}\right\|\leq\omega+\tau_{\gamma}\varepsilon_{a}+B$ . Lemma 22 immediately implies the lower bound

[TABLE]

Using then $\bar{x}\in{\cal B}_{\left\|{\cdot}\right\|}(x,\tau_{\gamma})$ , $\|x-\bar{x}\|\geq\|\tilde{x}-\bar{v}\|-\|x-\tilde{x}\|-\|\bar{v}-\bar{x}\|$ from the triangle inequality, and $B=(1/2-q)\tau_{\gamma}$ , we have the inequality

[TABLE]

This implies with $\omega\geq(q-\varepsilon_{a})\tau_{\gamma}$ a contradiction to Condition (63).

Using first $\max\{\left\|{x-v}\right\|,\left\|{\tilde{x}-\tilde{v}}\right\|\}\leq\omega+\tau_{\gamma}\varepsilon_{a}+B$ and the Lipschitz-property of $\pi_{\gamma}$ (see [9, Theorem 4.8 (8)]), and then $\omega<(q-\varepsilon_{a})\tau_{\gamma}$ , we get

[TABLE]

Furthermore, since $\omega<(q-\varepsilon_{a})\tau_{\gamma}<(1/4-\varepsilon_{a})\tau_{\gamma}$ we have $\left\|{v-\bar{v}}\right\|\leq 2(\omega+\tau_{\gamma}\varepsilon_{a})<\tau_{\gamma}/2$ , and thus we can apply [37, Proposition 6.3] to get

[TABLE]

∎

Proof of Proposition 9.

Define $\varepsilon_{a}:=\max_{i\in[N]}\left\|{\hat{a}(\bar{x}_{i})-a(\bar{x}_{i})}\right\|$ and note that $k\delta<(\eta-2B)$ implies $\left\|{x-\bar{x}_{i^{*}(x)}}\right\|\leq k\delta+2B<\eta$ , hence $\bar{x}_{i^{*}(x)}\in{\cal B}_{\left\|{\cdot}\right\|}(x,\eta)$ for all $i\in[k]$ . This similarly implies $\{\bar{x}_{i(x)}:i\in[k]\}\subset{\cal B}_{\left\|{\cdot}\right\|}(x,\eta)$ , and by using the left hand side of (62) we get the bound

[TABLE]

By Lemma 23 we get $d_{\gamma}(x,\bar{x}_{k(x)})\leq 4\Delta_{\eta}({x},{\bar{x}_{k(x)}})+4\tau_{\gamma}\varepsilon_{a}$ and the result follows from

[TABLE]

∎
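The way a localized proxy metric enters a kNN-type estimator can be sketched as follows. The code assumes that the proxy takes the form $\Delta_{\eta}(x,\bar{x})=|\hat{a}(\bar{x})^{\top}(x-\bar{x})|$ when $\left\|{x-\bar{x}}\right\|\leq\eta$ and $+\infty$ otherwise, which is suggested by the quantity $\omega$ in the proof of Lemma 23 but is not stated in this appendix; the function names `proxy_distance` and `knn_predict` and the single-index toy data are hypothetical.

```python
import numpy as np

def proxy_distance(x, x_bar, a_hat_bar, eta):
    """Assumed proxy: |a_hat(x_bar)^T (x - x_bar)| inside the eta-ball around x, else infinity."""
    if np.linalg.norm(x - x_bar) > eta:
        return np.inf
    return abs(a_hat_bar @ (x - x_bar))

def knn_predict(x, X_train, Y_train, A_hat, k, eta):
    """Average the responses of the k training points closest to x in the proxy distance."""
    dists = np.array([proxy_distance(x, xb, ab, eta) for xb, ab in zip(X_train, A_hat)])
    order = np.argsort(dists)[:k]
    order = order[np.isfinite(dists[order])]       # drop points outside the eta-ball
    if order.size == 0:
        raise ValueError("no training point within distance eta of x")
    return Y_train[order].mean()

# Toy single index data (assumption): f(x) = a^T x with exact tangent estimates.
rng = np.random.default_rng(5)
N, D = 2000, 3
a = np.array([1.0, 0.0, 0.0])
X_train = rng.uniform(0.0, 1.0, size=(N, D))
Y_train = X_train @ a
A_hat = np.tile(a, (N, 1))                         # one tangent estimate per training point

x_query = np.array([0.5, 0.5, 0.5])
prediction = knn_predict(x_query, X_train, Y_train, A_hat, k=5, eta=1.0)
```

Because the proxy only measures displacement along the estimated tangent, neighbors are selected by closeness along the curve rather than in the ambient space, which is the effect that Propositions 6 and 9 quantify.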

A.3 Referenced results

Theorem 24 (Matrix Bernstein, 6.1.1. in [43]).

Consider a finite sequence $S_{k}$ of independent, random matrices, with common dimension $d_{1}\times d_{2}$ and assume that $\mathbb{E}[S_{k}]=\boldsymbol{0},$ and $\left\|{S_{k}}\right\|\leq L,\,\forall k.$ Define the random matrix $S=\sum_{k=1}^{N}S_{k}$ , and the matrix variance statistic

[TABLE]

Then for all $\epsilon\geq 0$ we have the tail bound

[TABLE]

Remark 25.

Let us make a short comment regarding Theorem 24. Jensen’s inequality gives

[TABLE]

Hence, it is sufficient to bound $\mathbb{E}\left\|{S}\right\|^{2}$ . Moreover, (67) holds if we replace $m(S)$ with its upper bound $\mu\geq m(S)$ . Rewriting now the right hand side of (67) as

[TABLE]

for $u>0$ , leads to a quadratic equation for $\epsilon$ , the solution of which is given as

[TABLE]

Algebraic manipulation shows that this can be bounded by $\epsilon\leq C\max\left(L,\sqrt{\nu}\right)\left(u+\log(d_{1}+d_{2})\right)$ for some universal constant $C>0$ . Finally, monotonicity of probability gives $\mathbb{P}\left(\left\|{S}\right\|\geq\epsilon\right)\geq\mathbb{P}\left(\left\|{S}\right\|\geq\epsilon^{\prime}\right)$ for $\epsilon\leq\epsilon^{\prime}$ . Thus, for every $u>0$

[TABLE]
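The computation in Remark 25 can be traced numerically. The sketch below assumes the standard form of the matrix Bernstein tail bound, $\mathbb{P}(\left\|{S}\right\|\geq\epsilon)\leq(d_{1}+d_{2})\exp\left(-\frac{\epsilon^{2}/2}{\nu+L\epsilon/3}\right)$, with `nu` denoting the matrix variance statistic or an upper bound on it; the function names are hypothetical, and the constant `C = 3` is one admissible choice rather than the constant used in the paper.

```python
import math

def bernstein_tail(eps, L, nu, d1, d2):
    """Right-hand side of the (assumed standard) matrix Bernstein tail bound."""
    return (d1 + d2) * math.exp(-(eps ** 2 / 2.0) / (nu + L * eps / 3.0))

def bernstein_deviation(u, L, nu, d1, d2):
    """Solve bernstein_tail(eps) = exp(-u) for eps via the quadratic formula."""
    w = u + math.log(d1 + d2)
    # eps^2 - (2 w L / 3) eps - 2 w nu = 0, positive root:
    return w * L / 3.0 + math.sqrt((w * L / 3.0) ** 2 + 2.0 * w * nu)

# Hypothetical parameter values for illustration.
u, L, nu, d1, d2 = 2.0, 0.5, 4.0, 10, 10
eps = bernstein_deviation(u, L, nu, d1, d2)
tail_at_eps = bernstein_tail(eps, L, nu, d1, d2)
simple_bound = 3.0 * max(L, math.sqrt(nu)) * (u + math.log(d1 + d2))
```

Plugging `eps` back into the tail bound returns $\exp(-u)$, and `simple_bound` illustrates the simplification $\epsilon\leq C\max(L,\sqrt{\nu})(u+\log(d_{1}+d_{2}))$ stated in the remark.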

Bibliography

[1] Adragni, K. P. & Cook, R. D. (2009) Sufficient dimension reduction and prediction in regression. Philosophical Transactions of the Royal Society, 367, 4385–4405.
[2] Balabdaoui, F., Groeneboom, P. & Hendrickx, K. (2019) Score estimation in the monotone single-index model. Scandinavian Journal of Statistics, 46(2), 517–544.
[3] Bickel, P. J., Li, B. et al. (2007) Local polynomial regression on unknown manifolds. In Complex Datasets and Inverse Problems, vol. 54, pp. 177–186. Institute of Mathematical Statistics.
[4] Brillinger, D. R. (1983) A generalized linear model with “Gaussian” regressor variables. In Selected Works of David Brillinger, pp. 589–606. Springer.
[5] Chen, Y. & Samworth, R. J. (2016) Generalized additive and index models with shape constraints. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(4), 429–754.
[6] Cheng, L., Zeng, P. & Zhu, Y. (2017) BS-SIM: An effective variable selection method for high-dimensional single index model. Electron. J. Statist., 11(2), 3522–3548.
[7] Dalalyan, A. S., Juditsky, A. & Spokoiny, V. (2008) A new algorithm for estimating the effective dimension-reduction subspace. Journal of Machine Learning Research, 9, 1647–1678.
[8] Dennis Cook, R. (2000) SAVE: a method for dimension reduction and graphics in regression. Communications in Statistics - Theory and Methods, 29, 2109–2121.
