Nonparametric Regression on Low-Dimensional Manifolds using Deep ReLU Networks: Function Approximation and Statistical Recovery

Minshuo Chen; Haoming Jiang; Wenjing Liao; Tuo Zhao

arXiv:1908.01842·cs.LG·February 24, 2022

TL;DR

This paper demonstrates that deep ReLU networks can effectively perform nonparametric regression on data supported on low-dimensional manifolds, achieving fast convergence rates that depend on the intrinsic dimension rather than the ambient space.

Contribution

The paper introduces a deep ReLU network architecture for nonparametric regression on manifolds and proves its convergence rate depends on the intrinsic dimension, showing adaptivity to geometric structures.

Findings

01

Convergence rate of $n^{-\frac{2(s+\alpha)}{2(s+\alpha)+d}}\log^{3}n$ for the mean squared error.

02

Deep ReLU networks adapt to low-dimensional manifold structures in high-dimensional data.

03

Theoretical analysis supports the effectiveness of deep networks for geometric data approximation.

Abstract

Real world data often exhibit low-dimensional geometric structures, and can be viewed as samples near a low-dimensional manifold. This paper studies nonparametric regression of Hölder functions on low-dimensional manifolds using deep ReLU networks. Suppose $n$ training data are sampled from a Hölder function in $\mathcal{H}^{s,\alpha}$ supported on a $d$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^{D}$, with sub-gaussian noise. A deep ReLU network architecture is designed to estimate the underlying function from the training data. The mean squared error of the empirical estimator is proved to converge in the order of $n^{-\frac{2(s+\alpha)}{2(s+\alpha)+d}}\log^{3}n$. This result shows that deep ReLU networks give rise to a fast convergence rate depending on the data intrinsic dimension $d$, which is usually much smaller than the ambient dimension $D$. It…

Equations

y_{i} = f_{0} (x_{i}) + ξ_{i},

f (x) = W_{L} \cdot ReLU (W_{L - 1} \dots ReLU (W_{1} x + b_{1}) \dots + b_{L - 1}) + b_{L},

\mathcal{F}(R,\kappa,L,p,K)

\left\lVert f\right\rVert_{\infty}\leq R,\ \left\lVert W_{i}\right\rVert_{\infty,\infty}\leq\kappa,\ \left\lVert\mathbf{b}_{i}\right\rVert_{\infty}\leq\kappa~\textrm{for}~i=1,\dots,L,\ \sum_{i=1}^{L}\left\lVert W_{i}\right\rVert_{0}+\left\lVert\mathbf{b}_{i}\right\rVert_{0}\leq K\big\},

\widehat{f}_{n}=\operatorname*{argmin}_{f\in\mathcal{F}(R,\kappa,L,p,K)}\widehat{\mathcal{R}}_{n}(f)=\operatorname*{argmin}_{f\in\mathcal{F}(R,\kappa,L,p,K)}\frac{1}{n}\sum_{i=1}^{n}\left(f(\mathbf{x}_{i})-y_{i}\right)^{2}.

L=\widetilde{O}\left(\frac{s+\alpha}{2(s+\alpha)+d}\log n\right),\quad p=\widetilde{O}\left(n^{\frac{d}{2(s+\alpha)+d}}\right),\quad K=\widetilde{O}\left(\frac{s+\alpha}{2(s+\alpha)+d}n^{\frac{d}{2(s+\alpha)+d}}\log n\right),\quad R=\left\lVert f_{0}\right\rVert_{\infty},

\mathbb{E}\left[\int_{\mathcal{M}}\left(\widehat{f}_{n}(\mathbf{x})-f_{0}(\mathbf{x})\right)^{2}d\mathcal{D}_{x}(\mathbf{x})\right]\leq c(R^{2}+\sigma^{2})\left(n^{-\frac{2(s+\alpha)}{2(s+\alpha)+d}}+\frac{D}{n}\right)\log^{3}n,

ϕ \circ ψ^{- 1} : ψ (U \cap V) \mapsto ϕ (U \cap V) and ψ \circ ϕ^{- 1} : ϕ (U \cap V) \mapsto ψ (U \cap V)

U_{1} = {(x, y, z) ∣ x > 0}, P_{1} (x, y, z) = (y, z), U_{2} = {(x, y, z) ∣ x < 0}, P_{2} (x, y, z) = (y, z),

U_{3} = {(x, y, z) ∣ y > 0}, P_{3} (x, y, z) = (x, z), U_{4} = {(x, y, z) ∣ y < 0}, P_{4} (x, y, z) = (x, z),

U_{5} = {(x, y, z) ∣ z > 0}, P_{5} (x, y, z) = (x, y), U_{6} = {(x, y, z) ∣ z < 0}, P_{6} (x, y, z) = (x, y) .

f \circ ψ^{- 1} = (f \circ ϕ^{- 1}) \circ (ϕ \circ ψ^{- 1}) .

\left|D^{\mathbf{s}}(f\circ{\sf P}_{i}^{-1})\big|_{{\sf P}_{i}(\mathbf{x}_{1})}-D^{\mathbf{s}}(f\circ{\sf P}_{i}^{-1})\big|_{{\sf P}_{i}(\mathbf{x}_{2})}\right|\leq\left\lVert{\sf P}_{i}(\mathbf{x}_{1})-{\sf P}_{i}(\mathbf{x}_{2})\right\rVert_{2}^{\alpha}.

f_{i} \circ ϕ_{i}^{- 1} = (f \circ ϕ_{i}^{- 1}) \times (ρ_{i} \circ ϕ_{i}^{- 1})

\mathcal{C}(\mathcal{M})=\left\{\mathbf{x}\in\mathbb{R}^{D}:\exists\,\mathbf{p}\neq\mathbf{q}\in\mathcal{M},\ \left\lVert\mathbf{p}-\mathbf{x}\right\rVert_{2}=\left\lVert\mathbf{q}-\mathbf{x}\right\rVert_{2}=\inf_{\mathbf{y}\in\mathcal{M}}\left\lVert\mathbf{y}-\mathbf{x}\right\rVert_{2}\right\}

\tau=\inf_{\mathbf{x}\in\mathcal{M},\,\mathbf{y}\in\mathcal{C}(\mathcal{M})}\left\lVert\mathbf{x}-\mathbf{y}\right\rVert_{2}.

L=O\left(\frac{s+\alpha}{2(s+\alpha)+d}\log n\right),\quad p=O\left(n^{\frac{d}{2(s+\alpha)+d}}\right),\quad K=O\left(\frac{s+\alpha}{2(s+\alpha)+d}n^{\frac{d}{2(s+\alpha)+d}}\log n\right),

R=\left\lVert f_{0}\right\rVert_{\infty},\quad\textrm{and}\quad\kappa=O\left(\max\{1,B,d,\tau^{2}\}\right).

\mathbb{E}\left[\int_{\mathcal{M}}\left(\widehat{f}_{n}(\mathbf{x})-f_{0}(\mathbf{x})\right)^{2}d\mathcal{D}_{x}(\mathbf{x})\right]\leq c(R^{2}+\sigma^{2})\left(n^{-\frac{2(s+\alpha)}{2(s+\alpha)+d}}+\frac{D}{n}\right)\log^{3}n,

g(a)=\max\{-R,\min\{a,R\}\}=\mathrm{ReLU}(a+R)-\mathrm{ReLU}(a-R)-R.

g (x) = 2 ReLU (x) - 4 ReLU (x - 1/2) + 2 ReLU (x - 1),

C_{\mathcal{M}}\leq\left\lceil\frac{SA(\mathcal{M})}{r^{d}}T_{d}\right\rceil,

T_{c_{i}} (M) = span (v_{i 1}, \dots, v_{i d}),

ϕ_{i} (x) = b_{i} (V_{i}^{⊤} (x - c_{i}) + u_{i}) \in [0, 1]^{d}

d_{i}^{2}(\mathbf{x})=\left\lVert\mathbf{x}-\mathbf{c}_{i}\right\rVert_{2}^{2}=\sum_{j=1}^{D}(x_{j}-c_{i,j})^{2}

d_{i}^{2}(\mathbf{x})=4B^{2}\sum_{j=1}^{D}h_{\rm sq}\left(\frac{x_{j}-c_{i,j}}{2B}\right).

\mathds{1}_{\Delta}(a)=\begin{cases}1,&a\leq r^{2}-\Delta+4B^{2}D\nu,\\-\frac{1}{\Delta-8B^{2}D\nu}a+\frac{r^{2}-4B^{2}D\nu}{\Delta-8B^{2}D\nu},&a\in[r^{2}-\Delta+4B^{2}D\nu,\,r^{2}-4B^{2}D\nu],\\0,&a>r^{2}-4B^{2}D\nu,\end{cases}

g_{k}(a)=\underbrace{g\circ\dots\circ g}_{k}(a)=\begin{cases}0,&a<(1-2^{-k})(r^{2}-4B^{2}D\nu),\\2^{k}(a-r^{2}+4B^{2}D\nu)+r^{2}-4B^{2}D\nu,&a\in[(1-\frac{1}{2^{k}})(r^{2}-4B^{2}D\nu),\,r^{2}-4B^{2}D\nu],\\r^{2}-4B^{2}D\nu,&a>r^{2}-4B^{2}D\nu.\end{cases}

f=\sum_{i=1}^{C_{\mathcal{M}}}f_{i}\quad\textrm{with}\quad f_{i}=f\rho_{i},

\left|D^{\mathbf{s}}(f_{i}\circ\phi_{i}^{-1})\big|_{\phi_{i}(\mathbf{x}_{1})}-D^{\mathbf{s}}(f_{i}\circ\phi_{i}^{-1})\big|_{\phi_{i}(\mathbf{x}_{2})}\right|\leq L_{i}\left\lVert\phi_{i}(\mathbf{x}_{1})-\phi_{i}(\mathbf{x}_{2})\right\rVert_{2}^{\alpha},\quad\forall\mathbf{x}_{1},\mathbf{x}_{2}\in U_{i}.

D^{\mathbf{s}}(f_{i}\circ\phi_{i}^{-1})=D^{\mathbf{s}}(g_{1}\times g_{2})=\sum_{|\mathbf{p}|+|\mathbf{q}|=s}\binom{s}{|\mathbf{p}|}D^{\mathbf{p}}g_{1}D^{\mathbf{q}}g_{2}.

\big|D^{\mathbf{p}}g_{1}D^{\mathbf{q}}g_{2}|_{\phi_{i}(\mathbf{x}_{1})}-D^{\mathbf{p}}g_{1}D^{\mathbf{q}}g_{2}|_{\phi_{i}(\mathbf{x}_{2})}\big|

Full text

Nonparametric Regression on Low-Dimensional Manifolds using Deep ReLU Networks : Function Approximation and Statistical Recovery

Minshuo Chen, Haoming Jiang, Wenjing Liao, Tuo Zhao (alphabetical order)

Minshuo Chen, Haoming Jiang, and Tuo Zhao are affiliated with the School of Industrial and Systems Engineering at Georgia Tech; Wenjing Liao is affiliated with the School of Mathematics at Georgia Tech. Email: {mchen393, hmjiang, tourzhao, wliao60}@gatech.edu.

Abstract

Real world data often exhibit low-dimensional geometric structures, and can be viewed as samples near a low-dimensional manifold. This paper studies nonparametric regression of Hölder functions on low-dimensional manifolds using deep ReLU networks. Suppose $n$ training data are sampled from a Hölder function in $\mathcal{H}^{s,\alpha}$ supported on a $d$ -dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^{D}$ , with sub-gaussian noise. A deep ReLU network architecture is designed to estimate the underlying function from the training data. The mean squared error of the empirical estimator is proved to converge in the order of $n^{-\frac{2(s+\alpha)}{2(s+\alpha)+d}}\log^{3}n$ . This result shows that deep ReLU networks give rise to a fast convergence rate depending on the data intrinsic dimension $d$ , which is usually much smaller than the ambient dimension $D$ . It therefore demonstrates the adaptivity of deep ReLU networks to low-dimensional geometric structures of data, and partially explains the power of deep ReLU networks in tackling high-dimensional data with low-dimensional geometric structures.

1 Introduction

Deep learning has made astonishing breakthroughs in various real-world applications, such as computer vision (Krizhevsky et al., 2012; Goodfellow et al., 2014; Long et al., 2015), natural language processing (Graves et al., 2013; Bahdanau et al., 2014; Young et al., 2018), healthcare (Miotto et al., 2017; Jiang et al., 2017), and robotics (Gu et al., 2017). For example, in image classification, the winner of the $2017$ ImageNet challenge achieved a top-$5$ error rate of $2.25\%$ (Hu et al., 2018) on a data set of about $1.2$ million labeled high-resolution images in $1000$ categories. In speech recognition, Amodei et al. (2016) reported that deep neural networks outperformed humans with a $5.15\%$ word error rate on the LibriSpeech corpus constructed from audio books (Panayotov et al., 2015). This data set consists of approximately $1000$ hours of $16$ kHz read English speech from $8000$ audio books.

The empirical success of deep learning brings new challenges to the conventional wisdom of machine learning. Data sets in these applications lie in high-dimensional spaces. In the existing literature, a minimax lower bound has been established for the optimal algorithm of learning $C^{s}$ functions in $\mathbb{R}^{D}$ (Györfi et al., 2006; Tsybakov, 2008). Denote the underlying function by $f_{0}$. The minimax lower bound suggests a pessimistic sample complexity: to obtain an estimator $\widehat{f}$ with an $\epsilon$-error uniformly over all $C^{s}$ functions $f_{0}$ (i.e., $\sup_{f_{0}\in C^{s}}\|\widehat{f}-f_{0}\|_{L_{2}}\leq\epsilon$, with $\|\cdot\|_{L_{2}}$ denoting the function $L_{2}$ norm), the optimal algorithm requires a sample size $n\gtrsim\epsilon^{-\frac{2s+D}{s}}$ in the worst case (i.e., when $f_{0}$ is the most difficult for the algorithm to estimate). We instantiate this sample complexity bound on the ImageNet data set, which consists of RGB images with a resolution of $224\times 224$. The theory above suggests that, to achieve an $\epsilon$-error, the number of samples has to scale as $\epsilon^{-224\times 224\times 3/s}$, where the smoothness parameter $s$ is significantly smaller than $224\times 224\times 3$. Setting $\epsilon=0.1$ already requires a number of samples far beyond practical reach, vastly exceeding the $1.2$ million labeled images in ImageNet.
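As a rough worked illustration of the bound $n\gtrsim\epsilon^{-\frac{2s+D}{s}}$, take $\epsilon=0.1$ and $D=224\times 224\times 3$. The smoothness value $s=10$ below is an arbitrary assumption chosen only to make the arithmetic concrete; it is not a value from the paper, and constants are ignored.

```latex
\[
D = 224\times 224\times 3 = 150528,\qquad
n \gtrsim \epsilon^{-\frac{2s+D}{s}}
  = 10^{\,2 + D/s}
  = 10^{\,2 + 15052.8}
  = 10^{\,15054.8}
\quad (\epsilon = 0.1,\ s = 10).
\]
```

For comparison, $1.2$ million samples is roughly $10^{6.08}$.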

To bridge the aforementioned gap between theory and practice, we take the low-dimensional geometric structures in data sets into consideration. This is motivated by the fact that real-world data sets often exhibit low-dimensional structures. Many images consist of projections of a three-dimensional object followed by some transformations, such as rotation, translation, and skeleton. This generating mechanism induces a small number of intrinsic parameters (Hinton and Salakhutdinov, 2006; Osher et al., 2017). Speech data are composed of words and sentences following the grammar, and therefore have small degrees of freedom (Djuric et al., 2015). More broadly, visual, acoustic, textual, and many other types of data often have low-dimensional geometric structures due to rich local regularities, global symmetries, repetitive patterns, or redundant sampling (Tenenbaum et al., 2000; Roweis and Saul, 2000; Coifman et al., 2005; Allard et al., 2012). It is therefore reasonable to assume that data lie on a manifold $\mathcal{M}$ of dimension $d\ll D$ .

1.1 Summary of main results

In this paper, we study nonparametric regression problems (Wasserman, 2006; Györfi et al., 2006; Tsybakov, 2008) using neural networks that exploit the low-dimensional geometric structures of data. Specifically, we model data as samples from a probability measure supported on a $d$-dimensional Riemannian manifold $\mathcal{M}$ isometrically embedded in $\mathbb{R}^{D}$, where $d\ll D$. The goal is to recover the regression function $f_{0}$ supported on $\mathcal{M}$ using the samples $S_{n}=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ with $\mathbf{x}_{i}\in\mathcal{M}$ and $y_{i}\in\mathbb{R}$. The $\mathbf{x}_{i}$'s are sampled i.i.d. from a distribution $\mathcal{D}_{x}$ on $\mathcal{M}$, and the response $y_{i}$ satisfies

$$y_{i}=f_{0}(\mathbf{x}_{i})+\xi_{i},\tag{1}$$

where the $\xi_{i}$'s are i.i.d. sub-Gaussian noise variables independent of the $\mathbf{x}_{i}$'s.

We use multi-layer ReLU (Rectified Linear Unit) neural networks to recover $f_{0}$ . ReLU networks are widely used in computer vision, speech recognition, natural language processing, etc. (Nair and Hinton, 2010; Glorot et al., 2011; Maas et al., 2013). These networks can ease the notorious vanishing gradient issue during training, which commonly arises with sigmoid or hyperbolic tangent activations (Glorot et al., 2011; Goodfellow et al., 2016). Given an input $\mathbf{x}$ , an $L$ -layer ReLU neural network computes an output as

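$$f(\mathbf{x})=W_{L}\cdot\textrm{ReLU}\left(W_{L-1}\cdots\textrm{ReLU}(W_{1}\mathbf{x}+\mathbf{b}_{1})\cdots+\mathbf{b}_{L-1}\right)+\mathbf{b}_{L},\tag{2}$$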

where $W_{1},\dots,W_{L}$ and $\mathbf{b}_{1},\dots,\mathbf{b}_{L}$ are weight matrices and vectors of proper sizes, respectively, and $\textrm{ReLU}(\cdot)$ denotes the entrywise rectified linear unit (i.e., $\textrm{ReLU}(a)=\max\{0,a\}$ ). We denote $\mathcal{F}$ as a class of neural networks with bounded weight parameters and bounded output (we refer to $\mathcal{F}$ as a ReLU network structure throughout the rest of the paper):

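$$\mathcal{F}(R,\kappa,L,p,K)=\Big\{f~\Big|~f\textrm{ is an }L\textrm{-layer ReLU network of the above form with width bounded by }p,\ \left\lVert f\right\rVert_{\infty}\leq R,\ \left\lVert W_{i}\right\rVert_{\infty,\infty}\leq\kappa,\ \left\lVert\mathbf{b}_{i}\right\rVert_{\infty}\leq\kappa~\textrm{for}~i=1,\dots,L,\ \sum_{i=1}^{L}\left\lVert W_{i}\right\rVert_{0}+\left\lVert\mathbf{b}_{i}\right\rVert_{0}\leq K\Big\},$$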

where $\left\lVert\cdot\right\rVert_{0}$ denotes the number of nonzero entries in a vector or a matrix, $\left\lVert\cdot\right\rVert_{\infty}$ denotes $\ell_{\infty}$ norm of a function or entrywise $\ell_{\infty}$ norm of a vector. For a matrix $M$ , we have $\left\lVert M\right\rVert_{\infty,\infty}=\max_{i,j}|M_{ij}|$ .
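To make these definitions concrete, the following is a minimal NumPy sketch of the forward pass of an $L$-layer ReLU network together with a check of the constraints that define $\mathcal{F}(R,\kappa,L,p,K)$. The helper names `forward` and `in_class`, the reading of $p$ as a bound on the hidden-layer widths, and the toy weights are illustrative assumptions and not part of the paper. The bound $\lVert f\rVert_{\infty}\leq R$ is only tested on a finite set of inputs, not certified globally.

```python
import numpy as np


def relu(a):
    """Entrywise rectified linear unit, ReLU(a) = max{0, a}."""
    return np.maximum(0.0, a)


def forward(x, weights, biases):
    """Evaluate W_L ReLU(W_{L-1} ... ReLU(W_1 x + b_1) ... + b_{L-1}) + b_L."""
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    return weights[-1] @ h + biases[-1]


def in_class(weights, biases, xs, R, kappa, L, p, K):
    """Check the constraints of F(R, kappa, L, p, K) (hypothetical helper).

    Assumption: p bounds the hidden-layer widths. The output bound is only
    checked on the sample points xs.
    """
    depth_ok = len(weights) == L
    width_ok = all(W.shape[0] <= p for W in weights[:-1])
    mag_ok = all(np.abs(W).max() <= kappa for W in weights) and all(
        np.abs(b).max() <= kappa for b in biases
    )
    nonzeros = sum(
        np.count_nonzero(W) + np.count_nonzero(b) for W, b in zip(weights, biases)
    )
    out_ok = all(np.abs(forward(x, weights, biases)).max() <= R for x in xs)
    return bool(depth_ok and width_ok and mag_ok and nonzeros <= K and out_ok)


# Toy two-layer network on R^2 (values chosen only for illustration).
weights = [np.array([[1.0, -1.0], [0.0, 0.5]]), np.array([[1.0, 1.0]])]
biases = [np.array([0.0, -0.25]), np.array([0.0])]
xs = [np.array([1.0, 0.5]), np.array([0.0, 0.0])]

print(forward(xs[0], weights, biases))
print(in_class(weights, biases, xs, R=1.0, kappa=1.0, L=2, p=2, K=6))
```

The toy network has six nonzero parameters, so it satisfies the sparsity constraint with $K=6$ but not with $K=5$.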

To obtain an estimator $\widehat{f}\in\mathcal{F}(R,\kappa,L,p,K)$ of $f_{0}$ , we minimize the empirical quadratic risk

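$$\widehat{f}_{n}=\operatorname*{argmin}_{f\in\mathcal{F}(R,\kappa,L,p,K)}\widehat{\mathcal{R}}_{n}(f)=\operatorname*{argmin}_{f\in\mathcal{F}(R,\kappa,L,p,K)}\frac{1}{n}\sum_{i=1}^{n}\left(f(\mathbf{x}_{i})-y_{i}\right)^{2}.\tag{3}$$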

The subscript $n$ emphasizes that the estimator is obtained using $n$ pairs of samples. Our theory shows that $\widehat{f}_{n}$ converges to $f_{0}$ at a fast rate depending on the intrinsic dimension $d$, under some mild regularity conditions. We assume that $f_{0}\in\mathcal{H}^{s,\alpha}(\mathcal{M})$ is an $(s+\alpha)$-Hölder function on $\mathcal{M}$, where $s>0$ is an integer and $\alpha\in(0,1]$. For the network class $\mathcal{F}(R,\kappa,L,p,K)$, we choose

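$$L=\widetilde{O}\left(\frac{s+\alpha}{2(s+\alpha)+d}\log n\right),\quad p=\widetilde{O}\left(n^{\frac{d}{2(s+\alpha)+d}}\right),\quad K=\widetilde{O}\left(\frac{s+\alpha}{2(s+\alpha)+d}n^{\frac{d}{2(s+\alpha)+d}}\log n\right),\quad R=\left\lVert f_{0}\right\rVert_{\infty},$$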

and set $\kappa$ as a constant depending on $s$ , $f_{0}$ , and $\mathcal{M}$ . Here we use $\widetilde{O}$ to hide factors depending on $s,d$ and logarithmic factors (e.g., $\log D$ ). Then the empirical minimizer $\widehat{f}_{n}$ of (3) gives rise to

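$$\mathbb{E}\left[\int_{\mathcal{M}}\left(\widehat{f}_{n}(\mathbf{x})-f_{0}(\mathbf{x})\right)^{2}d\mathcal{D}_{x}(\mathbf{x})\right]\leq c(R^{2}+\sigma^{2})\left(n^{-\frac{2(s+\alpha)}{2(s+\alpha)+d}}+\frac{D}{n}\right)\log^{3}n,$$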

where the expectation is taken over the training samples $S_{n}$ , $\sigma^{2}$ is the variance proxy of sub-Gaussian noise $\xi_{i}$ , and $c$ is a constant depending on $\log D$ , $s$ , $\kappa$ , and $\mathcal{M}$ (see a formal statement in Theorem 2).

Our theory implies that, in order to estimate an $(s+\alpha)$ -Hölder function up to an $\epsilon$ -error, the sample complexity is $n\gtrsim\epsilon^{-\frac{2(s+\alpha)+d}{s+\alpha}}$ up to a log factor. This sample complexity depends on the intrinsic dimension $d$ , and thus largely improves on existing theories of nonparametric regression using neural networks, where the sample complexity scales as $\widetilde{O}(\epsilon^{-\frac{2(s+\alpha)+D}{s+\alpha}})$ (Hamers and Kohler, 2006; Kohler and Krzyżak, 2005, 2016; Kohler and Mehnert, 2011; Schmidt-Hieber, 2017). Our result partially explains the success of deep ReLU neural networks in tackling high-dimensional data with low-dimensional geometric structures.
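The gap between the two sample complexities can be sketched numerically. The snippet below compares $\epsilon^{-\frac{2(s+\alpha)+D}{s+\alpha}}$ with $\epsilon^{-\frac{2(s+\alpha)+d}{s+\alpha}}$ on a $\log_{10}$ scale, ignoring constants and logarithmic factors. The values $s+\alpha=2.5$ and $d=20$ are hypothetical choices for illustration only; $D$ is the ambient dimension of a $224\times 224$ RGB image.

```python
import math


def log10_sample_size(eps, smoothness, dim):
    """log10 of eps^{-(2*smoothness + dim)/smoothness}, constants and logs ignored."""
    return -(2.0 * smoothness + dim) / smoothness * math.log10(eps)


eps = 0.1
smoothness = 2.5        # s + alpha, hypothetical value
D = 224 * 224 * 3       # ambient dimension of a 224 x 224 RGB image
d = 20                  # intrinsic dimension, hypothetical value

ambient = log10_sample_size(eps, smoothness, D)
intrinsic = log10_sample_size(eps, smoothness, d)

print(f"ambient-dimension rate:   n >~ 10^{ambient:.1f}")
print(f"intrinsic-dimension rate: n >~ 10^{intrinsic:.1f}")
```

Under these assumed values the exponent drops from about $60213$ to $10$, which is the kind of reduction the dependence on $d$ instead of $D$ makes possible.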

An ingredient in our analysis is an efficient universal approximation theory of deep ReLU networks for $(s+\alpha)$-Hölder functions on $\mathcal{M}$ (Theorem 1). A preliminary version of the approximation theory appeared in Chen et al. (2019). Specifically, we show that, in order to uniformly approximate $(s+\alpha)$-Hölder functions on a $d$-dimensional manifold with an $\epsilon$-error, the network consists of at most $\widetilde{O}(\log(1/\epsilon)+\log D)$ layers and $\widetilde{O}(\epsilon^{-d/(s+\alpha)}\log(1/\epsilon)+D\log(1/\epsilon)+D\log D)$ neurons and weight parameters. The network size in our approximation theory depends only weakly on the data dimension $D$, which significantly improves on existing universal approximation theories of neural networks (Barron, 1993; Mhaskar, 1996; Lu et al., 2017; Hanin, 2017; Yarotsky, 2017), where the network size scales as $\widetilde{O}(\epsilon^{-D/(s+\alpha)})$. Figure 1 illustrates the huge gap between the network sizes used in practice (Tan and Le, 2019) and the required size predicted by existing theories, e.g., Yarotsky (2017), for the ImageNet data set. Our approximation theory partially bridges this gap by exploiting the intrinsic geometric structures of data, and justifies why neural networks of moderate size have achieved great success in various applications. Meanwhile, our network size also matches its lower bound up to logarithmic factors for a given manifold $\mathcal{M}$ (see Proposition 2).

1.2 Related Work

Nonparametric regression has been widely studied in statistics. A variety of methods have been proposed to estimate the regression function, including kernel methods, wavelets, splines, and local polynomials (Wahba, 1990; Altman, 1992; Fan and Gijbels, 1996; Tsybakov, 2008; Györfi et al., 2006). Nonetheless, regression using deep ReLU networks received limited study until recently. The earliest works focused on neural networks with a single hidden layer and smooth activations, e.g., sigmoidal and sinusoidal functions (Barron, 1991; McCaffrey and Gallant, 1994). Later results achieved the minimax lower bound for the mean squared error in the order of $O(n^{-\frac{2s}{2s+D}})$ up to a logarithmic factor for $C^{s}$ functions in $\mathbb{R}^{D}$ (Hamers and Kohler, 2006; Kohler and Krzyżak, 2005, 2016; Kohler and Mehnert, 2011). Theories for deep ReLU networks were developed in Schmidt-Hieber (2017), where the estimate matches the minimax lower bound up to a logarithmic factor for Hölder functions. Extensions to more general function spaces, such as Besov spaces, can be found in Suzuki (2019), and results for classification problems can be found in Kim et al. (2018) and Ohn and Kim (2019).

The rates of convergence in the results above cannot fully explain the success of deep learning, since they suffer from the curse of dimensionality when $D$ is large. Fortunately, many real-world data sets exhibit low-dimensional geometric structures. It has been demonstrated that some classical methods are adaptive to the low-dimensional structures of data sets, and perform as well as if the low-dimensional structures were known. Results in this direction include local linear regression (Bickel and Li, 2007; Cheng and Wu, 2013), multiscale polynomial regression (Liao et al., 2021), $k$-nearest neighbor (Kpotufe, 2011), kernel regression (Kpotufe and Garg, 2013), and Bayesian Gaussian process regression (Yang et al., 2015), where optimal rates depending on the intrinsic dimension were proved for functions having the second order of continuity (Bickel and Li, 2007), globally Lipschitz functions (Kpotufe, 2011), and Hölder functions with Hölder index no more than $1$ (Kpotufe and Garg, 2013).

Recently, several independent works (Schmidt-Hieber, 2019; Nakada and Imaizumi, 2020; Cloninger and Klock, 2020) justified the adaptability of deep neural networks to low-dimensional data structures. Schmidt-Hieber (2019) considered function approximation and regression of Hölder functions on a low-dimensional manifold, which is similar to the setup in this paper. The proofs in Schmidt-Hieber (2019) and this paper both utilize a collection of charts to map each point on $\mathcal{M}$ into a local coordinate in $\mathbb{R}^{d}$, and then approximate functions in $\mathbb{R}^{d}$. There are two differences in the detailed proofs: (1) Exploiting the positive reach of $\mathcal{M}$, we construct local coordinates on the manifold given by orthogonal projections onto the tangent spaces, while Schmidt-Hieber (2019) assumed the existence of smooth local coordinates; (2) A main novelty of our work is to explicitly construct a chart determination sub-network which assigns each data point to its proper chart. In Schmidt-Hieber (2019), the chart determination is realized by the partition of unity. In order to approximate functions in $\mathcal{H}^{s,\alpha}(\mathcal{M})$, Schmidt-Hieber (2019) required a uniform upper bound on the derivatives of each coordinate map and each function in the partition of unity, up to order $(s+\alpha)D/d$. Our proof does not rely on such regularity conditions depending on the ambient dimension $D$. To describe the intrinsic dimensionality of data, Nakada and Imaizumi (2020) applied the notion of Minkowski dimension, which can be defined for a broader class of sets without smoothness restrictions. The intrinsic dimension of manifolds and the Minkowski dimension are different notions for low-dimensional sets, and one does not naturally imply the other.
Schmidt-Hieber (2019) and Nakada and Imaizumi (2020) established an $O(n^{-\frac{2(s+\alpha)}{2(s+\alpha)+d}})$ convergence rate of the mean squared error for learning functions in $\mathcal{H}^{s,\alpha}(\mathcal{M})$, where $d$ is the manifold dimension in Schmidt-Hieber (2019) and the Minkowski dimension in Nakada and Imaizumi (2020). Recently, Cloninger and Klock (2020) studied the approximation and regression error of ReLU neural networks for a class of functions of the form $f(\mathbf{x})=g(\pi_{\mathcal{M}}(\mathbf{x}))$, where $\mathbf{x}$ is near the low-dimensional manifold $\mathcal{M}$, $\pi_{\mathcal{M}}$ is a projection onto $\mathcal{M}$, and $g$ is a Hölder function on $\mathcal{M}$.

A crucial ingredient in the statistical analysis of neural networks is their universal approximation ability. Early works in the literature justified the existence of two-layer networks with continuous sigmoidal activations (a function $\sigma(x)$ is sigmoidal if $\sigma(x)\rightarrow 0$ as $x\rightarrow-\infty$ and $\sigma(x)\rightarrow 1$ as $x\rightarrow\infty$) for a universal approximation of continuous functions in a unit hypercube (Irie and Miyake, 1988; Funahashi, 1989; Cybenko, 1989; Hornik, 1991; Chui and Li, 1992; Leshno et al., 1993). In these works, the number of neurons was not explicitly given. Later, Barron (1993) and Mhaskar (1996) proved that the number of neurons can grow as $\epsilon^{-D/2}$, where $\epsilon$ is the uniform approximation error. Recently, Lu et al. (2017), Hanin (2017), and Daubechies et al. (2019) extended the universal approximation theory to networks of bounded width with ReLU activations. The depth of such networks grows exponentially with respect to the dimension of data. Yarotsky (2017) showed that ReLU neural networks can uniformly approximate functions in Sobolev spaces, where the network size scales exponentially with respect to the data dimension and matches the lower bound. Zhou (2019) also developed a universal approximation theory for deep convolutional neural networks (Krizhevsky et al., 2012), where the depth of the network scales exponentially with respect to the data dimension.

The aforementioned results focus on functions on a compact subset (e.g., $[0,1]^{D}$) of $\mathbb{R}^{D}$. Function approximation on manifolds has been well studied using classical methods, such as local polynomials (Bickel and Li, 2007) and wavelets (Coifman and Maggioni, 2006). However, studies using neural networks are limited. Two notable works are Chui and Mhaskar (2016) and Shaham et al. (2018). In Chui and Mhaskar (2016), high order differentiable functions on manifolds are approximated by neural networks with smooth activations, e.g., sigmoid activations and rectified quadratic unit functions ($\max^{2}\{0,x\}$). These smooth activations are not commonly used in mainstream applications such as computer vision (Krizhevsky et al., 2012; Long et al., 2015; Hu et al., 2018). In Shaham et al. (2018), a $4$-layer network with ReLU activations was proposed to approximate $C^{2}$ functions on low-dimensional manifolds. This theory does not cover arbitrary $C^{s}$ functions. We are also aware of a concurrent work, Shen et al. (2019), which established an approximation theory of ReLU networks for Hölder functions in terms of a modulus of continuity. When the target function belongs to the Hölder class $\mathcal{H}^{0,\alpha}$ supported in a neighborhood of a $d$-dimensional manifold embedded in $\mathbb{R}^{D}$, Shen et al. (2019) constructed a ReLU network which yields an approximation error in the order of $N^{-2\alpha/{d_{\delta}}}L^{-2\alpha/{d_{\delta}}}$, where $N$ and $L$ are the width and depth of the network, and $d<d_{\delta}<D$. Their proof takes a different approach from ours: they first construct a piecewise constant function to approximate the target function, and then implement the piecewise constant function using a ReLU network. The higher order smoothness of $\mathcal{H}^{s,\alpha}$ functions with $s+\alpha>1$ is not exploited due to the use of piecewise constant approximations.

1.3 Roadmap and Notations

The rest of the paper is organized as follows. Section 2 presents a brief introduction to manifolds and functions on manifolds. Section 3 presents a statistical estimation theory of functions on low-dimensional manifolds using deep ReLU neural networks, and a universal approximation theory. Section 4 sketches the proof of the approximation theory. Section 5 sketches the proof of the statistical estimation theory in Section 3, with the detailed proofs deferred to the Appendix. Section 6 concludes the paper.

We use bold-faced letters to denote vectors, and normal font letters with a subscript to denote their coordinates, e.g., $\mathbf{x}\in\mathbb{R}^{d}$ with $x_{k}$ being the $k$-th coordinate of $\mathbf{x}$. Given a vector $\mathbf{s}=[s_{1},\dots,s_{d}]^{\top}\in\mathbb{N}^{d}$, we define $\mathbf{s}!=\prod_{i=1}^{d}s_{i}!$ and $|\mathbf{s}|=\sum_{i=1}^{d}s_{i}$. We define $\mathbf{x}^{\mathbf{s}}=\prod_{i=1}^{d}x_{i}^{s_{i}}$. Given a function $f:\mathbb{R}^{d}\mapsto\mathbb{R}$, we denote its derivative as $D^{\mathbf{s}}f=\frac{\partial^{|\mathbf{s}|}f}{\partial x_{1}^{s_{1}}\dots\partial x_{d}^{s_{d}}}$, and its $\ell_{\infty}$ norm as $\left\lVert f\right\rVert_{\infty}=\max_{\mathbf{x}}|f(\mathbf{x})|$. We use $\circ$ to denote the composition operator.

2 Preliminaries

We briefly review manifolds, partition of unity, and function spaces defined on smooth manifolds. Details can be found in Tu (2010) and Lee (2006). Let $\mathcal{M}$ be a $d$ -dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^{D}$ .

Definition 1 (Chart).

A chart for $\mathcal{M}$ is a pair $(U,\phi)$ such that $U\subset\mathcal{M}$ is open and $\phi:U\mapsto\mathbb{R}^{d},$ where $\phi$ is a homeomorphism (i.e., bijective, $\phi$ and $\phi^{-1}$ are both continuous).

The open set $U$ is called a coordinate neighborhood, and $\phi$ is called a coordinate system on $U$ . A chart essentially defines a local coordinate system on $\mathcal{M}$ . Given a suitable coordinate neighborhood $U$ around a point $\mathbf{c}$ on the manifold $\mathcal{M}$ , we denote ${\sf P}_{\mathbf{c}}$ as the orthogonal projection onto the tangent space at $\mathbf{c}$ , which gives a particular coordinate system on $U$ .

Example 1 (Projection to Tangent Space).

Let $T_{\mathbf{c}}(\mathcal{M})$ be the tangent space of $\mathcal{M}$ at the point $\mathbf{c}\in\mathcal{M}$ (see the formal definition in Tu (2010, Section 8.1)). We denote $\mathbf{v}_{1},\dots,\mathbf{v}_{d}$ as an orthonormal basis of $T_{\mathbf{c}}(\mathcal{M})$ . Then the orthogonal projection onto the tangent space $T_{\mathbf{c}}(\mathcal{M})$ is defined as ${\sf P}_{\mathbf{c}}(\mathbf{x})=V^{\top}(\mathbf{x}-\mathbf{c})$ for $\mathbf{x}\in U$ with $V=[\mathbf{v}_{1},\dots,\mathbf{v}_{d}]\in\mathbb{R}^{D\times d}$ .
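As a small numerical sketch of Example 1, the snippet below builds an orthonormal tangent basis $V$ on the unit sphere in $\mathbb{R}^{3}$ (so $D=3$, $d=2$) and evaluates ${\sf P}_{\mathbf{c}}(\mathbf{x})=V^{\top}(\mathbf{x}-\mathbf{c})$. The choice of the sphere, the helper names `tangent_basis_sphere` and `project`, and the Gram-Schmidt construction of $V$ are assumptions made for illustration; the paper only requires some orthonormal basis of $T_{\mathbf{c}}(\mathcal{M})$.

```python
import numpy as np


def tangent_basis_sphere(c):
    """Return V (3 x 2) whose columns form an orthonormal basis of the
    tangent plane of the unit sphere at c (hypothetical helper)."""
    c = c / np.linalg.norm(c)
    # Start from the coordinate axis least aligned with c, remove its
    # component along c (Gram-Schmidt), then complete with a cross product.
    e = np.eye(3)[np.argmin(np.abs(c))]
    v1 = e - (e @ c) * c
    v1 = v1 / np.linalg.norm(v1)
    v2 = np.cross(c, v1)
    return np.column_stack([v1, v2])


def project(x, c, V):
    """Orthogonal projection onto the tangent space: P_c(x) = V^T (x - c)."""
    return V.T @ (x - c)


c = np.array([0.0, 0.0, 1.0])      # north pole of the unit sphere
V = tangent_basis_sphere(c)
x = np.array([0.6, 0.0, 0.8])      # another point on the sphere

print(project(x, c, V))            # local coordinates of x in the chart at c
print(project(c, c, V))            # the base point maps to the origin
```

The columns of `V` are orthonormal and orthogonal to `c`, so the projection discards the normal component of $\mathbf{x}-\mathbf{c}$ and keeps the two tangential coordinates.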

We say two charts $(U,\phi)$ and $(V,\psi)$ on $\mathcal{M}$ are $C^{k}$ compatible if and only if the transition functions,

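$$\phi\circ\psi^{-1}:\psi(U\cap V)\mapsto\phi(U\cap V)\quad\textrm{and}\quad\psi\circ\phi^{-1}:\phi(U\cap V)\mapsto\psi(U\cap V)$$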

are both $C^{k}$ .

Definition 2 ( $C^{k}$ Atlas).

A $C^{k}$ atlas for $\mathcal{M}$ is a collection of pairwise $C^{k}$ compatible charts $\{(U_{i},\phi_{i})\}_{i\in\mathcal{A}}$ such that $\bigcup_{i\in\mathcal{A}}U_{i}=\mathcal{M}$ .

Definition 3 (Smooth Manifold).

A smooth manifold is a manifold together with a $C^{\infty}$ atlas.

Classical examples of smooth manifolds are the Euclidean space $\mathbb{R}^{D}$, the torus, and the unit sphere. We further define a Riemannian manifold as a pair $(\mathcal{M},g)$, where $\mathcal{M}$ is a smooth manifold and $g$ is a Riemannian metric (Lee, 2018, Chapter 2). To better interpret Definitions 2 and 3, we give an example of a $C^{\infty}$ atlas on the unit sphere in $\mathbb{R}^{3}$.

Example 2.

We denote $\mathbb{S}^{2}$ as the unit sphere in $\mathbb{R}^{3}$ , i.e., $x^{2}+y^{2}+z^{2}=1$ . The following atlas of $\mathbb{S}^{2}$ consists of $6$ overlapping charts $(U_{1},{\sf P}_{1}),\dots,(U_{6},{\sf P}_{6})$ corresponding to hemispheres:

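$$\begin{aligned}U_{1}&=\{(x,y,z)\in\mathbb{S}^{2}:x>0\},&{\sf P}_{1}(x,y,z)&=(y,z),\\ U_{2}&=\{(x,y,z)\in\mathbb{S}^{2}:x<0\},&{\sf P}_{2}(x,y,z)&=(y,z),\\ U_{3}&=\{(x,y,z)\in\mathbb{S}^{2}:y>0\},&{\sf P}_{3}(x,y,z)&=(x,z),\\ U_{4}&=\{(x,y,z)\in\mathbb{S}^{2}:y<0\},&{\sf P}_{4}(x,y,z)&=(x,z),\\ U_{5}&=\{(x,y,z)\in\mathbb{S}^{2}:z>0\},&{\sf P}_{5}(x,y,z)&=(x,y),\\ U_{6}&=\{(x,y,z)\in\mathbb{S}^{2}:z<0\},&{\sf P}_{6}(x,y,z)&=(x,y).\end{aligned}$$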

Here ${\sf P}_{i}$ is the orthogonal projection onto the tangent space at the pole of each hemisphere. Moreover, all six charts are $C^{\infty}$ compatible, and therefore, $(U_{1},{\sf P}_{1}),\dots,(U_{6},{\sf P}_{6})$ form an atlas of $\mathbb{S}^{2}$.
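The covering property in Example 2 can be checked numerically. The sketch below assumes that the six charts are the open hemispheres $x>0$, $x<0$, $y>0$, $y<0$, $z>0$, $z<0$ (in that order), and that each projection drops the coordinate along the pole axis; the helper names are hypothetical.

```python
import numpy as np

def chart_membership(p):
    """Boolean vector of length 6: which open hemispheres U_1..U_6 contain p."""
    x, y, z = p
    return np.array([x > 0, x < 0, y > 0, y < 0, z > 0, z < 0])

def chart_coordinates(p, i):
    """Local coordinates in chart i (0-based): drop the coordinate along the pole axis."""
    keep = [j for j in range(3) if j != i // 2]
    return p[keep]

# Sample points uniformly on the unit sphere and test that every point lies in some chart.
rng = np.random.default_rng(0)
pts = rng.normal(size=(1000, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
covered = np.array([chart_membership(p).any() for p in pts])
```

Every point of $\mathbb{S}^{2}$ has at least one nonzero coordinate, so it falls in at least one hemisphere; points away from the coordinate planes fall in exactly three.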

For a general compact smooth manifold $\mathcal{M}$ , we can construct an atlas using orthogonal projections to tangent spaces as local coordinate systems. Let ${\sf P}_{\mathbf{c}}$ be the orthogonal projection to the tangent space $T_{\mathbf{c}}(\mathcal{M})$ for $\mathbf{c}\in\mathcal{M}$ . Let $U_{\mathbf{c}}$ be an open coordinate neighborhood containing $\mathbf{c}$ such that ${\sf P}_{\mathbf{c}}$ is a homeomorphism. Since $\mathcal{M}$ is compact, there exist a finite number of points $\{\mathbf{c}_{i}\}$ such that the charts $\{(U_{\mathbf{c}_{i}},{\sf P}_{\mathbf{c}_{i}})\}$ form an atlas of $\mathcal{M}$ .

The existence of an atlas on $\mathcal{M}$ allows us to define differentiable functions.

Definition 4 ( $C^{s}$ Functions on $\mathcal{M}$ ).

Let $\mathcal{M}$ be a $d$ -dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^{D}$ . A function $f:\mathcal{M}\mapsto\mathbb{R}$ is $C^{s}$ if for any chart $(U,\phi)$ , the composition $f\circ\phi^{-1}:\phi(U)\mapsto\mathbb{R}$ is continuously differentiable up to order $s$ .

Remark 1.

The definition of $C^{s}$ functions is independent of the choice of the chart $(U,\phi)$ . Suppose $(V,\psi)$ is another chart and $V\bigcap U\neq\emptyset$ . Then we have

$$f\circ\psi^{-1}=(f\circ\phi^{-1})\circ(\phi\circ\psi^{-1}).$$

Since $\mathcal{M}$ is a smooth manifold, $(U,\phi)$ and $(V,\psi)$ are $C^{\infty}$ compatible. Thus, $f\circ\phi^{-1}$ is $C^{s}$ and $\phi\circ\psi^{-1}$ is $C^{\infty}$ , and their composition is $C^{s}$ .

We next generalize the definition of $C^{s}$ functions to Hölder functions on the smooth manifold $\mathcal{M}$ .

Definition 5 (Hölder Functions on $\mathcal{M}$ ).

Let $\mathcal{M}$ be a $d$ -dimensional compact Riemannian manifold isometrically embedded in $\mathbb{R}^{D}$ . Let $\{(U_{i},{\sf P}_{i})\}_{i\in\mathcal{A}}$ be an atlas of $\mathcal{M}$ where the ${\sf P}_{i}$ ’s are orthogonal projections onto tangent spaces. For a positive integer $s$ and $\alpha\in(0,1]$ , a function $f:\mathcal{M}\mapsto\mathbb{R}$ is $(s+\alpha)$ -Hölder continuous if for each chart $(U_{i},{\sf P}_{i})$ in the atlas, we have

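1. $f\circ{\sf P}_{i}^{-1}\in C^{s}$ with $|D^{\mathbf{s}}(f\circ{\sf P}_{i}^{-1})|\leq 1$ for any $|\mathbf{s}|\leq s$ and $\mathbf{x}\in U_{i}$;

2. for any $|\mathbf{s}|=s$ and $\mathbf{x}_{1},\mathbf{x}_{2}\in U_{i}$,

$$\left|D^{\mathbf{s}}(f\circ{\sf P}_{i}^{-1})\big|_{{\sf P}_{i}(\mathbf{x}_{1})}-D^{\mathbf{s}}(f\circ{\sf P}_{i}^{-1})\big|_{{\sf P}_{i}(\mathbf{x}_{2})}\right|\leq\left\lVert{\sf P}_{i}(\mathbf{x}_{1})-{\sf P}_{i}(\mathbf{x}_{2})\right\rVert_{2}^{\alpha}.$$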

Moreover, we denote the collection of $(s+\alpha)$ -Hölder functions on $\mathcal{M}$ as $\mathcal{H}^{s,\alpha}(\mathcal{M})$ .

Definition 5 requires that all $s$ -th order derivatives of $f\circ{\sf P}_{i}^{-1}$ are Hölder continuous. We recover the standard Hölder class on a Euclidean space if ${\sf P}_{i}$ is the identity mapping. We next introduce the partition of unity, which plays a crucial role in our construction of neural networks.

Definition 6 (Partition of Unity, Definition 13.4 in Tu (2010)).

A $C^{\infty}$ partition of unity on a manifold $\mathcal{M}$ is a collection of nonnegative $C^{\infty}$ functions $\rho_{i}:\mathcal{M}\mapsto\mathbb{R}_{+}$ for $i\in\mathcal{A}$ such that

1. the collection of supports, $\{\textrm{supp}(\rho_{i})\}_{i\in\mathcal{A}}$, is locally finite, i.e., every point on $\mathcal{M}$ has a neighborhood that meets only finitely many of the ${\rm supp}(\rho_{i})$’s;

2. $\displaystyle\sum\rho_{i}=1$.

For a smooth manifold, a $C^{\infty}$ partition of unity always exists.

Proposition 1 (Existence of a $C^{\infty}$ partition of unity, Theorem 13.7 in Tu (2010)).

Let $\{U_{i}\}_{i\in\mathcal{A}}$ be an open cover of a compact smooth manifold $\mathcal{M}$ . Then there is a $C^{\infty}$ partition of unity $\{\rho_{i}\}_{i\in\mathcal{A}}$ where every $\rho_{i}$ has a compact support such that $\textrm{supp}(\rho_{i})\subset U_{i}$ .

Proposition 1 gives rise to the decomposition $f=\sum_{i\in\mathcal{A}}f_{i}$ with $f_{i}=f\rho_{i}$. Note that the $f_{i}$’s have the same regularity as $f$, since

$$f_{i}\circ\phi_{i}^{-1}=(f\circ\phi_{i}^{-1})\times(\rho_{i}\circ\phi_{i}^{-1})$$

for a chart $(U_{i},\phi_{i})$ . This decomposition implies that we can express $f$ as a sum of the $f_{i}$ ’s, where every $f_{i}$ is only supported in a single chart.

To characterize the curvature of a manifold, we adopt the following geometric concept.

Definition 7 (Reach (Federer, 1959), Definition 2.1 in Aamari et al. (2019)).

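Denote

$$\mathcal{C}(\mathcal{M})=\left\{\mathbf{x}\in\mathbb{R}^{D}:\exists\,\mathbf{p}\neq\mathbf{q}\in\mathcal{M},\ \lVert\mathbf{p}-\mathbf{x}\rVert_{2}=\lVert\mathbf{q}-\mathbf{x}\rVert_{2}=\inf_{\mathbf{y}\in\mathcal{M}}\lVert\mathbf{y}-\mathbf{x}\rVert_{2}\right\}$$

as the set of points that have at least two nearest neighbors on $\mathcal{M}$. The reach $\tau>0$ is defined as

$$\tau=\inf_{\mathbf{x}\in\mathcal{M},\,\mathbf{y}\in\mathcal{C}(\mathcal{M})}\lVert\mathbf{x}-\mathbf{y}\rVert_{2}.$$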

Reach has a straightforward geometrical interpretation: At each point $\mathbf{x}\in\mathcal{M}$, the radius of the osculating circle is greater than or equal to $\tau$. Intuitively, a large reach for $\mathcal{M}$ requires the manifold $\mathcal{M}$ not to change “rapidly” as shown in Figure 2.

In our proof for the universal approximation theory, reach determines a proper choice of an atlas for $\mathcal{M}$ . In Section 4, we choose each chart $U_{i}$ to be contained in a ball of radius less than $\tau/2$ . For smooth manifolds with a small $\tau$ , we need a large number of charts. Therefore, reach of a smooth manifold reflects the complexity of the neural network for function approximation on $\mathcal{M}$ .

3 Main Results

This section contains our main statistical estimation theory for Hölder functions on low-dimensional manifolds using deep neural networks. We begin with some assumptions on the regression model and the manifold $\mathcal{M}$ .

Assumption 1.

$\mathcal{M}$ is a $d$ -dimensional compact Riemannian manifold isometrically embedded in $\mathbb{R}^{D}$ . There exists a constant $B>0$ such that, for any point $\mathbf{x}\in\mathcal{M}$ , we have $|x_{j}|\leq B$ for all $j=1,\dots,D$ .

Assumption 2.

The reach of $\mathcal{M}$ is $\tau>0$ .

Assumption 3.

The ground truth function $f_{0}:\mathcal{M}\mapsto\mathbb{R}$ belongs to the Hölder space $\mathcal{H}^{s,\alpha}(\mathcal{M})$ with a positive integer $s$ and $\alpha\in(0,1]$ .

Assumption 4.

The noise $\xi_{i}$ ’s are i.i.d. sub-Gaussian with $\mathbb{E}[\xi_{i}]=0$ and variance proxy $\sigma^{2}$ , which are independent of the $\mathbf{x}_{i}$ ’s.

3.1 Universal Approximation Theory

An accurate estimation of the nonparametric regression function $f_{0}$ necessitates the existence of a good approximation of $f_{0}$ by our learning models — neural networks. To aid the choice of a proper neural network class for learning $f_{0}$ , we first investigate the following questions:

•

Given a desired approximation error $\epsilon>0$ , does there exist a ReLU neural network which universally represents Hölder functions supported on $\mathcal{M}$ ?

•

If the answer is yes, what is the corresponding network architecture?

We provide a positive answer in the theorem below and defer the proof to Section 4.

Theorem 1.

Suppose Assumptions 1 and 2 hold. Given any $\epsilon\in(0,1)$ , there exists a ReLU network structure $\mathcal{F}(\cdot,\kappa,L,p,K)$ , such that, for any $f:\mathcal{M}\rightarrow\mathbb{R}$ satisfying Assumption 3, if the weight parameters of the network are properly chosen, the network yields a function $\widetilde{f}$ satisfying $\lVert\widetilde{f}-f\rVert_{\infty}\leq\epsilon.$ Such a network has

1. no more than $L=c_{1}(\log\frac{1}{\epsilon}+\log D)$ layers, with width bounded by $p=c_{2}(\epsilon^{-\frac{d}{s+\alpha}}+D)$,

2. at most $K=c_{3}(\epsilon^{-\frac{d}{s+\alpha}}\log\frac{1}{\epsilon}+D\log\frac{1}{\epsilon}+D\log D)$ neurons and weight parameters, with the range of weight parameters bounded by $\kappa=c_{4}\max\{1,B,\tau^{2},\sqrt{d}\}$,

where $c_{1},c_{2},c_{3}$ depend on $d$ , $s$ , $\tau$ , $B$ , the surface area of $\mathcal{M}$ , and the upper bounds on the derivatives of the coordinate systems $\phi_{i}$ ’s and the $\rho_{i}$ ’s in the partition of unity, up to order $s$ , and $c_{4}$ depends on the upper bound on the derivatives of the $\rho_{i}$ ’s, up to order $s$ .
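To get a feel for the scaling in Theorem 1, the following sketch evaluates the three size bounds with the unknown constants $c_{1},c_{2},c_{3}$ replaced by the placeholder value $1$. This is an assumption made purely for illustration, so only the dependence on $\epsilon$, $d$, and $D$ is meaningful; the function name `network_size` is hypothetical.

```python
import math

def network_size(eps, d, D, s, alpha, c1=1.0, c2=1.0, c3=1.0):
    """Evaluate the depth, width, and parameter bounds of Theorem 1.

    c1, c2, c3 are placeholders for the theorem's unspecified constants.
    """
    rate = eps ** (-d / (s + alpha))          # eps^{-d/(s+alpha)}
    log_inv = math.log(1.0 / eps)
    L = c1 * (log_inv + math.log(D))
    p = c2 * (rate + D)
    K = c3 * (rate * log_inv + D * log_inv + D * math.log(D))
    return L, p, K

# Intrinsic dimension d = 2 inside ambient dimension D = 100, with s + alpha = 2.
L, p, K = network_size(eps=0.1, d=2, D=100, s=1, alpha=1.0)

# For comparison: the dominant factor if the rate depended on D instead of d.
ambient_factor = 0.1 ** (-100 / 2.0)
```

With these inputs the manifold-adaptive factor $\epsilon^{-d/(s+\alpha)}$ equals $10$, whereas the ambient-dimension analogue $\epsilon^{-D/(s+\alpha)}$ is of order $10^{50}$.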

This network class $\mathcal{F}$ will be used later to estimate a regression function in Theorem 2. Our approximation theory does not require the output range to be bounded by $R$ in the network class (or equivalently by setting $R=+\infty$ ). The enforcement of $\|f\|_{\infty}\leq R$ is to be imposed for regression in order to control the variance in statistical estimations.

The network structure identified by Theorem 1 consists of three sub-networks as shown in Figure 3 (the detailed construction of each sub-network is postponed to Section 4):

•

Chart determination sub-network, which assigns each input to its corresponding neighborhood;

•

Taylor approximation sub-network, which approximates $f$ by polynomials in each neighborhood;

•

Pairing sub-network, which yields multiplications of the proper pairs of the outputs from the chart determination and the Taylor approximation sub-networks.

Theorem 1 significantly improves on existing approximation theories (Yarotsky, 2017), where the network size grows exponentially with respect to the ambient dimension $D$ , i.e. $\epsilon^{-D/(s+\alpha)}$ . Theorem 1 also improves Shaham et al. (2018) for $C^{s}$ functions in the case that $s>2$ . When $s>2$ , our network size scales like $\epsilon^{-d/s}$ , which is significantly smaller than the one in Shaham et al. (2018) in the order of $\epsilon^{-d/2}$ .

Our approximation theory can be directly generalized to the Sobolev space $\mathcal{W}^{k,\infty}$, which is embedded in $C^{k}$. The reason is that our proof of Theorem 1 relies on local Taylor polynomial approximations of Hölder functions. For general Sobolev spaces $\mathcal{W}^{k,p}$, one needs to consider averaged Taylor polynomials and the Bramble-Hilbert lemma (Brenner and Scott, 2007, Lemma 4.3.8). We refer interested readers to Gühring et al. (2020).

Moreover, the size of our ReLU network in Theorem 1 matches the lower bound in DeVore et al. (1989) up to a logarithmic factor for the approximation of functions in the Hölder space $\mathcal{H}^{s-1,1}([0,1]^{d})$ defined on $[0,1]^{d}$ .

Proposition 2.

Fix $d$ and $s$ . Let $W$ be a positive integer and $\mathcal{T}:\mathbb{R}^{W}\mapsto C([0,1]^{d})$ be any mapping. Suppose there is a continuous map $\Theta:\mathcal{H}^{s-1,1}([0,1]^{d})\mapsto\mathbb{R}^{W}$ such that $\lVert f-\mathcal{T}(\Theta(f))\rVert_{\infty}\leq\epsilon$ for any $f\in\mathcal{H}^{s-1,1}([0,1]^{d})$ . Then $W\geq c\epsilon^{-\frac{d}{s}}$ with $c$ depending on $s$ only.

We take $\mathbb{R}^{W}$ as the parameter space of a ReLU network, and $\mathcal{T}$ as the transformation given by the ReLU network. Proposition 2 implies that, to approximate any $f\in\mathcal{H}^{s-1,1}([0,1]^{d})$, the ReLU network needs to have at least $c\epsilon^{-\frac{d}{s}}$ weight parameters. Although Proposition 2 holds for functions defined on $[0,1]^{d}$, our network size remains in the same order up to a logarithmic factor even when the function is supported on a manifold of dimension $d$.

On the other hand, the lower bound also reveals that the low-dimensional manifold model plays a vital role in reducing the network size. To uniformly approximate functions in $\mathcal{H}^{s-1,1}([0,1]^{D})$ with an accuracy $\epsilon$, the minimal number of weight parameters is $O(\epsilon^{-\frac{D}{s}})$. This lower bound cannot be improved without low-dimensional structures of data.

3.2 Statistical Estimation Theory

Based on Theorem 1, we next present our main regression theorem, which characterizes the convergence rate for the estimation of $f_{0}$ using ReLU neural networks.

Theorem 2.

Suppose Assumptions 1 - 3 hold. Let $\widehat{f}_{n}$ be the minimizer of empirical risk (3) with the network class $\mathcal{F}(R,\kappa,L,p,K)$ properly designed such that

[TABLE]

Then we have

[TABLE]

where the expectation is taken over the training samples $S_{n}$ , and $c$ is a constant depending on $\log D$ , $d$ , $s$ , $\tau$ , $B$ , the surface area of $\mathcal{M}$ , and the upper bounds of derivatives of the coordinate systems $\phi_{i}$ ’s and partition of unity $\rho_{i}$ ’s, up to order $s$ .
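The benefit of the intrinsic dimension can be made concrete by evaluating the rate $n^{-\frac{2(s+\alpha)}{2(s+\alpha)+d}}\log^{3}n$ stated in the abstract. The sketch below omits the constant $c$ and compares the exponent obtained with $d$ against the one that would result if the ambient dimension $D$ appeared instead; the function names and the chosen values of $n$, $d$, $D$ are illustrative assumptions.

```python
import math

def exponent(s, alpha, dim):
    """Polynomial exponent 2(s+alpha) / (2(s+alpha) + dim) of the convergence rate."""
    beta = s + alpha
    return 2 * beta / (2 * beta + dim)

def rate(n, s, alpha, dim):
    """n^{-exponent} * log(n)^3, with the multiplicative constant omitted."""
    return n ** (-exponent(s, alpha, dim)) * math.log(n) ** 3

n = 10**6
intrinsic = rate(n, s=1, alpha=1.0, dim=2)    # depends on d = 2
ambient = rate(n, s=1, alpha=1.0, dim=100)    # hypothetical rate if D = 100 appeared
```

For $s+\alpha=2$ the exponent is $2/3$ with $d=2$ but only $1/26$ with $D=100$, so the rate in terms of $d$ decays far faster in $n$.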

Theorem 2 is established by a bias-variance trade-off. We decompose the mean squared error into a squared bias term and a variance term. The bias is quantified by Theorem 1, and the variance term is proportional to the network size. A detailed proof of Theorem 2 is provided in Section 5. Here are some remarks:

1. The network class in Theorem 2 is sparsely connected, i.e., $K=O(Lp)$, while densely connected networks satisfy $K=O(Lp^{2})$.

2. The network class $\mathcal{F}(R,\kappa,L,p,K)$ has outputs uniformly bounded by $R$. Such a requirement can be achieved by appending an additional clipping layer to the end of the network structure, i.e.,

$$g(a)=\max\{-R,\min\{a,R\}\}=\textrm{ReLU}(a+R)-\textrm{ReLU}(a-R)-R.$$

3. Each weight parameter in our network class is bounded by a constant $\kappa$ only depending on the reach $\tau$, the range $B$ of the manifold $\mathcal{M}$, and the manifold dimension $d$. Such a boundedness condition is crucial to our theory and can be computationally realized by normalization after each step of the stochastic gradient descent.
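The clipping layer mentioned in the second remark can be written with two ReLU units. The following is a minimal sketch, assuming the clipping map is $a\mapsto\max\{-R,\min\{a,R\}\}$; the function name `clip_layer` is hypothetical.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def clip_layer(a, R):
    """Clip a to [-R, R] using two ReLU units: ReLU(a + R) - ReLU(a - R) - R."""
    return relu(a + R) - relu(a - R) - R

a = np.linspace(-3.0, 3.0, 13)
clipped = clip_layer(a, R=1.0)
```

The three regimes are $a<-R$ (output $-R$), $-R\leq a\leq R$ (output $a$), and $a>R$ (output $R$), so the expression agrees with the usual clipping operation.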

4 Proof of Approximation Theory

This section contains a proof sketch of Theorem 1. Before we proceed, we show how to approximate the multiplication operation using ReLU networks. This operation is heavily used in the Taylor approximation sub-network, since Taylor polynomials involve a sum of products. We first show ReLU networks can approximate quadratic functions.

Lemma 1 (Proposition $2$ in Yarotsky (2017)).

The function $f(x)=x^{2}$ with $x\in[0,1]$ can be approximated by a ReLU network with any error $\epsilon>0$ . The network has depth and the number of neurons and weight parameters no more than $c\log(1/\epsilon)$ with an absolute constant $c$ , and the width of the network is an absolute constant.

This lemma is proved in Appendix A.1. The idea is to approximate quadratic functions using a weighted sum of a series of sawtooth functions. Those sawtooth functions are obtained by composing the triangular function

$$g(x)=2\textrm{ReLU}(x)-4\textrm{ReLU}(x-1/2)+2\textrm{ReLU}(x-1),$$

which can be implemented by a single layer ReLU network.
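The sawtooth construction behind Lemma 1 can be sketched as follows, assuming the standard construction of Yarotsky (2017): with $g_{k}$ the $k$-fold composition of the triangular function $g$, the function $f_{m}(x)=x-\sum_{k=1}^{m}g_{k}(x)/4^{k}$ is the piecewise linear interpolant of $x^{2}$ on a uniform grid with $2^{m}+1$ nodes. The names `tooth` and `approx_square` are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tooth(x):
    """Triangular function on [0, 1]: 2x on [0, 1/2] and 2(1 - x) on [1/2, 1]."""
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def approx_square(x, m):
    """f_m(x) = x - sum_{k=1}^m g_k(x) / 4^k, where g_k is tooth composed k times."""
    out = np.array(x, dtype=float)
    g = np.array(x, dtype=float)
    for k in range(1, m + 1):
        g = tooth(g)            # one more composition, i.e., one more layer
        out = out - g / 4.0**k
    return out

x = np.linspace(0.0, 1.0, 1001)
errors = [np.max(np.abs(approx_square(x, m) - x**2)) for m in range(1, 7)]
```

Each additional composition adds one layer and reduces the uniform error by a factor of $4$ (the error of $f_{m}$ is at most $2^{-2m-2}$), which is why a depth of order $\log(1/\epsilon)$ suffices.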

We then approximate the multiplication operation by invoking the identity $ab=\frac{1}{4}((a+b)^{2}-(a-b)^{2})$ where the two squares can be approximated by ReLU networks in Lemma 1.

Corollary 1 (Proposition $3$ in Yarotsky (2017)).

Given a constant $C>0$ and $\epsilon\in(0,C^{2})$, there is a ReLU network which implements a function $\widehat{\times}:\mathbb{R}^{2}\mapsto\mathbb{R}$ such that: 1). For all inputs $x$ and $y$ satisfying $|x|\leq C$ and $|y|\leq C$, we have $|\widehat{\times}(x,y)-xy|\leq\epsilon$; 2). The depth and the number of weight parameters of the network are no more than $c\log\frac{C^{2}}{\epsilon}$ with an absolute constant $c$.
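A minimal sketch of the approximate multiplication in Corollary 1 is given below. It uses the polarization identity together with an approximate square on $[0,1]$, rescaling the inputs by $2C$; the specific rescaling and the names `approx_square` and `approx_mul` are assumptions made for illustration, not the paper's exact construction.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tooth(x):
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def approx_square(x, m):
    """Piecewise linear interpolant of x^2 on [0, 1] built from m sawtooth layers."""
    out = np.array(x, dtype=float)
    g = np.array(x, dtype=float)
    for k in range(1, m + 1):
        g = tooth(g)
        out = out - g / 4.0**k
    return out

def approx_mul(a, b, C, m):
    """Approximate a * b for |a|, |b| <= C via ab = ((a + b)^2 - (a - b)^2) / 4."""
    # |u| = ReLU(u) + ReLU(-u), so the absolute values are ReLU-expressible.
    s = (relu(a + b) + relu(-(a + b))) / (2 * C)   # |a + b| / (2C) lies in [0, 1]
    t = (relu(a - b) + relu(-(a - b))) / (2 * C)   # |a - b| / (2C) lies in [0, 1]
    return C**2 * (approx_square(s, m) - approx_square(t, m))

C, m = 2.0, 8
rng = np.random.default_rng(0)
a = rng.uniform(-C, C, size=10_000)
b = rng.uniform(-C, C, size=10_000)
max_err = np.max(np.abs(approx_mul(a, b, C, m) - a * b))
```

Since each approximate square has error at most $2^{-2m-2}$, the product error is at most $C^{2}2^{-2m-1}$, so $m$ of order $\log\frac{C^{2}}{\epsilon}$ layers achieve accuracy $\epsilon$.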

The ReLU network in Theorem 1 is constructed in the following 5 steps.

Step 1. Construction of an atlas. Denote the open Euclidean ball with center $\mathbf{c}$ and radius $r$ in $\mathbb{R}^{D}$ by $\mathcal{B}(\mathbf{c},r)$ . For any $r$ , the collection $\{\mathcal{B}(\mathbf{x},r)\}_{\mathbf{x}\in\mathcal{M}}$ is an open cover of $\mathcal{M}$ . Since $\mathcal{M}$ is compact, there exists a finite collection of points $\mathbf{c}_{i}$ for $i=1,\dots,C_{\mathcal{M}}$ such that $\mathcal{M}\subset\bigcup_{i}\mathcal{B}(\mathbf{c}_{i},r).$

The following lemma says that when the radius $r$ is properly chosen, $U_{i}=\mathcal{B}(\mathbf{c}_{i},r)\cap\mathcal{M}$ is diffeomorphic to $\mathbb{R}^{d}$ .

Lemma 2.

Suppose Assumptions 1 and 2 hold and let $r\leq\tau/4$. Then the local neighborhood $U_{i}=\mathcal{B}(\mathbf{c}_{i},r)\cap\mathcal{M}$ is diffeomorphic to $\mathbb{R}^{d}$. In particular, the orthogonal projection ${\sf P}_{i}$ onto the tangent space $T_{\mathbf{c}_{i}}(\mathcal{M})$ at $\mathbf{c}_{i}$ is a diffeomorphism.

The proof is provided in Appendix B.1, which utilizes the results in Niyogi et al. (2008). Therefore, we pick radius $r\leq\tau/4$, and let $\{(U_{i},\phi_{i})\}_{i=1}^{C_{\mathcal{M}}}$ be an atlas on $\mathcal{M}$ as illustrated in Figure 4, where $\phi_{i}$ is to be defined in Step 2. The number of charts $C_{\mathcal{M}}$ is upper bounded by

$$C_{\mathcal{M}}\leq\left\lceil\frac{SA(\mathcal{M})}{r^{d}}T_{d}\right\rceil,$$

where $SA(\mathcal{M})$ is the surface area of $\mathcal{M}$, and $T_{d}$ is the thickness of the $U_{i}$’s, which is defined as the average number of $U_{i}$’s that contain a point on $\mathcal{M}$ (see Eq. (1) in Chapter $2$ of Conway et al. (1987)).

Remark 2.

The thickness $T_{d}$ scales approximately linearly in $d$. As shown in Eq. (19) in Chapter $2$ of Conway et al. (1987), there exist coverings with $\frac{d}{e\sqrt{e}}\lesssim T_{d}\leq d\log d+d\log\log d+5d$.

Step 2. Projection with rescaling and translation. We denote the tangent space at $\mathbf{c}_{i}$ as

$$T_{\mathbf{c}_{i}}(\mathcal{M})=\textrm{span}(\mathbf{v}_{i1},\dots,\mathbf{v}_{id}),$$

where $\{\mathbf{v}_{i1},\dots,\mathbf{v}_{id}\}$ form an orthonormal basis. We obtain the matrix $V_{i}=[\mathbf{v}_{i1},\dots,\mathbf{v}_{id}]\in\mathbb{R}^{D\times d}$ by concatenating the $\mathbf{v}_{ij}$ ’s as column vectors.

Define

$$\phi_{i}(\mathbf{x})=b_{i}(V_{i}^{\top}(\mathbf{x}-\mathbf{c}_{i})+\mathbf{u}_{i})\in[0,1]^{d}$$

for any $\mathbf{x}\in U_{i}$ , where $b_{i}\in(0,1]$ is a scaling factor and $\mathbf{u}_{i}$ is a translation vector. Since $U_{i}$ is bounded, we can choose proper $b_{i}$ and $\mathbf{u}_{i}$ to guarantee $\phi_{i}(\mathbf{x})\in[0,1]^{d}$ . We rescale and translate the projection to ease the notation for the development of local Taylor approximations in Step 4. We also remark that each $\phi_{i}$ is a linear function, and can be realized by a single layer linear network.
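One concrete way to pick the scaling and translation in Step 2 is sketched below. It assumes the map has the form $\phi_{i}(\mathbf{x})=b_{i}(V_{i}^{\top}(\mathbf{x}-\mathbf{c}_{i})+\mathbf{u}_{i})$ and uses the hypothetical choice $\mathbf{u}_{i}=r\mathbf{1}$ and $b_{i}=\min\{1,1/(2r)\}$, which is sufficient because every coordinate of $V_{i}^{\top}(\mathbf{x}-\mathbf{c}_{i})$ has magnitude below $r$ on $U_{i}$. The paper does not prescribe these particular values.

```python
import numpy as np

def make_chart_map(c, V, r):
    """Build phi(x) = b * (V^T (x - c) + u) for a chart inside the ball B(c, r).

    Assumed choice (not from the paper): u = r * 1 and b = min(1, 1 / (2 r)).
    """
    u = r * np.ones(V.shape[1])
    b = min(1.0, 1.0 / (2.0 * r))

    def phi(x):
        return b * (V.T @ (x - c) + u)

    return phi, b, u

# Hypothetical chart on the unit sphere (reach tau = 1), centered at the north pole.
c = np.array([0.0, 0.0, 1.0])
V = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
r = 0.25
phi, b, u = make_chart_map(c, V, r)

# Sample points of the sphere whose Euclidean distance to c is below r.
rng = np.random.default_rng(0)
polar = rng.uniform(0.0, 2 * np.arcsin(r / 2), size=500)
azimuth = rng.uniform(0.0, 2 * np.pi, size=500)
pts = np.stack([np.sin(polar) * np.cos(azimuth),
                np.sin(polar) * np.sin(azimuth),
                np.cos(polar)], axis=1)
coords = np.array([phi(p) for p in pts])   # rescaled local coordinates
```

Because `phi` is affine in `x`, it corresponds to a single linear layer with weight matrix $b_{i}V_{i}^{\top}$ and bias $b_{i}(\mathbf{u}_{i}-V_{i}^{\top}\mathbf{c}_{i})$.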

Step 3. Chart determination. This step is to assign a given input $\mathbf{x}$ to the proper charts to which $\mathbf{x}$ belongs. This avoids projecting $\mathbf{x}$ using unmatched charts (i.e., $\mathbf{x}\not\in U_{j}$ for some $j$ ) as illustrated in Figure 5.

An input $\mathbf{x}$ can belong to multiple charts, and the chart determination sub-network determines all these charts. This can be realized by composing an indicator function and the squared Euclidean distance

$$d_{i}^{2}(\mathbf{x})=\lVert\mathbf{x}-\mathbf{c}_{i}\rVert_{2}^{2}=\sum_{j=1}^{D}(x_{j}-c_{i,j})^{2}$$

for $i=1,\dots,C_{\mathcal{M}}$ . The squared distance $d_{i}^{2}(\mathbf{x})$ is a sum of univariate quadratic functions, thus, we can apply Lemma 1 to approximate $d_{i}^{2}(\mathbf{x})$ by ReLU networks. Denote $\widehat{h}_{\textrm{sq}}$ as an approximation of the quadratic function $x^{2}$ on $[0,1]$ with an approximation error $\nu$ . Then we define

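$$\widehat{d}_{i}^{2}(\mathbf{x})=4B^{2}\sum_{j=1}^{D}\widehat{h}_{\textrm{sq}}\left(\left|\frac{x_{j}-c_{i,j}}{2B}\right|\right)$$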

as an approximation of $d_{i}^{2}(\mathbf{x})$ . The approximation error is $\lVert\widehat{d}_{i}^{2}-d_{i}^{2}\rVert_{\infty}\leq 4B^{2}D\nu$ , by the triangle inequality. We consider an approximation of the indicator function $\mathds{1}(x\in[0,r^{2}])$ as in Figure 6:

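$$\widehat{\mathds{1}}_{\Delta}(a)=\begin{cases}1,&a\leq r^{2}-\Delta+4B^{2}D\nu,\\ -\frac{1}{\Delta-8B^{2}D\nu}a+\frac{r^{2}-4B^{2}D\nu}{\Delta-8B^{2}D\nu},&a\in[r^{2}-\Delta+4B^{2}D\nu,\,r^{2}-4B^{2}D\nu],\\ 0,&a>r^{2}-4B^{2}D\nu,\end{cases}$$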

where $\Delta$ ( $\Delta\geq 8B^{2}D\nu$ ) will be chosen later according to the accuracy $\epsilon$ .

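To implement $\widehat{\mathds{1}}_{\Delta}(a)$, we consider a basic step function $g(x)=2\textrm{ReLU}(x-0.5(r^{2}-4B^{2}D\nu))-2\textrm{ReLU}(x-r^{2}+4B^{2}D\nu)$. It is straightforward to check

$$g_{k}(a)=\underbrace{g\circ\cdots\circ g}_{k}(a)=\begin{cases}0,&a<(1-2^{-k})(r^{2}-4B^{2}D\nu),\\ 2^{k}(a-r^{2}+4B^{2}D\nu)+r^{2}-4B^{2}D\nu,&(1-2^{-k})(r^{2}-4B^{2}D\nu)\leq a\leq r^{2}-4B^{2}D\nu,\\ r^{2}-4B^{2}D\nu,&a>r^{2}-4B^{2}D\nu.\end{cases}$$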

Let $\widehat{\mathds{1}}_{\Delta}=1-\frac{1}{r^{2}-4B^{2}D\nu}g_{k}$ . It suffices to choose $k$ satisfying $(1-\frac{1}{2^{k}})(r^{2}-4B^{2}D\nu)\geq r^{2}-\Delta+4B^{2}D\nu$ , which yields $k=\left\lceil\log\frac{r^{2}}{\Delta}\right\rceil$ . We use $\widehat{\mathds{1}}_{\Delta}\circ\widehat{d}_{i}^{2}$ to approximate the indicator function on $U_{i}$ :

•

if $\mathbf{x}\not\in U_{i}$ , i.e., $d_{i}^{2}(\mathbf{x})\geq r^{2}$ , we have $\widehat{\mathds{1}}_{\Delta}\circ\widehat{d}_{i}^{2}(\mathbf{x})=0$ ;

•

if $\mathbf{x}\in U_{i}$ and $d_{i}^{2}(\mathbf{x})\leq r^{2}-\Delta$ , we have $\widehat{\mathds{1}}_{\Delta}\circ\widehat{d}_{i}^{2}(\mathbf{x})=1$ .

We remark that although the approximate indicator function $\widehat{\mathds{1}}_{\Delta}$ is a piecewise linear function with two breakpoints, we implement it using a deep neural network to control the range of weight parameters in the network. Otherwise, the parameter upper bound can be as large as $1/\Delta$ due to the steep slope in $\widehat{\mathds{1}}_{\Delta}$ , which undermines the statistical theory.
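The deep implementation of the approximate indicator can be sketched as follows. Writing $c=r^{2}-4B^{2}D\nu$, the step function is $g(x)=2\textrm{ReLU}(x-c/2)-2\textrm{ReLU}(x-c)$ and the indicator is $1-g_{k}/c$ with $g_{k}$ the $k$-fold composition of $g$. The numerical values of `c` and `k` below are arbitrary illustrative choices.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def step(x, c):
    """g(x) = 2 ReLU(x - c/2) - 2 ReLU(x - c): zero below c/2, slope 2, then constant c."""
    return 2 * relu(x - 0.5 * c) - 2 * relu(x - c)

def approx_indicator(a, c, k):
    """1 - g_k(a) / c, where g_k is the k-fold composition of g."""
    g = np.array(a, dtype=float)
    for _ in range(k):
        g = step(g, c)          # each composition doubles the slope of the ramp
    return 1.0 - g / c

c, k = 1.0, 5                   # c stands for r^2 - 4 B^2 D nu
a = np.linspace(0.0, 2.0, 2001)
ind = approx_indicator(a, c, k)
```

The resulting ramp has slope $2^{k}/c$ on $[(1-2^{-k})c,\,c]$, yet every individual layer only uses weights of magnitude $2$; this is the sense in which depth keeps the weight range bounded.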

Step 4. Taylor approximation. In each chart $(U_{i},\phi_{i})$, we locally approximate $f$ using Taylor polynomials of order $s$ as shown in Figure 7. Specifically, we decompose $f$ as

$$f=\sum_{i=1}^{C_{\mathcal{M}}}f_{i}\quad\textrm{with}\quad f_{i}=f\rho_{i},$$

where $\rho_{i}$ is an element in a $C^{\infty}$ partition of unity on $\mathcal{M}$ which is supported inside $U_{i}$ . The existence of such a partition of unity is guaranteed by Proposition 1. Since $\mathcal{M}$ is a compact smooth manifold and $\rho_{i}$ is $C^{\infty}$ , $f_{i}$ preserves the regularity (smoothness) of $f$ such that $f_{i}\in\mathcal{H}^{s,\alpha}(\mathcal{M})$ for $i=1,\dots,C_{\mathcal{M}}$ .

Lemma 3.

Suppose Assumption 3 holds. For $i=1,\dots,C_{\mathcal{M}}$ , the function $f_{i}$ is Hölder continuous on $\mathcal{M}$ , in the sense that there exists a Hölder coefficient $L_{i}$ depending on $d,$ the upper bounds of derivatives of the partition of unity $\rho_{i}$ and coordinate system $\phi_{i}$ , up to order $s$ , such that for any $|\mathbf{s}|=s$ , we have

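$$\left|D^{\mathbf{s}}(f_{i}\circ\phi_{i}^{-1})\big|_{\phi_{i}(\mathbf{x}_{1})}-D^{\mathbf{s}}(f_{i}\circ\phi_{i}^{-1})\big|_{\phi_{i}(\mathbf{x}_{2})}\right|\leq L_{i}\left\lVert\phi_{i}(\mathbf{x}_{1})-\phi_{i}(\mathbf{x}_{2})\right\rVert_{2}^{\alpha}.$$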

Proof Sketch.

We provide a sketch here. More details are deferred to Appendix B.2. Without loss of generality, suppose Assumption 3 holds with the atlas chosen in Step 1. Denote $g_{1}=f\circ\phi_{i}^{-1}$ and $g_{2}=\rho_{i}\circ\phi_{i}^{-1}$ . By the Leibniz rule, we have

[TABLE]

Consider each term in the sum: for any $\mathbf{x}_{1},\mathbf{x}_{2}\in U_{i}$ ,

[TABLE]

Here $\lambda_{i}$ and $\mu_{i}$ are uniform upper bounds on the derivatives of $g_{1}$ and $g_{2}$ with order up to $s$, respectively. The quantities $\theta_{i,\alpha}$ and $\beta_{i,\alpha}$ in the last inequality above are chosen as follows: by the mean value theorem, we have

[TABLE]

where the last inequality is due to the fact that $\left\lVert\phi_{i}(\mathbf{x}_{1})-\phi_{i}(\mathbf{x}_{2})\right\rVert_{2}\leq b_{i}\left\lVert V_{i}\right\rVert\left\lVert\mathbf{x}_{1}-\mathbf{x}_{2}\right\rVert_{2}\leq 2r$ . Then we set $\theta_{i,\alpha}=\sqrt{d}\mu_{i}(2r)^{1-\alpha}$ and by a similar argument, we set $\beta_{i,\alpha}=\sqrt{d}\lambda_{i}(2r)^{1-\alpha}$ . We complete the proof by taking $L_{i}=2^{s+1}\sqrt{d}\lambda_{i}\mu_{i}(2r)^{1-\alpha}$ . ∎

Lemma 3 is crucial for the error estimation in the local approximation of $f_{i}\circ\phi_{i}^{-1}$ by Taylor polynomials. This error estimate is given in the following theorem, where some of the proof techniques are from Theorem $1$ in Yarotsky (2017).

Theorem 3.

Let $f_{i}=f\rho_{i}$ as in Step 4. For any $\delta\in(0,1)$, there exists a ReLU network structure such that, if the weight parameters are properly chosen, the network yields an approximation of $f_{i}\circ\phi_{i}^{-1}$ uniformly with an $L_{\infty}$ error $\delta$. Such a network has

1. no more than $c_{1}\left(\log\frac{1}{\delta}+1\right)$ layers, with width bounded by $c_{2}\delta^{-d/(s+\alpha)}$,

2. at most $c_{3}\delta^{-\frac{d}{s+\alpha}}\left(\log\frac{1}{\delta}+1\right)$ neurons and weight parameters, with the range of weight parameters bounded by $\kappa=c_{4}\max\{1,\sqrt{d}\}$,

where $c_{1},c_{2},c_{3}$ depend on $s,d$ , $\tau$ , and the upper bound of derivatives of $f_{i}\circ\phi_{i}^{-1}$ up to order $s$ , and $c_{4}$ depends on the upper bound of the derivatives of $\rho_{i}$ ’s up to order $s$ .

Proof Sketch.

The detailed proof is provided in Appendix B.3. The proof consists of two steps:

1. Approximate $f_{i}\circ\phi_{i}^{-1}$ using a weighted sum of Taylor polynomials;

2. Implement the weighted sum of Taylor polynomials using ReLU networks.

Specifically, we set up a uniform grid and divide $[0,1]^{d}$ into small cubes, and then approximate $f_{i}\circ\phi_{i}^{-1}$ by its $s$ -th order Taylor polynomial in each cube. To implement such polynomials by ReLU networks, we recursively apply the multiplication $\widehat{\times}$ operator in Corollary 1, since these polynomials are sums of the products of different variables. ∎
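The grid-based local Taylor approximation can be illustrated in one dimension. The sketch below is a simplification: it expands a smooth test function to first order around the midpoint of each grid cell and omits the weighted sum over neighboring cells used in the actual proof. The test function $\sin$ and the helper name are assumptions for illustration.

```python
import numpy as np

def piecewise_taylor(f, df, x, N):
    """First-order Taylor expansion of f around the midpoint of the grid cell containing x."""
    cell = np.minimum(np.floor(x * N), N - 1)   # index of the cell [cell/N, (cell+1)/N]
    mid = (cell + 0.5) / N
    return f(mid) + df(mid) * (x - mid)

x = np.linspace(0.0, 1.0, 5001)
errs = {N: np.max(np.abs(piecewise_taylor(np.sin, np.cos, x, N) - np.sin(x)))
        for N in (4, 8, 16, 32)}
```

Since $|\sin''|\leq 1$, the remainder on a cell of width $1/N$ is at most $1/(8N^{2})$, so halving the cell width reduces the error by a factor of about $4$. In $d$ dimensions the number of cells grows like $N^{d}$, which is the source of the $\delta^{-d/(s+\alpha)}$ factor in Theorem 3.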

Step 5. Estimating the total error. We have collected all the ingredients to implement the entire ReLU network to approximate $f$ on $\mathcal{M}$ . Recall that the network structure consists of 3 main sub-networks as demonstrated in Figure 3. Let $\widehat{\times}$ be an approximation to the multiplication operator in the pairing sub-network with error $\eta$ . Accordingly, the function given by the whole network is

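$$\widetilde{f}=\sum_{i=1}^{C_{\mathcal{M}}}\widehat{\times}(\widehat{f}_{i},\widehat{\mathds{1}}_{\Delta}\circ\widehat{d}_{i}^{2})\quad\textrm{with}\quad\widehat{f}_{i}=\widetilde{f}_{i}\circ\phi_{i},$$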

where $\widetilde{f}_{i}$ is the approximation of $f_{i}\circ\phi_{i}^{-1}$ using Taylor polynomials in Theorem 3. The total error can be decomposed into three components according to Lemma 4 below. We denote $\mathds{1}(\mathbf{x}\in U_{i})$ as the indicator function of $U_{i}$ . Let the approximation errors of the multiplication operation $\widehat{\times}$ and the local Taylor polynomial in Theorem 3 be $\eta$ and $\delta$ , respectively.

Lemma 4.

For any $i=1,\dots,C_{\mathcal{M}}$ , we have $\lVert\widetilde{f}-f\rVert_{\infty}\leq\sum_{i=1}^{C_{\mathcal{M}}}(A_{i,1}+A_{i,2}+A_{i,3})$ , where

[TABLE]

Lemma 4 is proved in Appendix B.4. In order to achieve an $\epsilon$ total approximation error, i.e., $\lVert f-\widetilde{f}\rVert_{\infty}\leq\epsilon$ , we need to control the errors in the three sub-networks. In other words, we need to decide $\nu$ for $\widehat{d}_{i}^{2}$ , $\Delta$ for $\widehat{\mathds{1}}_{\Delta}$ , $\delta$ for $\widetilde{f}_{i}$ , and $\eta$ for $\widehat{\times}$ . Note that $A_{i,1}$ is the error from the pairing sub-network, $A_{i,2}$ is the approximation error in the Taylor approximation sub-network, and $A_{i,3}$ is the error from the chart determination sub-network. The error bounds on $A_{i,1},A_{i,2}$ are straightforward from the constructions of $\widehat{\times}$ and $\widehat{f}_{i}$ . The estimate of $A_{i,3}$ involves some technical analysis since $\lVert\widehat{\mathds{1}}_{\Delta}\circ\widehat{d}_{i}^{2}-\mathds{1}(\mathbf{x}\in U_{i})\rVert_{\infty}=1$ . Note that we have

$$\widehat{\mathds{1}}_{\Delta}\circ\widehat{d}_{i}^{2}(\mathbf{x})-\mathds{1}(\mathbf{x}\in U_{i})=0$$

whenever $\left\lVert\mathbf{x}-\mathbf{c}_{i}\right\rVert_{2}^{2}<r^{2}-\Delta$ or $\left\lVert\mathbf{x}-\mathbf{c}_{i}\right\rVert_{2}^{2}>r^{2}$ . Therefore, we only need to prove that $|f_{i}(\mathbf{x})|$ is sufficiently small in the shell region

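$$\mathcal{K}_{i}=\left\{\mathbf{x}\in\mathcal{M}:r^{2}-\Delta\leq\left\lVert\mathbf{x}-\mathbf{c}_{i}\right\rVert_{2}^{2}\leq r^{2}\right\}.$$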

We bound the maximum of $f_{i}$ on $\mathcal{K}_{i}$ using a first-order Taylor expansion. Since $f_{i}$ vanishes at the boundary of $U_{i}$ due to the partition of unity $\rho_{i}$ , we can show that $\sup_{\mathbf{x}\in\mathcal{K}_{i}}|f_{i}(\mathbf{x})|$ is proportional to the width $\Delta$ of $\mathcal{K}_{i}$ . In particular, there exists a constant $c$ depending on $f_{i}$ ’s and $\phi_{i}$ ’s such that

[TABLE]

Then (6) immediately implies the upper bound on $A_{i,3}$ . The formal statement of (6) and its proof are deferred to Lemma 8 and Appendix B.5.

Given Lemma 4, we choose

[TABLE]

so that the approximation error is bounded by $\epsilon$ . Moreover, we choose

[TABLE]

to guarantee $\Delta>8B^{2}D\nu$ so that the definition of $\widehat{\mathds{1}}_{\Delta}$ is valid.

Finally we quantify the size of the ReLU network. Recall that the chart determination sub-network has $c_{1}\log\frac{1}{\nu}$ layers, the Taylor approximation sub-network has $c_{2}\log\frac{1}{\delta}$ layers, and the pairing sub-network has $c_{3}\log\frac{1}{\eta}$ layers. Here $c_{2}$ depends on $d,s,f$ , and $c_{1},c_{3}$ are absolute constants. Combining these with (7) and (8) yields the depth in Theorem 1. By a similar argument, we can obtain the number of neurons and weight parameters. A detailed analysis is given in Appendix B.6.

5 Proof of the Statistical Estimation Theory

In the proof of Theorem 2, we decompose the mean squared error of the estimator $\widehat{f}_{n}$ into a squared bias term and a variance term. We bound the bias and variance separately, where the bias is tackled using the approximation theory (Theorem 1), and the variance is bounded using the metric entropy arguments (van der Vaart and Wellner, 1996; Györfi et al., 2006). We begin with an oracle-type decomposition of the $L_{2}$ risk, in which we introduce the empirical $L_{2}$ risk as the intermediate term:

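$$\mathbb{E}\left[\int_{\mathcal{M}}(\widehat{f}_{n}(\mathbf{x})-f_{0}(\mathbf{x}))^{2}d\mathcal{D}_{x}(\mathbf{x})\right]=\underbrace{2\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}(\widehat{f}_{n}(\mathbf{x}_{i})-f_{0}(\mathbf{x}_{i}))^{2}\right]}_{T_{1}}+\underbrace{\mathbb{E}\left[\int_{\mathcal{M}}(\widehat{f}_{n}(\mathbf{x})-f_{0}(\mathbf{x}))^{2}d\mathcal{D}_{x}(\mathbf{x})\right]-2\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}(\widehat{f}_{n}(\mathbf{x}_{i})-f_{0}(\mathbf{x}_{i}))^{2}\right]}_{T_{2}},$$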

where $T_{1}$ reflects the squared bias of using neural networks for estimating $f_{0}$ and $T_{2}$ is the variance term. We slightly abuse the notation $i$ to denote the index of samples.

5.1 Bias Characterization — Bounding $T_{1}$

Since $T_{1}$ is the empirical $L_{2}$ risk of $\widehat{f}_{n}$ evaluated on the samples $S_{n}$ , we relate $T_{1}$ to the empirical risk (3) by rewriting $f_{0}(\mathbf{x}_{i})=y_{i}-\xi_{i}$ . Substituting into $T_{1}$ , we derive the following decomposition,

[TABLE]

Equality $(i)$ is obtained by expanding the square, where the cross term $\mathbb{E}[\xi_{i}y_{i}]=\mathbb{E}[\xi_{i}(f_{0}(\mathbf{x}_{i})+\xi_{i})]=\mathbb{E}[\xi_{i}^{2}]$ due to the independence between $\mathbf{x}_{i}$ and $\xi_{i}$. Inequality $(ii)$ invokes Jensen’s inequality to pass the expectation. To obtain term $(A)$, we expand $(f(\mathbf{x}_{i})-y_{i})^{2}=(f(\mathbf{x}_{i})-f_{0}(\mathbf{x}_{i})-\xi_{i})^{2}$, and observe the cancellation of $-\xi_{i}^{2}$. Note that term $(A)$ is the squared approximation error of neural networks, and we will tackle it later using Theorem 1. We bound term $(B)$ by quantifying the complexity of the network class $\mathcal{F}(R,\kappa,L,p,K)$. A precise upper bound of $T_{1}$ is given in the following lemma, whose proof follows a similar argument in Schmidt-Hieber (2017, Lemma 4).

Lemma 5.

Fix the neural network class $\mathcal{F}(R,\kappa,L,p,K)$ . For any constant $\delta\in(0,2R)$ , we have

[TABLE]

where $\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\left\lVert\cdot\right\rVert_{\infty})$ denotes the $\delta$-covering number of $\mathcal{F}(R,\kappa,L,p,K)$ with respect to the $\ell_{\infty}$ norm, i.e., there exists a discretization of $\mathcal{F}(R,\kappa,L,p,K)$ into $\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\left\lVert\cdot\right\rVert_{\infty})$ distinct elements, such that for any $f\in\mathcal{F}$, there is $\bar{f}$ in the discretization satisfying $\left\lVert\bar{f}-f\right\rVert_{\infty}\leq\delta$.

Proof Sketch.

Given the derivation in (9), we need to bound term $(B)$. We discretize the neural network class $\mathcal{F}(R,\kappa,L,p,K)$ as $\{f^{*}_{i}\}_{i=1}^{\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\left\lVert\cdot\right\rVert_{\infty})}$. By the definition of covering, there exists $f^{*}$ such that $\lVert\widehat{f}_{n}-f^{*}\rVert_{\infty}\leq\delta$. Denoting $\left\lVert f-f_{0}\right\rVert_{n}^{2}=\frac{1}{n}\sum_{i=1}^{n}(f(\mathbf{x}_{i})-f_{0}(\mathbf{x}_{i}))^{2}$, we cast $(B)$ into

[TABLE]

where $(i)$ follows from Hölder’s inequality and $(ii)$ is obtained by some algebraic manipulation. To break the dependence between $f^{*}$ and the samples, we replace $f^{*}$ by any $f^{*}_{j}$ in the $\delta$ -covering and observe that $\left|\frac{\sum_{i=1}^{n}\xi_{i}(f^{*}(\mathbf{x}_{i})-f_{0}(\mathbf{x}_{i}))}{\sqrt{n}\left\lVert f^{*}-f_{0}\right\rVert_{n}}\right|\leq\max_{j}\left|\frac{\sum_{i=1}^{n}\xi_{i}(f^{*}_{j}(\mathbf{x}_{i})-f_{0}(\mathbf{x}_{i}))}{\sqrt{n}\lVert f^{*}_{j}-f_{0}\rVert_{n}}\right|$ . Applying the Cauchy-Schwarz inequality, we can show

[TABLE]

where $z_{j}=\left|\frac{\sum_{i=1}^{n}\xi_{i}(f^{*}_{j}(\mathbf{x}_{i})-f_{0}(\mathbf{x}_{i}))}{\sqrt{n}\lVert f^{*}_{j}-f_{0}\rVert_{n}}\right|$. Given $\mathbf{x}_{1},\dots,\mathbf{x}_{n}$, we note that $z_{j}$ is a sub-Gaussian random variable with parameter $\sigma$ (i.e., its variance is bounded by $\sigma^{2}$). It is well established in the existing literature on empirical processes (van der Vaart and Wellner, 1996) that the maximum of a collection of squared sub-Gaussian random variables satisfies

[TABLE]

Substituting the above inequality into $(B)$ and combining $(A)$ and $(B)$ , we have

[TABLE]

Some manipulation gives rise to the desired result

[TABLE]

See proof details in Appendix C.1. ∎

5.2 Variance Characterization — Bounding $T_{2}$

We observe that $T_{2}$ is the difference between the population $L_{2}$ risk of $\widehat{f}_{n}$ and its empirical counterpart. However, bounding such a difference is distinct from conventional concentration results due to the scaling factor $2$ before the empirical risk. In particular, we split the empirical risk evenly into two parts, and bound one part using its higher-order moment (fourth moment). Using a Bernstein-type inequality allows us to establish a $1/n$ convergence rate of $T_{2}$ ; the corresponding upper bound is presented in the following lemma.

Lemma 6.

For any constant $\delta\in(0,2R)$ , $T_{2}$ satisfies

[TABLE]

Proof Sketch.

The detailed proof is deferred to Appendix C.2. For notational simplicity, we denote $\widehat{g}(\mathbf{x})=(\widehat{f}_{n}(\mathbf{x})-f_{0}(\mathbf{x}))^{2}$ and note that $\left\lVert\widehat{g}\right\rVert_{\infty}\leq 4R^{2}$ . Applying the inequality $\int_{\mathcal{M}}\widehat{g}^{2}d\mathcal{D}_{x}(\mathbf{x})\leq 4R^{2}\int_{\mathcal{M}}\widehat{g}d\mathcal{D}_{x}(\mathbf{x})$ (Barron, 1991), we rewrite $T_{2}$ as

[TABLE]

We now utilize ghost samples of $\mathbf{x}$ to bound $T_{2}$ , which is a common technique in existing literature on nonparametric statistics (van der Vaart and Wellner, 1996; Györfi et al., 2006). Specifically, let $\bar{\mathbf{x}}_{i}$ ’s be independent replications of $\mathbf{x}_{i}$ ’s. We bound $T_{2}$ as

[TABLE]

where $\mathcal{G}=\{g=(f-f_{0})^{2}~{}|~{}f\in\mathcal{F}(R,\kappa,L,p,K)\}$ . We use the shorthand $\mathbb{E}_{\mathbf{x},\bar{\mathbf{x}}}[\cdot]$ to denote the double integral $\int_{\mathcal{M}}\int_{\mathcal{M}}\cdot d\mathcal{D}_{x}(\mathbf{x})d\mathcal{D}_{x}(\bar{\mathbf{x}})$ with respect to the joint distribution of $(\mathbf{x},\bar{\mathbf{x}})$ . The last inequality holds due to Jensen’s inequality. Note here $g^{2}(\mathbf{x})+g^{2}(\bar{\mathbf{x}})$ contributes as the variance term of $g(\bar{\mathbf{x}}_{i})-g(\mathbf{x}_{i})$ , which yields a fast convergence of $T_{2}$ as $n$ grows.

Similar to bounding $T_{1}$ , we discretize the function space $\mathcal{G}$ using a $\delta$ -covering denoted by $\mathcal{G}^{*}$ . This allows us to replace the supremum by the maximum over a finite set:

[TABLE]

We can bound the above maximum by Bernstein’s inequality, which yields

[TABLE]

The last step is to relate the covering number of $\mathcal{G}$ to that of $\mathcal{F}(R,\kappa,L,p,K)$ . Specifically, consider any $g_{1},g_{2}\in\mathcal{G}$ with $g_{1}=(f_{1}-f_{0})^{2}$ and $g_{2}=(f_{2}-f_{0})^{2}$ , respectively. We can derive

[TABLE]

Therefore, the inequality $\mathcal{N}(\delta,\mathcal{G},\left\lVert\cdot\right\rVert_{\infty})\leq\mathcal{N}(\delta/4R,\mathcal{F}(R,\kappa,L,p,K),\left\lVert\cdot\right\rVert_{\infty})$ holds, which implies

[TABLE]

The proof is complete. ∎
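The covering-number comparison used in the last step can be checked in one line. The following is a sketch of that step, assuming (as the bound $\left\lVert\widehat{g}\right\rVert_{\infty}\leq 4R^{2}$ above suggests) that every $f\in\mathcal{F}(R,\kappa,L,p,K)$ and $f_{0}$ are bounded by $R$ in the $\ell_{\infty}$ norm:

```latex
\begin{aligned}
\lVert g_{1}-g_{2}\rVert_{\infty}
&=\bigl\lVert (f_{1}-f_{0})^{2}-(f_{2}-f_{0})^{2}\bigr\rVert_{\infty}
 =\bigl\lVert (f_{1}-f_{2})\,(f_{1}+f_{2}-2f_{0})\bigr\rVert_{\infty}\\
&\leq \lVert f_{1}-f_{2}\rVert_{\infty}\,
      \bigl(\lVert f_{1}\rVert_{\infty}+\lVert f_{2}\rVert_{\infty}+2\lVert f_{0}\rVert_{\infty}\bigr)
 \leq 4R\,\lVert f_{1}-f_{2}\rVert_{\infty}.
\end{aligned}
```

Hence any $\delta/(4R)$ -covering of $\mathcal{F}(R,\kappa,L,p,K)$ induces a $\delta$ -covering of $\mathcal{G}$ , which is the stated comparison of covering numbers.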

5.3 Covering Number of Neural Networks

The upper bounds of $T_{1}$ and $T_{2}$ in Lemmas 5 and 6 both depend on the covering number of the network class $\mathcal{F}(R,\kappa,L,p,K)$ . In this section, we provide an upper bound on the covering number $\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\left\lVert\cdot\right\rVert_{\infty})$ for a given resolution $\delta>0$ . Since each weight parameter in the network is bounded by a constant $\kappa$ , we construct a covering by partitioning the range of each weight parameter into a uniform grid. By choosing a proper grid size, we show the following lemma.

Lemma 7.

Given $\delta>0$ , the $\delta$ -covering number of the neural network class $\mathcal{F}(R,\kappa,L,p,K)$ satisfies

[TABLE]

Proof Sketch.

Consider $f,f^{\prime}\in\mathcal{F}(R,\kappa,L,p,K)$ with each weight parameter differing by at most $h$ . By an induction on the number of layers in the network, we show that the $\ell_{\infty}$ norm of the difference $f-f^{\prime}$ scales as

[TABLE]

As a result, to achieve a $\delta$ -covering, it suffices to choose $h$ such that $hL(pB+2)(\kappa p)^{L-1}=\delta$ . Moreover, there are ${Lp^{2}\choose{K}}\leq(Lp^{2})^{K}$ different choices of $K$ non-zero entries out of $Lp^{2}$ weight parameters. Therefore, the covering number is bounded by

[TABLE]

The detailed proof is provided in Appendix C.3. ∎
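The counting argument in this proof sketch can be illustrated numerically. The snippet below is a minimal sketch, not the authors' code: the function names are hypothetical, and it assumes that each of the $K$ nonzero weights ranges over $[-\kappa,\kappa]$ and is therefore covered by $2\kappa/h$ grid values. Under that assumption the count $(Lp^{2})^{K}(2\kappa/h)^{K}$ simplifies to $\left(2L^{2}(pB+2)\kappa^{L}p^{L+1}/\delta\right)^{K}$ .

```python
import math


def grid_size(delta, L, p, B, kappa):
    """Grid size h solving h * L * (p*B + 2) * (kappa*p)**(L-1) = delta."""
    return delta / (L * (p * B + 2) * (kappa * p) ** (L - 1))


def log_covering_bound(delta, L, p, K, B, kappa):
    """Log of the covering-number bound obtained by gridding the weights.

    Assumption (for illustration): there are at most (L p^2)^K sparsity
    patterns, and each nonzero weight in [-kappa, kappa] takes one of
    2*kappa/h grid values.
    """
    h = grid_size(delta, L, p, B, kappa)
    return K * (math.log(L * p * p) + math.log(2 * kappa / h))


def log_closed_form(delta, L, p, K, B, kappa):
    """Log of (2 L^2 (pB+2) kappa^L p^(L+1) / delta)^K."""
    return K * math.log(2 * L ** 2 * (p * B + 2) * kappa ** L * p ** (L + 1) / delta)
```

Working with logarithms avoids overflow, and the log-covering number is the quantity that enters the bounds on $T_{1}$ and $T_{2}$ ; it grows linearly in $K$ and only logarithmically in $1/\delta$ .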

5.4 Bias-Variance Trade-off

We are ready to finish the proof of Theorem 2. Combining the upper bounds of $T_{1}$ in Lemma 5 and $T_{2}$ in Lemma 6 together and substituting the covering number (10), we obtain

[TABLE]

It suffices to choose $\delta=1/n$ , which gives rise to

[TABLE]

where we also plug in the covering number upper bound in Lemma 7. We further set the approximation error as $\epsilon$ , i.e., $\inf_{f\in\mathcal{F}(R,\kappa,L,p,K)}\lVert f(\mathbf{x})-f_{0}(\mathbf{x})\rVert_{\infty}\leq\epsilon$ . Theorem 1 suggests that we choose $L=\widetilde{O}(\log\frac{1}{\epsilon})$ , $p=\widetilde{O}(\epsilon^{-\frac{d}{s+\alpha}})$ , and $K=\widetilde{O}\left(\epsilon^{-\frac{d}{s+\alpha}}\log\frac{1}{\epsilon}+D\log\frac{1}{\epsilon}\right)$ . Substituting $L$ , $p$ , and $K$ into (5.4), we have

[TABLE]

To balance the error terms, we pick $\epsilon$ satisfying $\epsilon^{2}=\frac{1}{n}\epsilon^{-\frac{d}{s+\alpha}}$ , which gives $\epsilon=n^{-\frac{s+\alpha}{d+2(s+\alpha)}}$ . The proof of Theorem 2 is complete by plugging in $\epsilon=n^{-\frac{s+\alpha}{d+2(s+\alpha)}}$ and rearranging the terms.
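The balancing step can be verified numerically. The sketch below uses hypothetical function names and ignores constants and logarithmic factors; it solves $\epsilon^{2}=\frac{1}{n}\epsilon^{-\frac{d}{s+\alpha}}$ in closed form and returns the resulting exponent of $n$ in the squared error.

```python
def balanced_epsilon(n, s, alpha, d):
    """Return eps solving eps**2 == eps**(-d/(s+alpha)) / n."""
    beta = s + alpha
    return n ** (-beta / (2 * beta + d))


def mse_rate_exponent(s, alpha, d):
    """Exponent r such that eps**2 == n**(-r) at the balanced eps."""
    beta = s + alpha
    return 2 * beta / (2 * beta + d)
```

Note that the ambient dimension $D$ does not appear in either function: at this level of detail, only the intrinsic dimension $d$ and the smoothness $s+\alpha$ determine the polynomial rate.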

6 Conclusion

We study nonparametric regression of functions supported on a $d$ -dimensional Riemannian manifold $\mathcal{M}$ isometrically embedded in $\mathbb{R}^{D}$ , using deep ReLU neural networks. Our result establishes an efficient statistical estimation theory for general regression functions including $C^{s}$ and Hölder functions supported on manifolds. We show that the $L_{2}$ loss for the estimation of $f_{0}\in\mathcal{H}^{s,\alpha}(\mathcal{M})$ converges in the order of $n^{-\frac{s+\alpha}{2(s+\alpha)+d}}$ . To obtain an $\epsilon$ -error for the estimation of $f_{0}$ , the sample complexity scales in the order of $\epsilon^{-\frac{2(s+\alpha)+d}{s+\alpha}}$ . This sample complexity depends on the intrinsic dimension $d$ , and demonstrates that deep neural networks are adaptive to low-dimensional geometric structures of data sets. Such results can be viewed as theoretical justifications for the empirical success of deep learning in various real-world applications where the data sets exhibit low-dimensional structures.

Acknowledgment

This work was supported by NSF DMS 1818751, NSF DMS 2012652, and NSF IIS-1717916.

Appendix A Proofs of the Preliminary Results in Section 4

A.1 Proof of Lemma 1

Proof.

We partition the interval $[0,1]$ uniformly into $2^{N}$ subintervals $I_{k}=[\frac{k}{2^{N}},\frac{k+1}{2^{N}}]$ for $k=0,\dots,2^{N}-1$ . We approximate $f(x)=x^{2}$ on these subintervals by a linear interpolation

[TABLE]

It is straightforward to check that $\widehat{f}_{k}$ meets $f$ at the endpoints $\frac{k}{2^{N}},\frac{k+1}{2^{N}}$ of $I_{k}$ .

We evaluate the approximation error of $\widehat{f}_{k}$ on the interval $I_{k}$ :

[TABLE]

Note that this approximation error does not depend on $k$ . Thus, in order to achieve an $\epsilon$ approximation error, we only need

[TABLE]

Since $2\log 2>1$ , we let $N=\left\lceil\log\frac{1}{\epsilon}\right\rceil$ and denote $f_{N}=\sum_{k=0}^{2^{N}-1}\widehat{f}_{k}\mathds{1}\{x\in I_{k}\}$ . We compute the increment from $f_{N-1}$ to $f_{N}$ for $x\in\left[\frac{k}{2^{N-1}},\frac{k+1}{2^{N-1}}\right]$ as

[TABLE]

We observe that $f_{N-1}-f_{N}$ is a triangular function on $\left[\frac{k}{2^{N-1}},\frac{k+1}{2^{N-1}}\right]$ . The maximum is $\frac{1}{2^{2N}}$ , independent of $k$ , attained at $x=\frac{2k+1}{2^{N}}$ . The minimum is $0$ , attained at the endpoints $\frac{k}{2^{N-1}},\frac{k+1}{2^{N-1}}$ . To implement $f_{N}$ , we consider a triangular function representable by a one-layer ReLU network:

[TABLE]

Denote by $g_{m}=g\circ g\circ\cdots\circ g$ the composition of $m$ copies of $g$ . Observe that $g_{m}$ is a sawtooth function with $2^{m-1}$ peaks at $\frac{2k+1}{2^{m}}$ for $k=0,\dots,2^{m-1}-1$ , and we have $g_{m}\left(\frac{2k+1}{2^{m}}\right)=1$ for $k=0,\dots,2^{m-1}-1$ . Then we have $f_{N-1}-f_{N}=\frac{1}{2^{2N}}g_{N}$ . By induction, we have

[TABLE]

Therefore, $f_{N}$ can be implemented by a ReLU network of depth $\left\lceil\log\frac{1}{\epsilon}\right\rceil\leq\log\frac{1}{\epsilon}+1$ . Meanwhile, each layer consists of at most 3 neurons. Hence, the total number of neurons and weight parameters is no more than $c\log\frac{1}{\epsilon}$ for an absolute constant $c$ . ∎
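The construction in this proof can be sketched in a few lines. The code below is illustrative only: it assumes the standard triangular function $g(x)=2\,\textrm{ReLU}(x)-4\,\textrm{ReLU}(x-\frac{1}{2})+2\,\textrm{ReLU}(x-1)$ (one common choice consistent with the properties of $g_{m}$ stated above) and uses $f_{0}(x)=x$ together with the increments $f_{m-1}-f_{m}=\frac{1}{2^{2m}}g_{m}$ , so that $f_{N}=x-\sum_{m=1}^{N}\frac{1}{2^{2m}}g_{m}$ . The function names are hypothetical.

```python
def relu(x):
    return max(0.0, x)


def tooth(x):
    """Triangular function on [0, 1] with peak value 1 at x = 1/2 (assumed form of g)."""
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)


def square_approx(x, N):
    """Piecewise-linear interpolant f_N of x**2 on [0, 1] with 2**N pieces."""
    out = x      # f_0(x) = x interpolates x**2 at the endpoints 0 and 1
    g = x
    for m in range(1, N + 1):
        g = tooth(g)          # g_m = g composed m times (sawtooth)
        out -= g / 4 ** m     # f_m = f_{m-1} - g_m / 2**(2m)
    return out
```

Each loop iteration corresponds to one additional layer, so the depth grows linearly in $N$ while the interpolation error, $2^{-2N-2}$ for a uniform grid of width $2^{-N}$ , decays exponentially.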

A.2 Proof of Corollary 1

Proof.

Let $\widehat{f}_{\delta}$ be an approximation of the quadratic function on $[0,1]$ with error $\delta\in(0,1)$ . We set

[TABLE]

Now we determine $\delta$ . We bound the error of $\widehat{\times}$

[TABLE]

Thus, we pick $\delta=\frac{\epsilon}{2C^{2}}$ to ensure $\left|\widehat{\times}(x,y)-xy\right|\leq\epsilon$ for any inputs $x$ and $y$ . As shown in Lemma 1, we can implement $\widehat{f}_{\delta}$ using a ReLU network of depth at most $c^{\prime}\log\frac{1}{\delta}=c\log\frac{C^{2}}{\epsilon}$ with absolute constants $c^{\prime},c$ . The proof is complete. ∎
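For illustration, the product approximation can be sketched via the polarization identity $xy=C^{2}\left[\left(\frac{|x+y|}{2C}\right)^{2}-\left(\frac{|x-y|}{2C}\right)^{2}\right]$ for $|x|,|y|\leq C$ . This specific form of $\widehat{\times}$ is an assumption made here; it is consistent with the error bound $2C^{2}\delta$ and the choice $\delta=\frac{\epsilon}{2C^{2}}$ above. All function names are hypothetical, and the quadratic approximation reuses the sawtooth construction from Lemma 1 with an assumed triangular function.

```python
import math


def relu(x):
    return max(0.0, x)


def tooth(x):
    # Assumed triangular function on [0, 1], peak 1 at x = 1/2.
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)


def square_approx(x, N):
    # Piecewise-linear interpolant of x**2 on [0, 1]; error at most 2**(-2N-2).
    out, g = x, x
    for m in range(1, N + 1):
        g = tooth(g)
        out -= g / 4 ** m
    return out


def product_approx(x, y, C, eps):
    """Approximate x*y for |x|, |y| <= C up to error eps using ReLU operations only."""
    delta = eps / (2 * C * C)
    N = max(1, math.ceil(math.log2(1.0 / delta)))    # ensures 2**(-2N-2) <= delta
    u = (relu(x + y) + relu(-(x + y))) / (2 * C)     # |x + y| / (2C), lies in [0, 1]
    v = (relu(x - y) + relu(-(x - y))) / (2 * C)     # |x - y| / (2C), lies in [0, 1]
    return C * C * (square_approx(u, N) - square_approx(v, N))
```

The absolute values are themselves written as $\textrm{ReLU}(z)+\textrm{ReLU}(-z)$ , so the whole map is a ReLU network whose depth scales with $\log\frac{C^{2}}{\epsilon}$ .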

Appendix B Proof of Approximation Theory of ReLU Network (Theorem 1)

This section consists of the detailed proofs of Lemma 2, Lemma 3, local approximation theory Theorem 3, error decomposition in Lemma 4 and a technical Lemma 8 for bounding the error, as well as the configuration of the desired ReLU network class for universally approximating Hölder functions.

B.1 Proof of Lemma 2

Proof.

We first show ${\sf P}_{i}$ defined on $U_{i}$ is a homeomorphism, which implies $(U_{i},{\sf P}_{i})$ is a chart on the manifold. Then by Proposition 6.10 in Tu (2010), we conclude ${\sf P}_{i}$ is a diffeomorphism.

To show ${\sf P}_{i}$ is a homeomorphism on $U_{i}$ , we only need to show ${\sf P}_{i}$ has a continuous inverse. By Lemma 5.4 in Niyogi et al. (2008), the derivative of ${\sf P}_{i}$ is nonsingular in $U_{i}$ . The inverse function theorem implies that ${\sf P}_{i}$ is locally invertible in an open neighborhood $\mathcal{B}(\mathbf{c}_{i},c\tau)\bigcap\mathcal{M}$ for some constant $c>0$ . In the following, we show by contradiction that we can take $c\geq 1/4$ . Suppose not. Then there exist distinct points $\mathbf{a},\mathbf{b}\in U_{i}$ such that ${\sf P}_{i}(\mathbf{a})={\sf P}_{i}(\mathbf{b})$ with $\left\lVert\mathbf{a}-\mathbf{c}_{i}\right\rVert_{2}<\tau/4$ and $\left\lVert\mathbf{b}-\mathbf{c}_{i}\right\rVert_{2}<\tau/4$ . Using the triangle inequality, we obtain $\left\lVert\mathbf{a}-\mathbf{b}\right\rVert_{2}<\tau/2$ . Applying Proposition 6.3 in Niyogi et al. (2008), we derive

[TABLE]

Furthermore, using Proposition 6.2 in Niyogi et al. (2008), we lower bound the angle between the tangent spaces $T_{\mathbf{c}_{i}}(\mathcal{M})$ and $T_{\mathbf{a}}(\mathcal{M})$ by

[TABLE]

On the other hand, we consider a unit speed geodesic $\gamma(t)$ starting from $\mathbf{a}$ and ending at $\mathbf{b}$ , with $\gamma(0)=\mathbf{a}$ , $\gamma(d_{\mathcal{M}}(\mathbf{a},\mathbf{b}))=\mathbf{b}$ , and $\left\lVert\dot{\gamma}\right\rVert_{2}=1$ . Integration by parts yields

[TABLE]

Rearranging terms gives rise to

[TABLE]

where the last inequality follows from Proposition 6.1 in Niyogi et al. (2008). Dividing (B.2) by $d_{\mathcal{M}}(\mathbf{a},\mathbf{b})$ and plugging in $d_{\mathcal{M}}(\mathbf{a},\mathbf{b})\leq\tau$ , we have

[TABLE]

For any unit vector $\mathbf{v}\in T_{\mathbf{c}_{i}}(\mathcal{M})$ , we evaluate the inner product

[TABLE]

where $\left|\left\langle\frac{\mathbf{b}-\mathbf{a}}{d_{\mathcal{M}}(\mathbf{a},\mathbf{b})},\mathbf{v}\right\rangle\right|=0$ in equality $(i)$ , since ${\sf P}_{i}(\mathbf{a})={\sf P}_{i}(\mathbf{b})$ by our assertion. Combining (B.1) and (B.1), we obtain

[TABLE]

which is a contradiction. Therefore, we conclude that ${\sf P}_{i}$ is injective, and hence invertible on the local neighborhood $\mathcal{B}(\mathbf{c}_{i},\tau/4)\bigcap\mathcal{M}$ . The continuity of ${\sf P}_{i}$ follows from its definition, and the inverse map of a continuous map is also continuous. Therefore, ${\sf P}_{i}$ is a homeomorphism on $\mathcal{B}(\mathbf{c}_{i},r)\bigcap\mathcal{M}$ for $r\leq\tau/4$ .

The last step is to show ${\sf P}_{i}$ is also a diffeomorphism. We leverage the following proposition.

Proposition 3 (Proposition 6.10 in Tu (2010)).

If $(U,\phi)$ is a chart on a manifold $\mathcal{M}$ , then the coordinate map $\phi:U\mapsto\phi(U)$ is a diffeomorphism.

Since ${\sf P}_{i}$ is a homeomorphism, we deduce that $(U_{i},{\sf P}_{i})$ is a chart of $\mathcal{M}$ . Applying Proposition 3, we conclude that ${\sf P}_{i}$ is a diffeomorphism. ∎

B.2 Proof of Lemma 3

Proof.

Recall that we choose local coordinate neighborhood $U_{i}$ in Step 1 in Section 4. Let ${\sf P}_{i}$ be the projection onto the tangent space $T_{\mathbf{c}_{i}}(\mathcal{M})$ . Then $\{(U_{i},{\sf P}_{i})\}$ is an atlas of $\mathcal{M}$ . Without loss of generality, we assume that $\{(U_{i},{\sf P}_{i})\}$ verifies the Hölder condition in Definition 5. Now we rewrite $f_{i}\circ\phi_{i}^{-1}$ as

[TABLE]

By the definition of the partition of unity, we know $g_{2}$ is $C^{\infty}$ . This implies that $g_{2}$ is $(s+1)$ times continuously differentiable. Since $\textrm{supp}(\rho_{i})$ is compact, the $k$ -th derivative of $g_{2}$ is uniformly bounded by $\lambda_{i,k}$ for any $k\leq s+1$ . Let $\lambda_{i}=\max_{k\leq s+1}\lambda_{i,k}$ . We have for any $|\mathbf{s}|\leq s$ and $\mathbf{x}_{1},\mathbf{x}_{2}\in U_{i}$ ,

[TABLE]

The last inequality follows from $\phi_{i}(\mathbf{x})=b_{i}(V_{i}^{\top}(\mathbf{x}-\mathbf{c}_{i})+\mathbf{u}_{i})$ and $\left\lVert V_{i}\right\rVert_{2}=1$ . Since $U_{i}$ is bounded, we have $\left\lVert\mathbf{x}_{1}-\mathbf{x}_{2}\right\rVert_{2}^{1-\alpha}\leq(2r)^{1-\alpha}$ . Absorbing $\left\lVert\mathbf{x}_{1}-\mathbf{x}_{2}\right\rVert_{2}^{1-\alpha}$ into $\sqrt{d}\lambda_{i}b_{i}^{1-\alpha}$ , we conclude that the derivatives of $g_{2}$ are Hölder continuous. We denote $\beta_{i,\alpha}=\sqrt{d}\lambda_{i}b_{i}^{1-\alpha}(2r)^{1-\alpha}\leq\sqrt{d}\lambda_{i}(2r)^{1-\alpha}$ . Similarly, $g_{1}$ is $C^{s-1}$ by Assumption 3. Then there exists a constant $\mu_{i}$ such that the $k$ -th derivative of $g_{1}$ is uniformly bounded by $\mu_{i}$ for any $k\leq s-1$ . These derivatives are also Hölder continuous with coefficient $\theta_{i,\alpha}\leq\sqrt{d}\mu_{i}(2r)^{1-\alpha}$ .

By the Leibniz rule, for any $|\mathbf{s}|=s$ , we expand the $s$ -th derivative of $f_{i}\circ\phi_{i}^{-1}$ as

[TABLE]

Consider each summand in the above right-hand side. For any $\mathbf{x}_{1},\mathbf{x}_{2}\in U_{i}$ , we derive

[TABLE]

Observe that there are $2^{s}$ summands in total on the right-hand side of (B.4). Therefore, for any $\mathbf{x}_{1},\mathbf{x}_{2}\in U_{i}$ and $|\mathbf{s}|=s$ , we have

[TABLE]

∎

B.3 Proof of Theorem 3

Proof.

The proof consists of two steps. We first approximate $f_{i}\circ\phi_{i}^{-1}$ by a Taylor polynomial, and then implement the Taylor polynomial using a ReLU network. To ease the analysis, we extend $f_{i}\circ\phi_{i}^{-1}$ to the whole cube $[0,1]^{d}$ by assigning $f_{i}\circ\phi_{i}^{-1}(\mathbf{x})=0$ for $\phi_{i}(\mathbf{x})\in[0,1]^{d}\setminus\phi_{i}(U_{i})$ . It is straightforward to check that this extension preserves the regularity of $f_{i}\circ\phi_{i}^{-1}$ , since $f_{i}$ vanishes on the complement of the compact set $\textrm{supp}(\rho_{i})\subset U_{i}$ . For notational simplicity, we denote $f_{i}^{\phi}=f_{i}\circ\phi_{i}^{-1}$ with the extension. Accordingly, Lemma 3 can be extended to the whole cube $[0,1]^{d}$ without changing its proof, i.e., for any $\mathbf{x}_{1},\mathbf{x}_{2}\in[0,1]^{d}$ and $|\mathbf{s}|=s$ , we have

[TABLE]

Step 1. We define a trapezoid function

[TABLE]

Note that we have $\left\lVert\psi\right\rVert_{\infty}=1$ . Let $N$ be a positive integer. We form a uniform grid on $[0,1]^{d}$ by dividing each coordinate into $N$ subintervals. We then consider a partition of unity on this grid defined by

[TABLE]

We can check that $\sum_{\mathbf{m}}\zeta_{\mathbf{m}}(\mathbf{x})=1$ as in Figure 8.
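The partition-of-unity property can also be checked numerically. The sketch below uses the single-layer ReLU form of the trapezoid function given in Step 2 of this proof, $\psi(x)=\textrm{ReLU}(x+2)-\textrm{ReLU}(x+1)-\textrm{ReLU}(x-1)+\textrm{ReLU}(x-2)$ , and assumes $\zeta_{\mathbf{m}}(\mathbf{x})=\prod_{k=1}^{d}\psi(3Nx_{k}-3m_{k})$ with $\mathbf{m}\in\{0,\dots,N\}^{d}$ , which matches the support of $\zeta_{\mathbf{m}}$ and the bound on $\psi(3Nx_{k}-3m_{k})$ used in this proof. Function names are hypothetical.

```python
from itertools import product


def relu(x):
    return max(0.0, x)


def psi(x):
    """Trapezoid: equals 1 on [-1, 1], decays linearly to 0 at |x| = 2."""
    return relu(x + 2) - relu(x + 1) - relu(x - 1) + relu(x - 2)


def zeta(x, m, N):
    """Assumed product form of the bump centred at the grid point m / N."""
    value = 1.0
    for xk, mk in zip(x, m):
        value *= psi(3 * N * xk - 3 * mk)
    return value


def partition_sum(x, N):
    """Sum of zeta_m(x) over all grid indices m in {0, ..., N}^d."""
    return sum(zeta(x, m, N) for m in product(range(N + 1), repeat=len(x)))
```

For any point of $[0,1]^{d}$ , `partition_sum` should return $1$ up to floating-point error: in each coordinate, the shifted trapezoids have plateaus of width $2$ and linear ramps of width $1$ that overlap exactly.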

We also observe that $\textrm{supp}(\zeta_{\mathbf{m}})=\left\{\mathbf{x}:\left|x_{k}-\frac{m_{k}}{N}\right|\leq\frac{2}{3N},k=1,\dots,d\right\}\subset\left\{\mathbf{x}:\left|x_{k}-\frac{m_{k}}{N}\right|\leq\frac{1}{N},k=1,\dots,d\right\}$ . We use the slightly enlarged support set of length $2/N$ to simplify the constant computation. Now we construct a Taylor polynomial of degree $s$ for approximating $f_{i}^{\phi}$ at $\frac{\mathbf{m}}{N}$ :

[TABLE]

Define $\bar{f}_{i}=\sum_{\mathbf{m}\in\{0,\dots,N\}^{d}}\zeta_{\mathbf{m}}P_{\mathbf{m}}$ . We bound the approximation error $\left\lVert\bar{f}_{i}-f_{i}^{\phi}\right\rVert_{\infty}$ :

[TABLE]

Here $\mathbf{y}$ is the linear interpolation of $\frac{\mathbf{m}}{N}$ and $\mathbf{x}$ , determined by the Taylor remainder, and inequality $(i)$ follows from the Taylor expansion of $f_{i}^{\phi}$ around $\mathbf{m}/N$ . Note that only $s$ -th order derivative remains in step $(i)$ and there are at most $d^{s}$ terms. Inequality $(ii)$ is obtained by the Hölder continuity in the inequality (B.5).

By setting

[TABLE]

we get $N\geq\left(\frac{\sqrt{d}\mu_{i}\lambda_{i}(2r)^{1-\alpha}2^{d+s+2}d^{s+\alpha/2}}{\delta s!}\right)^{\frac{1}{s+\alpha}}$ . Accordingly, the approximation error is bounded by $\lVert\bar{f}_{i}-f_{i}^{\phi}\rVert_{\infty}\leq\frac{\delta}{2}$ .

Step 2. We next construct a ReLU network $\widetilde{f}_{i}$ that approximates $\bar{f}_{i}$ up to an error $\frac{\delta}{2}$ . We denote

[TABLE]

where $a_{\mathbf{m},\mathbf{s}}=\frac{D^{\mathbf{s}}f_{i}^{\phi}}{\mathbf{s}!}\bigg{|}_{\mathbf{x}=\frac{\mathbf{m}}{N}}$ . Then we rewrite $\bar{f}_{i}$ as

[TABLE]

Note that (B.6) is a linear combination of products $\zeta_{\mathbf{m}}\left(\mathbf{x}-\frac{\mathbf{m}}{N}\right)^{\mathbf{s}}$ . Each product involves at most $d+s$ univariate terms: $d$ terms for $\zeta_{\mathbf{m}}$ and $s$ terms for $\left(\mathbf{x}-\frac{\mathbf{m}}{N}\right)^{\mathbf{s}}$ . We recursively apply Corollary 1 to implement the product. Specifically, let $\widehat{\times}_{\epsilon}$ be the approximation of the product operator in Corollary 1 with error $\epsilon$ , which will be chosen later. Consider the following chain application of $\widehat{\times}_{\epsilon}$ :

[TABLE]

Now we estimate the error of the above approximation. Note that we have $|\psi(3Nx_{k}-3m_{k})|\leq 1$ and $\left|x_{k}-\frac{m_{k}}{N}\right|\leq 1$ for all $k\in\{1,\dots,d\}$ and $\mathbf{x}\in[0,1]^{d}$ . We then have

[TABLE]

Moreover, we have $\widetilde{f}_{\mathbf{m},\mathbf{s}}(\mathbf{x})=\zeta_{\mathbf{m}}\left(\mathbf{x}-\frac{\mathbf{m}}{N}\right)^{\mathbf{s}}=0$ , if $\mathbf{x}\not\in\textrm{supp}(\zeta_{\mathbf{m}})$ . Now we define

[TABLE]

The approximation error is bounded by

[TABLE]

We choose $\epsilon=\frac{\delta}{\lambda_{i}\mu_{i}2^{d+s+2}d^{s}(d+s)}$ , so that $\lVert\bar{f}_{i}-\widetilde{f}_{i}\rVert_{\infty}\leq\frac{\delta}{2}$ . Thus, we eventually have $\lVert\widetilde{f}_{i}-f_{i}^{\phi}\rVert_{\infty}\leq\delta$ . Now we compute the depth and the number of computational units needed to implement $\widetilde{f}_{i}$ . $\widetilde{f}_{i}$ can be implemented by a collection of parallel sub-networks that compute each $\widetilde{f}_{\mathbf{m},\mathbf{s}}$ . The total number of parallel sub-networks is bounded by $d^{s}(N+1)^{d}$ . For each sub-network, we observe that $\psi$ can be exactly implemented by a single-layer ReLU network, i.e., $\psi(x)=\textrm{ReLU}(x+2)-\textrm{ReLU}(x+1)-\textrm{ReLU}(x-1)+\textrm{ReLU}(x-2)$ . Corollary 1 shows that $\widehat{\times}_{\epsilon}$ can be implemented by a depth $c_{1}\log\frac{1}{\epsilon}$ ReLU network. Therefore, the whole network for implementing $\widetilde{f}_{i}$ has no more than $c^{\prime}_{1}\left(\log\frac{1}{\epsilon}+1\right)$ layers with width bounded by $O(d^{s}(N+1)^{d})$ and $c^{\prime}_{1}d^{s}(N+1)^{d}\left(\log\frac{1}{\epsilon}+1\right)$ neurons and weight parameters. With $\epsilon=\frac{\delta}{\lambda_{i}\mu_{i}2^{d+s+2}d^{s}(d+s)}$ and $N=\Big\lceil\big(\frac{\sqrt{d}\mu_{i}\lambda_{i}(2r)^{1-\alpha}2^{d+s+2}d^{s+\alpha/2}}{\delta s!}\big)^{\frac{1}{s+\alpha}}\Big\rceil$ , we obtain that the whole network has no more than $L=c_{1}\log\frac{1}{\delta}$ layers, with width bounded by $p=c_{2}\delta^{-\frac{d}{s+\alpha}}$ , and at most $K=c_{3}\delta^{-\frac{d}{s+\alpha}}\left(\log\frac{1}{\delta}+1\right)$ neurons and weight parameters, for constants $c_{1},c_{2},c_{3}$ depending on $d,s,\tau$ , and the upper bound of the derivatives of $f_{i}\circ\phi_{i}^{-1}$ up to order $s$ . Lastly, from (B.6), we see that each parameter has a range bounded by the upper bound of the derivatives of $f_{i}\circ\phi_{i}^{-1}$ up to order $s$ , which scales as $\sqrt{d}$ as in (B.5). ∎

B.4 Proof of Lemma 4

Proof.

We expand the estimation error as

[TABLE]

The first two terms $A_{i,1},A_{i,2}$ are straightforward to handle, since by the construction we have

[TABLE]

By Lemma 8, we have $\max_{\mathbf{x}\in\mathcal{K}_{i}}|f_{i}(\mathbf{x})|\leq\frac{c(\pi+1)}{r(1-r/\tau)}\Delta$ for a constant $c$ depending on $f_{i}$ . Then we bound $A_{i,3}$ as

[TABLE]

∎

B.5 Helper Lemma for Bounding $A_{i,3}$ and Its Proof

Lemma 8.

For any $i=1,\dots,C_{\mathcal{M}}$ , denote

[TABLE]

Then there exists a constant $c$ depending on the upper bounds of the first derivatives of the partition of unity $\rho_{i}$ ’s and coordinate system $\phi_{i}$ ’s such that

[TABLE]

Proof.

We extend $f_{i}\circ\phi_{i}^{-1}$ to the whole cube $[0,1]^{d}$ as in the proof of Theorem 3. We also have $f_{i}(\mathbf{x})=0$ for $\left\lVert\mathbf{x}-\mathbf{c}_{i}\right\rVert_{2}=r$ . By the first order Taylor expansion, for any $\mathbf{x},\mathbf{y}\in U_{i}$ , we have

[TABLE]

where $\mathbf{z}$ is a linear interpolation of $\phi_{i}(\mathbf{x})$ and $\phi_{i}(\mathbf{y})$ satisfying the mean value theorem. Since $f_{i}\circ\phi_{i}^{-1}$ is $C^{s}$ in $[0,1]^{d}$ , the first derivative is uniformly bounded, i.e., $\left\lVert\nabla f_{i}\circ\phi_{i}^{-1}(\mathbf{z})\right\rVert_{2}\leq\alpha_{i}$ for any $\mathbf{z}\in[0,1]^{d}$ . Let $\mathbf{y}\in U_{i}$ satisfy $f_{i}(\mathbf{y})=0$ . In order to bound the function value for any $\mathbf{x}\in\mathcal{K}_{i}$ , we only need to bound the Euclidean distance between $\mathbf{x}$ and $\mathbf{y}$ . More specifically, for any $\mathbf{x}\in\mathcal{K}_{i}$ , we need to show that there exists $\mathbf{y}\in U_{i}$ satisfying $f_{i}(\mathbf{y})=0$ , such that $\left\lVert\mathbf{x}-\mathbf{y}\right\rVert_{2}$ is sufficiently small.

Before continuing with the proof, we introduce some notations. Let $\gamma(t)$ be a geodesic on $\mathcal{M}$ parameterized by the arc length. In the following context, we use $\dot{\gamma}$ and $\ddot{\gamma}$ to denote the first and second derivatives of $\gamma$ with respect to $t$ . By the definition of geodesic, we have $\left\lVert\dot{\gamma}(t)\right\rVert_{2}=1$ (unit speed) and $\ddot{\gamma}(t)\perp\dot{\gamma}(t)$ .

Without loss of generality, we shift $\mathbf{c}_{i}$ to $\mathbf{0}$ . We consider a geodesic starting from $\mathbf{x}$ with initial “velocity” $\dot{\gamma}(0)=\mathbf{v}$ in the tangent space of $\mathcal{M}$ at $\mathbf{x}$ . To utilize polar coordinates, we define two auxiliary quantities: $\ell(t)=\left\lVert\gamma(t)\right\rVert_{2}$ and $\theta(t)=\arccos\frac{\gamma(t)^{\top}\dot{\gamma}(t)}{\left\lVert\gamma(t)\right\rVert_{2}}\in[0,\pi]$ . As can be seen in Figure 9, $\ell$ and $\theta$ have clear geometrical interpretations: $\ell$ is the radial distance from the center $\mathbf{c}_{i}$ , and $\theta$ is the angle between the velocity and $\gamma(t)$ .

Suppose $\mathbf{y}=\gamma(T)$ , we need to upper bound $T$ . Note that $\ell(T)-\ell(0)\leq r-\sqrt{r^{2}-\Delta}\leq\Delta/r$ . Moreover, observe that the derivative of $\ell$ is $\dot{\ell}(t)=\cos\theta(t)$ , since $\gamma$ has unit speed. It suffices to find a lower bound on $\dot{\ell}(t)=\cos\theta(t)$ so that $T\leq\frac{\Delta}{r\inf_{t}\dot{\ell}(t)}$ .
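The two elementary facts used in this paragraph can be spelled out as follows. This is a short worked derivation, assuming $0\leq\Delta\leq r^{2}$ and $\inf_{t}\dot{\ell}(t)>0$ on $[0,T]$ :

```latex
\begin{aligned}
r-\sqrt{r^{2}-\Delta}
&=\frac{\bigl(r-\sqrt{r^{2}-\Delta}\bigr)\bigl(r+\sqrt{r^{2}-\Delta}\bigr)}{r+\sqrt{r^{2}-\Delta}}
 =\frac{\Delta}{r+\sqrt{r^{2}-\Delta}}
 \leq\frac{\Delta}{r},\\[4pt]
\frac{\Delta}{r}
&\geq \ell(T)-\ell(0)
 =\int_{0}^{T}\dot{\ell}(t)\,dt
 \geq T\,\inf_{t}\dot{\ell}(t)
\quad\Longrightarrow\quad
T\leq\frac{\Delta}{r\,\inf_{t}\dot{\ell}(t)}.
\end{aligned}
```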

We immediately have the second derivative of $\ell$ as $\ddot{\ell}(t)=-\sin\theta(t)\dot{\theta}(t)$ . Meanwhile, using the equation $\ell(t)=\sqrt{\gamma(t)^{\top}\gamma(t)}$ , we also have

[TABLE]

Note that by definition, we have $\dot{\gamma}(t)^{\top}\dot{\gamma}(t)=1$ and $\gamma(t)^{\top}\dot{\gamma}(t)=\cos\theta(t)\sqrt{\gamma(t)^{\top}\gamma(t)}$ . Plugging into (B.7), we can derive

[TABLE]

Now we find a lower bound on $\ddot{\gamma}(t)^{\top}\gamma(t)$ . Specifically, by the Cauchy-Schwarz inequality, we have

[TABLE]

The last inequality follows from $\left\lVert\ddot{\gamma}(t)\right\rVert_{2}\leq\frac{1}{\tau}$ (Niyogi et al., 2008) and $\left\lVert\gamma(t)\right\rVert_{2}\leq r$ . We now need to bound $\angle(\ddot{\gamma}(t),\gamma(t))$ , given $\angle\left(\gamma(t),\dot{\gamma}(t)\right)=\theta(t)$ and $\ddot{\gamma}(t)\perp\dot{\gamma}(t)$ . Consider the following optimization problem,

[TABLE]

By assigning $\mathbf{a}=\frac{\gamma(t)}{\left\lVert\gamma(t)\right\rVert_{2}}$ and $\mathbf{b}=\frac{\dot{\gamma}(t)}{\left\lVert\dot{\gamma}(t)\right\rVert_{2}}$ , the optimal objective value is exactly the minimum of $\cos\angle\left(\ddot{\gamma}(t),\gamma\right)$ . Additionally, we can find the maximum of $\cos\angle\left(\ddot{\gamma}(t),\gamma\right)$ by replacing the minimization in (B.9) by maximization. We solve (B.9) by the Lagrangian method. More precisely, let

[TABLE]

We have the optimal solution $\mathbf{x}^{*}$ satisfying $\nabla_{x}\mathcal{L}=0$ , which implies $\mathbf{x}^{*}=\frac{1}{2\lambda^{*}}(\mathbf{a}-\mu^{*}\mathbf{b})$ with $\mu^{*}$ and $\lambda^{*}$ being the optimal dual variables. By the primal feasibility, we have $\mu^{*}=\mathbf{a}^{\top}\mathbf{b}$ and $\lambda^{*}=-\frac{1}{2}\sqrt{1-(\mathbf{a}^{\top}\mathbf{b})^{2}}$ . Therefore, the optimal objective value is $-\sqrt{1-(\mathbf{a}^{\top}\mathbf{b})^{2}}$ . Similarly, the maximum is $\sqrt{1-(\mathbf{a}^{\top}\mathbf{b})^{2}}$ . Noting that $\mathbf{a}^{\top}\mathbf{b}=\cos\theta(t)$ , we then get

[TABLE]
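The closed-form extremes $\pm\sqrt{1-(\mathbf{a}^{\top}\mathbf{b})^{2}}$ can be sanity-checked numerically. The sketch below assumes that the optimization problem reads: optimize $\mathbf{a}^{\top}\mathbf{x}$ over unit vectors $\mathbf{x}$ with $\mathbf{b}^{\top}\mathbf{x}=0$ , for unit vectors $\mathbf{a},\mathbf{b}$ that are not parallel (this matches the constraints $\left\lVert\ddot{\gamma}\right\rVert$ -normalized and $\ddot{\gamma}\perp\dot{\gamma}$ described above). Function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def extremes_on_orthogonal_sphere(a, b):
    """Minimizer, maximizer, min and max of a.x over unit x with b.x = 0.

    Assumes a and b are unit vectors with |a.b| < 1.
    """
    ab = dot(a, b)
    s = math.sqrt(1.0 - ab * ab)
    # Component of a orthogonal to b; its Euclidean norm equals s.
    residual = [ai - ab * bi for ai, bi in zip(a, b)]
    x_max = [ri / s for ri in residual]
    x_min = [-ri / s for ri in residual]
    return x_min, x_max, -s, s
```

The optimizer is simply the normalized projection of $\mathbf{a}$ onto the orthogonal complement of $\mathbf{b}$ , up to sign, which agrees with the form $\mathbf{x}^{*}\propto\mathbf{a}-(\mathbf{a}^{\top}\mathbf{b})\mathbf{b}$ obtained from the Lagrangian.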

Substituting into (B.8), we have the following lower bound

[TABLE]

Now combining with $\ddot{\ell}(t)=-\sin\theta(t)\dot{\theta}(t)$ , we can derive

[TABLE]

Inequality (B.10) has an important implication: when $\sin\theta(t)>\frac{r}{\tau}$ , $\theta(t)$ is monotone decreasing as $t$ increases, until $\sin\theta(t^{\prime})=\frac{r}{\tau}$ for some $t^{\prime}>t$ . Thus, we distinguish two cases depending on the value of $\theta(0)$ . Indeed, we only need to consider $\theta(0)\in[0,\pi/2]$ , because if $\theta(0)\in(\pi/2,\pi]$ , we only need to set the initial velocity in the opposite direction.

Case 1: $\theta(0)\in\left[0,\arcsin\frac{r}{\tau}\right]$ . We claim that $\theta(t)\in\left[0,\arcsin\frac{r}{\tau}\right]$ for all $t\leq T$ . In fact, suppose there exists some $t_{1}\leq T$ such that $\theta(t_{1})>\arcsin\frac{r}{\tau}$ . By the continuity of $\theta$ , there exists $t_{0}<t_{1}$ , such that $\theta(t_{0})=\arcsin\frac{r}{\tau}$ and $\theta(t)\geq\arcsin\frac{r}{\tau}$ for $t\in[t_{0},t_{1}]$ . This already gives us a contradiction:

[TABLE]

Therefore, we have $\dot{\ell}(t)\geq\cos\arcsin\frac{r}{\tau}=\sqrt{1-\frac{r^{2}}{\tau^{2}}}$ , and thus $T\leq\frac{\Delta}{r\sqrt{1-\frac{r^{2}}{\tau^{2}}}}$ .

Case 2: $\theta(0)\in\big(\arcsin\frac{r}{\tau},\pi/2\big]$ . It is enough to show that $\theta(0)$ can be bounded sufficiently away from $\pi/2$ . Let $\gamma_{\mathbf{c},\mathbf{x}}\subset\mathcal{M}$ be a geodesic from $\mathbf{c}_{i}$ to $\mathbf{x}$ . We analogously define $\theta_{\mathbf{c},\mathbf{x}}$ and $\ell_{\mathbf{c},\mathbf{x}}$ as for the geodesic from $\mathbf{x}$ to $\mathbf{y}$ . Let $T_{r/2}=\sup{\{t:\ell_{\mathbf{c},\mathbf{x}}(t)\leq r/2-\Delta/r\}}$ , and denote $\mathbf{z}=\gamma_{\mathbf{c},\mathbf{x}}(T_{r/2})$ . We must have $\theta_{\mathbf{c},\mathbf{x}}(T_{r/2})\in[0,\pi/2]$ and $\ell_{\mathbf{c},\mathbf{x}}(T_{r/2})=r/2-\Delta/r$ , otherwise there exists $T^{\prime}_{r/2}>T_{r/2}$ satisfying $\ell_{\mathbf{c},\mathbf{x}}(T^{\prime}_{r/2})\leq r/2-\Delta/r$ , contradicting the definition of $T_{r/2}$ . Let $T_{\mathbf{x}}$ be such that $\mathbf{x}=\gamma_{\mathbf{c},\mathbf{x}}(T_{\mathbf{x}})$ . We bound $\theta_{\mathbf{c},\mathbf{x}}(T_{\mathbf{x}})$ as follows,

[TABLE]

If there exists some $t\in(T_{r/2},T_{\mathbf{x}}]$ such that $\sin\theta_{\mathbf{c},\mathbf{x}}(t)\leq\frac{r}{\tau}$ , by the previous reasoning, we have $\sin\theta_{\mathbf{c},\mathbf{x}}(T_{\mathbf{x}})\leq\frac{r}{\tau}$ . Thus, we only need to handle the case when $\sin\theta_{\mathbf{c},\mathbf{x}}(t)>\frac{r}{\tau}$ for all $t\in(T_{r/2},T_{\mathbf{x}}]$ . In this case, $\theta_{\mathbf{c},\mathbf{x}}(t)$ is monotone decreasing, hence we further have

[TABLE]

The last inequality follows from $T_{\mathbf{x}}-T_{r/2}\geq r/2$ . Using the fact that $\sin x\geq\frac{2}{\pi}x$ for $x\in[0,\pi/2]$ , we can derive

[TABLE]

We can then set $\theta(0)=\theta_{\mathbf{c},\mathbf{x}}(T_{\mathbf{x}})$ , and thus

[TABLE]

Therefore, we have $T\leq\frac{\Delta}{r\cos\theta(0)}\leq\frac{\pi+1}{r(1-r/\tau)}\Delta$ . By the choice of $r\leq\tau/4$ , we immediately have $\frac{\tau}{\sqrt{\tau^{2}-r^{2}}}<\frac{\pi+1}{1-r/\tau}$ . Hence, combining case 1 and case 2, we conclude

[TABLE]
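The comparison between the Case 1 and Case 2 bounds used above can be verified directly. Writing $\rho=r/\tau\in(0,1/4]$ , a short worked derivation is:

```latex
\frac{\tau}{\sqrt{\tau^{2}-r^{2}}}
=\frac{1}{\sqrt{1-\rho^{2}}}
=\frac{1}{\sqrt{1-\rho}\,\sqrt{1+\rho}}
\leq\frac{1}{\sqrt{1-\rho}}
\leq\frac{1}{1-\rho}
<\frac{\pi+1}{1-\rho},
```

where the second-to-last step uses $\sqrt{1-\rho}\geq 1-\rho$ for $\rho\in[0,1]$ . In fact the argument only needs $\rho\in(0,1)$ , so the restriction $r\leq\tau/4$ is more than sufficient for this particular step.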

Therefore, the function value $f_{i}(\mathbf{x})$ on $\mathcal{K}_{i}$ is bounded by $\alpha_{i}\frac{\pi+1}{r(1-r/\tau)}\Delta$ . It suffices to set $c=\max_{i}\alpha_{i}b_{i}\left\lVert V_{i}\right\rVert_{2}$ , and we complete the proof. ∎

B.6 Characterization of the Size of the ReLU Network

Proof.

We evenly split the error $\epsilon$ into $3$ parts for $A_{i,1},A_{i,2}$ , and $A_{i,3}$ , respectively. We pick $\eta=\frac{\epsilon}{3C_{\mathcal{M}}}$ so that $\sum_{i=1}^{C_{\mathcal{M}}}A_{i,1}\leq\frac{\epsilon}{3}$ . The same argument yields $\delta=\frac{\epsilon}{3C_{\mathcal{M}}}$ . Analogously, we can choose $\Delta=\frac{r(1-r/\tau)\epsilon}{3c(\pi+1)C_{\mathcal{M}}}$ . Finally, we pick $\nu=\frac{\Delta}{16B^{2}D}$ so that $8B^{2}D\nu<\Delta$ .
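The error budget described above can be tabulated as follows. This is a minimal sketch with a hypothetical function name; the inputs `C_M`, `r`, `tau`, `c`, `B`, and `D` stand for $C_{\mathcal{M}}$ , $r$ , $\tau$ , the constant $c$ from Lemma 8, $B$ , and the ambient dimension $D$ , and must be supplied by the user.

```python
import math


def error_budget(eps, C_M, r, tau, c, B, D):
    """Split a target error eps into the tolerances eta, delta, Delta, nu."""
    eta = eps / (3 * C_M)      # product sub-network tolerance
    delta = eps / (3 * C_M)    # Taylor polynomial sub-network tolerance
    Delta = r * (1 - r / tau) * eps / (3 * c * (math.pi + 1) * C_M)
    nu = Delta / (16 * B * B * D)   # chosen so that 8 * B**2 * D * nu < Delta
    return {"eta": eta, "delta": delta, "Delta": Delta, "nu": nu}
```

With these choices each of the three error terms contributes at most $\epsilon/3$ after summing over the $C_{\mathcal{M}}$ charts, and $8B^{2}D\nu=\Delta/2<\Delta$ .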

Now we compute the number of layers, width, the number of neurons and weight parameters, and the range of each weight parameter in the ReLU network identified by Theorem 1.

For the chart determination sub-network, $\widehat{\mathds{1}}_{\Delta}$ can be implemented by a ReLU network with $\left\lceil\log\frac{r^{2}}{\Delta}\right\rceil$ layers and $2$ neurons in each layer. The weight parameters in the network is bounded by $O(\max\{\tau^{2},1\})$ . The approximation of the distance function $\widehat{d}_{i}^{2}$ can be implemented by a network of depth $O\left(\log\frac{1}{\nu}\right)$ , width bounded by a constant, and the number of neurons and weight parameters is at most $O\left(\log\frac{1}{\nu}\right)$ . Each weight parameter is bounded by $B$ . Plugging in our choice of $\nu$ and $\Delta$ , we have the depth is no greater than $c_{1}\left(\log\frac{1}{\epsilon}+\log D\right)$ with $c_{1}$ depending on $d,f,\tau$ , and the surface area of $\mathcal{M}$ . The number of neurons and weight parameters is also $c^{\prime}_{1}\left(\log\frac{1}{\epsilon}+\log D\right)$ except for a different constant. Note that there are $D$ parallel networks computing $\widehat{d}_{i}^{2}$ for $i=1,\dots,C_{\mathcal{M}}$ . Hence, the total number of neurons and weight parameters is $c^{\prime}_{1}C_{\mathcal{M}}D\left(\log\frac{1}{\epsilon}+\log D\right)$ with $c^{\prime}_{1}$ depending on $d,f,\tau$ , and the surface area of $\mathcal{M}$ . As can be seen, the width of the chart-determination network is bounded by $O(C_{\mathcal{M}}D)$ , and the weight parameter is bounded by $O(\max\{1,\tau^{2},B\})$ . 2. 2.

For the Taylor polynomial sub-network, $\phi_{i}$ can be implemented by a linear network with at most $Dd$ weight parameters. To implement each $\widehat{f}_{i}$ , we need a ReLU network of depth $c_{4}\log\frac{1}{\delta}$ . The number of neurons and weight parameters is $c^{\prime}_{4}\delta^{-\frac{d}{s+\alpha}}\log\frac{1}{\delta}$ , and the width is bounded by $c^{\prime\prime}_{4}\delta^{-\frac{d}{s+\alpha}}$ . Here $c_{4},c^{\prime}_{4},c^{\prime\prime}_{4}$ depend on $s,d,\tau,f_{i}\circ\phi_{i}^{-1}$ . In addition, all the weight parameters are bounded by the upper bound of the derivatives of $f_{i}\circ\phi_{i}^{-1}$ up to order $s$ (which scales as $\sqrt{d}$ as in Lemma 3). Substituting $\delta=\frac{\epsilon}{3C_{\mathcal{M}}}$ , the depth becomes $c_{2}\log\frac{1}{\epsilon}$ and the number of neurons and weight parameters becomes $c^{\prime}_{2}\epsilon^{-\frac{d}{s+\alpha}}\log\frac{1}{\epsilon}$ . There are $C_{\mathcal{M}}$ parallel $\widehat{f}_{i}$ ’s in total, hence the width is further bounded by $c^{\prime\prime}_{2}C_{\mathcal{M}}\epsilon^{-\frac{d}{s+\alpha}}$ . Meanwhile, the total number of neurons and weight parameters is $c^{\prime}_{2}C_{\mathcal{M}}\epsilon^{-\frac{d}{s+\alpha}}\log\frac{1}{\epsilon}$ . Here constants $c^{\prime}_{2}$ and $c^{\prime\prime}_{2}$ depend on $d,s,f_{i}\circ\phi_{i}^{-1},\tau$ , and the surface area of $\mathcal{M}$ .

For the product sub-network, the analysis is similar to the chart determination sub-network. The depth is $O\left(\log\frac{1}{\eta}\right)$ , the width is bounded by a constant, the number of neurons and weight parameters is $O\left(\log\frac{1}{\eta}\right)$ , and all the weight parameters are bounded by a constant. The choice of $\eta$ yields that the depth is $c_{3}\log\frac{1}{\epsilon}$ , and the number of neurons and weight parameters is $c^{\prime}_{3}\log\frac{1}{\epsilon}$ . There are $C_{\mathcal{M}}$ parallel pairs of outputs from the chart determination and the Taylor polynomial sub-networks. Hence, the total number of weight parameters is $c^{\prime}_{3}C_{\mathcal{M}}\log\frac{1}{\epsilon}$ with $c^{\prime}_{3}$ depending on $d,\tau$ , and the surface area of $\mathcal{M}$ .

Combining these 3 sub-networks, and redefining the constants $c_{1}$ , $c_{2}$ , $c_{3}$ and $c_{4}$ in the sequel, we obtain that the depth of the full network is $L=c_{1}\left(\log\frac{1}{\epsilon}+\log D\right)$ for some constant $c_{1}$ depending on $d,s,\tau$ , and the surface area of $\mathcal{M}$ . The width of the neural network is bounded by $p=c_{2}(\epsilon^{-\frac{d}{s+\alpha}}+D)$ with $c_{2}$ depending on $d,s,\tau$ , the surface area of $\mathcal{M}$ , and the upper bounds on derivatives of $\phi_{i}$ ’s and $\rho_{i}$ ’s, up to order $s$ . The total number of neurons and weight parameters is $K=c_{3}\left(\epsilon^{-\frac{d}{s+\alpha}}\log\frac{1}{\epsilon}+D\log\frac{1}{\epsilon}+D\log D\right)$ for some constant $c_{3}$ depending on $d,s,f,\tau$ , and the surface area of $\mathcal{M}$ . Lastly, all the weight parameters in the network are bounded by $c_{4}\max\{1,\tau^{2},B,\sqrt{d}\}$ with $c_{4}$ depending on the upper bound of derivatives of $\rho_{i}$ ’s up to order $s$ . ∎
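To see how these size bounds scale, the following sketch evaluates $L$ , $p$ , and $K$ with all unspecified constants set to $1$ . Setting $c_{1}=c_{2}=c_{3}=1$ , the function name `network_size`, and the example numbers are assumptions made purely for illustration; the text only states what the constants depend on.

```python
import math

def network_size(eps, d, s_plus_alpha, D, c1=1.0, c2=1.0, c3=1.0):
    """Return (depth L, width p, size K) up to the unspecified constants.

    c1, c2, c3 default to 1 for illustration only (an assumption).
    """
    log_inv = math.log(1.0 / eps)
    rate = eps ** (-d / s_plus_alpha)  # eps^{-d/(s+alpha)}
    L = c1 * (log_inv + math.log(D))
    p = c2 * (rate + D)
    K = c3 * (rate * log_inv + D * log_inv + D * math.log(D))
    return L, p, K

# Made-up example: intrinsic dimension d = 2, ambient dimension D = 1000.
L, p, K = network_size(eps=0.01, d=2, s_plus_alpha=2.0, D=1000)
print(f"L ~ {L:.1f}, p ~ {p:.0f}, K ~ {K:.0f}")

# For contrast: log10 of eps^{-D/(s+alpha)}, the term one would face if the
# exponent involved the ambient dimension D instead of d.
print(f"log10 of eps^(-D/(s+alpha)) = {(1000 / 2.0) * math.log10(1 / 0.01):.0f}")
```

The ambient dimension $D$ enters only additively and through $\log D$ , while the exponent of $\epsilon$ involves only the intrinsic dimension $d$ .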

Appendix C Proof of Statistical Recovery of ReLU Network (Theorem 2)

This section contains the detailed proofs, in Sections C.1, C.2, and C.3, respectively, for upper bounding the bias in Lemma 5, upper bounding the variance in Lemma 6, and upper bounding the covering number in Lemma 7. Lastly, the statistical bound in Theorem 2 is established in Section C.4 by choosing a proper approximation error and covering accuracy via the bias-variance trade-off argument.

C.1 Proof of Lemma 5

Proof.

$T_{1}$ essentially reflects the bias of estimating $f_{0}$ :

[TABLE]

where $(i)$ follows from $\mathbb{E}[\xi_{i}f_{0}(\mathbf{x}_{i})]=0$ due to the independence between $\xi_{i}$ and $\mathbf{x}$ , and $(ii)$ follows from Jensen’s inequality. Now we need to bound $\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\widehat{f}_{n}(\mathbf{x}_{i})\right]$ . We discretize the class $\mathcal{F}(R,\kappa,L,p,K)$ into $\mathcal{F}^{*}(R,\kappa,L,p,K)=\{f^{*}_{i}\}_{i=1}^{\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\left\lVert\cdot\right\rVert_{\infty})}$ , where $\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\left\lVert\cdot\right\rVert_{\infty})$ denotes the $\delta$ -covering number with respect to the $\ell_{\infty}$ norm. Accordingly, there exists $f^{*}$ such that $\|f^{*}-\widehat{f}_{n}\|_{\infty}\leq\delta$ . Denote $\|\widehat{f}_{n}-f_{0}\|_{n}^{2}=\frac{1}{n}\sum_{i=1}^{n}(\widehat{f}_{n}(\mathbf{x}_{i})-f_{0}(\mathbf{x}_{i}))^{2}$ . Then we have

[TABLE]

Here $(i)$ is obtained by applying Hölder’s inequality to $\xi_{i}(\widehat{f}_{n}(\mathbf{x}_{i})-f^{*}(\mathbf{x}_{i}))$ and invoking Jensen’s inequality:

[TABLE]

Step $(ii)$ holds, since by invoking the inequality $2ab\leq a^{2}+b^{2}$ , we have

[TABLE]

To bound the expectation term in (C.2), we first break the dependence between $f^{*}$ and the samples $(\mathbf{x}_{i},y_{i})$ . In detail, we replace $f^{*}$ by any $f^{*}_{j}$ in the $\delta$ -covering, and observe that $\frac{\sum_{i=1}^{n}\xi_{i}(f^{*}(\mathbf{x}_{i})-f_{0}(\mathbf{x}_{i}))}{\sqrt{n}\left\lVert f^{*}-f_{0}\right\rVert_{n}}\leq\max_{j}\frac{\sum_{i=1}^{n}\xi_{i}(f_{j}^{*}(\mathbf{x}_{i})-f_{0}(\mathbf{x}_{i}))}{\sqrt{n}\|f_{j}^{*}-f_{0}\|_{n}}$ . For notational simplicity, we denote $z_{j}=\frac{\sum_{i=1}^{n}\xi_{i}(f_{j}^{*}(\mathbf{x}_{i})-f_{0}(\mathbf{x}_{i}))}{\sqrt{n}\|f_{j}^{*}-f_{0}\|_{n}}$ . Applying the Cauchy-Schwarz inequality, we cast the expectation term in (C.2) as

[TABLE]

For given $\mathbf{x}_{1},\dots,\mathbf{x}_{n}$ , each term $z_{j}$ is sub-Gaussian with parameter $\sigma$ . Consequently, the last inequality (C.3) involves the maximum of a collection of squared sub-Gaussian random variables $z_{j}^{2}$ . Indeed, $z_{j}^{2}$ is sub-exponential for each $j$ . We can bound the maximum using the moment generating function: for any $t>0$ , we have

[TABLE]

Since $z_{1}$ is $\sigma^{2}$ -sub-Gaussian given $\mathbf{x}_{1},\dots,\mathbf{x}_{n}$ , we derive

[TABLE]

Taking $t=(3\sigma^{2})^{-1}$ and substituting into (C.4), we deduce $\mathbb{E}\left[\max_{j}~{}z_{j}^{2}~{}|~{}\mathbf{x}_{1},\dots,\mathbf{x}_{n}\right]$ is bounded by

[TABLE]
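The moment generating function argument can be illustrated in the exactly Gaussian special case. The sketch below assumes independent $z_{j}\sim N(0,\sigma^{2})$ , for which $\mathbb{E}[\exp(tz_{j}^{2})]=(1-2t\sigma^{2})^{-1/2}$ when $t<\frac{1}{2\sigma^{2}}$ ; with $t=(3\sigma^{2})^{-1}$ this gives $\mathbb{E}[\max_{j}z_{j}^{2}]\leq 3\sigma^{2}\log N+\frac{3}{2}\sigma^{2}\log 3$ for $N$ variables. The Gaussian assumption, the function names, and the resulting constants are choices made for this illustration and need not match the constants of the general sub-Gaussian bound in the proof.

```python
import math
import random

def mgf_bound(num_funcs, sigma):
    """Bound on E[max_j z_j^2] for z_j ~ N(0, sigma^2), using t = 1/(3 sigma^2).

    Relies on E[exp(t z^2)] = (1 - 2 t sigma^2)^(-1/2), which is exact only
    for Gaussian z (an assumption made for this illustration).
    """
    t = 1.0 / (3.0 * sigma**2)
    mgf = (1.0 - 2.0 * t * sigma**2) ** -0.5  # equals sqrt(3)
    return math.log(num_funcs * mgf) / t

def simulated_max(num_funcs, sigma, trials, seed=0):
    """Monte Carlo estimate of E[max_j z_j^2] with independent Gaussian z_j."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, sigma) ** 2 for _ in range(num_funcs))
    return total / trials

# Columns: number of functions, simulated mean of the maximum, MGF bound.
for n_cover in (10, 100, 1000):
    est = simulated_max(n_cover, sigma=1.0, trials=200)
    print(n_cover, round(est, 2), round(mgf_bound(n_cover, 1.0), 2))
```

Both columns grow logarithmically in the number of functions, which is why only the logarithm of the covering number enters the final bound.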

Combining (C.5), (C.3), (C.2), and substituting back into (C.1), we obtain the following implicit error estimation on $T_{1}$ :

[TABLE]

We denote $v=\sqrt{\mathbb{E}\left[\|\widehat{f}_{n}-f_{0}\|_{n}^{2}\right]}$ . Then the above implicit bound on $T_{1}$ implies

[TABLE]

Rearranging (C.6) for $a,b>0$ , we deduce $(v-a)^{2}\leq b+a^{2}$ . Some manipulation then yields $v^{2}\leq 4a^{2}+2b$ , which implies

[TABLE]

The proof is complete. ∎
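The "some manipulation" step in the proof above can be spelled out: $(v-a)^{2}\leq b+a^{2}$ gives $v\leq a+\sqrt{a^{2}+b}$ , and $(x+y)^{2}\leq 2x^{2}+2y^{2}$ then gives $v^{2}\leq 2a^{2}+2(a^{2}+b)=4a^{2}+2b$ . The short check below tests the extreme case $v=a+\sqrt{a^{2}+b}$ on random inputs; the helper name `worst_case_v` and the sampling ranges are illustrative assumptions.

```python
import math
import random

def worst_case_v(a, b):
    """Largest v >= 0 satisfying (v - a)^2 <= b + a^2."""
    return a + math.sqrt(a * a + b)

rng = random.Random(0)
for _ in range(10_000):
    a = rng.uniform(0.0, 10.0)
    b = rng.uniform(0.0, 10.0)
    v = worst_case_v(a, b)
    # (x + y)^2 <= 2 x^2 + 2 y^2 with x = a and y = sqrt(a^2 + b);
    # the small slack absorbs floating-point rounding.
    assert v * v <= 4 * a * a + 2 * b + 1e-9
```

Equality holds when $b=0$ , so the constants $4$ and $2$ cannot both be improved by this argument.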

C.2 Proof of Lemma 6

Proof.

Recall that we denote $\widehat{g}(\mathbf{x})=(\widehat{f}_{n}(\mathbf{x})-f_{0}(\mathbf{x}))^{2}$ . We rewrite $T_{2}$ as

[TABLE]

We lower bound $\int_{\mathcal{M}}\widehat{g}(\mathbf{x})d\mathcal{D}_{x}(\mathbf{x})$ by its second moment:

[TABLE]

The last inequality follows from $\left|\widehat{f}_{n}(\mathbf{x})-f_{0}(\mathbf{x})\right|\leq 2R$ . Now we cast $T_{2}$ into

[TABLE]

Introducing the second moment allows us to establish a fast convergence of $T_{2}$ . Specifically, we denote $\bar{\mathbf{x}}_{i}$ ’s as independent copies of $\mathbf{x}_{i}$ ’s following the same distribution. We also denote

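$$\mathcal{G}=\left\{g(\mathbf{x})=\left(f(\mathbf{x})-f_{0}(\mathbf{x})\right)^{2}~\middle|~f\in\mathcal{F}(R,\kappa,L,p,K)\right\}$$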

as the function class induced by $\mathcal{F}(R,\kappa,L,p,K)$ . Then we upper bound (C.7) as

[TABLE]

where $(i)$ follows from Jensen’s inequality and shorthand $\mathbb{E}_{\mathbf{x},\bar{\mathbf{x}}}[\cdot]$ denotes the expectation (double integral $\int_{\mathcal{M}}\int_{\mathcal{M}}\cdot d\mathcal{D}_{x}(\mathbf{x})d\mathcal{D}_{x}(\bar{\mathbf{x}})$ ) with respect to the joint distribution of $(\mathbf{x},\bar{\mathbf{x}})$ .

We discretize $\mathcal{G}$ with respect to the $\ell_{\infty}$ norm. The $\delta$ -covering number is denoted as $\mathcal{N}(\delta,\mathcal{G},\left\lVert\cdot\right\rVert_{\infty})$ and the elements in the covering are denoted as $\mathcal{G}^{*}=\left\{g^{*}_{i}\right\}_{i=1}^{\mathcal{N}(\delta,\mathcal{G},\left\lVert\cdot\right\rVert_{\infty})}$ , that is, for any $g\in\mathcal{G}$ , there exists a $g^{*}$ satisfying $\left\lVert g-g^{*}\right\rVert_{\infty}\leq\delta$ .

We replace $g\in\mathcal{G}$ by $g^{*}\in\mathcal{G}^{*}$ in bounding $T_{2}$ , which then boils down to deriving concentration results on a finite concept class. Specifically, for $g^{*}$ satisfying $\left\lVert g-g^{*}\right\rVert_{\infty}\leq\delta$ , we have

[TABLE]

We also have

[TABLE]

Plugging the above two items into (C.8), we upper bound $T_{2}$ as

[TABLE]

Denote $h_{j}(i)=g^{*}_{j}(\bar{\mathbf{x}}_{i})-g^{*}_{j}(\mathbf{x}_{i})$ . By symmetry, it is straightforward to see $\mathbb{E}[h_{j}(i)]=0$ . The variance of $h_{j}(i)$ is computed as

[TABLE]

The last inequality $(i)$ utilizes the inequality $(a-b)^{2}\leq 2(a^{2}+b^{2})$ . Therefore, we derive the following upper bound for $T_{2}$ ,

[TABLE]

We invoke the moment generating function to bound $T_{2}$ . Note that we have $\|h_{j}\|_{\infty}\leq(2R)^{2}$ . Then by Taylor expansion, for $0<t/n<\frac{3}{4R^{2}}$ and any $j$ , we have

[TABLE]

Step $(i)$ follows from the fact $1+x\leq\exp(x)$ for $x\geq 0$ . Given (C.2), we proceed to bound $T_{2}$ . To ease the presentation, we temporarily neglect $\left(4+\frac{1}{2R}\right)\delta$ term and denote $T^{\prime}_{2}=T_{2}-\left(4+\frac{1}{2R}\right)\delta$ . Then for $0<t/n<\frac{3}{4R^{2}}$ , we have

[TABLE]

Step $(i)$ follows from Jensen’s inequality, and step $(ii)$ invokes (C.2) for each $h(i)$ . We now choose $t$ so that $\frac{3t/n}{6-8tR^{2}/n}-\frac{1}{32R^{2}}=0$ , which yields $t=\frac{3n}{52R^{2}}<\frac{3n}{4R^{2}}$ . Substituting our choice of $t$ into $\exp(tT^{\prime}_{2}/2)$ , we have

[TABLE]
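The choice of $t$ above can be verified in exact rational arithmetic. In the sketch below, $u$ stands for $t/n$ ; the function name `balance_residual` and the test value of $R$ are illustrative assumptions.

```python
from fractions import Fraction

def balance_residual(u, R):
    """Value of 3u / (6 - 8 u R^2) - 1 / (32 R^2), where u = t / n."""
    return 3 * u / (6 - 8 * u * R**2) - Fraction(1) / (32 * R**2)

R = Fraction(5, 2)                # arbitrary bound R, used only for this check
u = Fraction(3) / (52 * R**2)     # claimed solution t / n = 3 / (52 R^2)
assert balance_residual(u, R) == 0
assert 0 < u < Fraction(3) / (4 * R**2)  # inside the admissible range for t / n
```

Solving by hand gives the same value: $96uR^{2}=6-8uR^{2}$ , hence $u=\frac{6}{104R^{2}}=\frac{3}{52R^{2}}$ .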

To complete the proof, we relate the covering number of $\mathcal{G}$ to that of $\mathcal{F}(R,\kappa,L,p,K)$ . Consider any $g_{1},g_{2}\in\mathcal{G}$ with $g_{1}=(f_{1}-f_{0})^{2}$ and $g_{2}=(f_{2}-f_{0})^{2}$ , respectively, for $f_{1},f_{2}\in\mathcal{F}(R,\kappa,L,p,K)$ . We can derive

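$$\left\lVert g_{1}-g_{2}\right\rVert_{\infty}=\sup_{\mathbf{x}\in\mathcal{M}}\left|f_{1}(\mathbf{x})-f_{2}(\mathbf{x})\right|\left|f_{1}(\mathbf{x})+f_{2}(\mathbf{x})-2f_{0}(\mathbf{x})\right|\leq 4R\left\lVert f_{1}-f_{2}\right\rVert_{\infty}.$$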

The above characterization immediately implies $\mathcal{N}(\delta,\mathcal{G},\left\lVert\cdot\right\rVert_{\infty})\leq\mathcal{N}(\delta/4R,\mathcal{F}(R,\kappa,L,p,K),\left\lVert\cdot\right\rVert_{\infty})$ . Therefore, we derive the desired upper bound on $T_{2}$ :

[TABLE]

∎

C.3 Proof of Lemma 7

Proof.

To construct a covering for $\mathcal{F}(R,\kappa,L,p,K)$ , we discretize each weight parameter by a uniform grid with grid size $h$ . Recall we write $f\in\mathcal{F}(R,\kappa,L,p,K)$ as $f=W_{L}\cdot\textrm{ReLU}(W_{L-1}\cdots\textrm{ReLU}(W_{1}\mathbf{x}+\mathbf{b}_{1})\dots+\mathbf{b}_{L-1})+\mathbf{b}_{L}$ . Let $f,f^{\prime}\in\mathcal{F}$ with all the weight parameters at most $h$ from each other. Denoting the weight matrices in $f,f^{\prime}$ as $W_{L},\dots,W_{1},\mathbf{b}_{L},\dots,\mathbf{b}_{1}$ and $W^{\prime}_{L},\dots,W^{\prime}_{1},\mathbf{b}^{\prime}_{L},\dots,\mathbf{b}^{\prime}_{1}$ , respectively, we bound the $\ell_{\infty}$ difference $\left\lVert f-f^{\prime}\right\rVert_{\infty}$ as

[TABLE]
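The effect of perturbing every weight by at most $h$ can be explored numerically. The sketch below builds small random ReLU networks of the stated form, perturbs all parameters by at most $h$ , and compares the output change with the quantity $hL(pB+2)(\kappa p)^{L-1}$ that this proof later sets equal to $\delta$ . The pure-Python helpers, the choice of input dimension equal to the width $p$ , and the specific values of $L,p,\kappa,B,h$ are assumptions for illustration only.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def affine(W, b, v):
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def forward(params, x):
    """Evaluate f(x) = W_L ReLU(... ReLU(W_1 x + b_1) ...) + b_L (scalar output)."""
    hidden = x
    for W, b in params[:-1]:
        hidden = relu(affine(W, b, hidden))
    W, b = params[-1]
    return affine(W, b, hidden)[0]

def random_params(L, p, kappa, rng):
    """L layers of width p (input dimension also p), entries in [-kappa, kappa]."""
    params = []
    for k in range(L):
        rows = 1 if k == L - 1 else p
        W = [[rng.uniform(-kappa, kappa) for _ in range(p)] for _ in range(rows)]
        b = [rng.uniform(-kappa, kappa) for _ in range(rows)]
        params.append((W, b))
    return params

def clip(v, kappa):
    return min(kappa, max(-kappa, v))

def perturb(params, h, kappa, rng):
    """Move every entry by at most h while staying inside [-kappa, kappa]."""
    return [([[clip(w + rng.uniform(-h, h), kappa) for w in row] for row in W],
             [clip(bi + rng.uniform(-h, h), kappa) for bi in b])
            for W, b in params]

rng = random.Random(0)
L, p, kappa, B, h = 3, 4, 1.0, 1.0, 1e-3
bound = h * L * (p * B + 2) * (kappa * p) ** (L - 1)
worst = 0.0
for _ in range(200):
    f = random_params(L, p, kappa, rng)
    g = perturb(f, h, kappa, rng)
    x = [rng.uniform(-B, B) for _ in range(p)]
    worst = max(worst, abs(forward(f, x) - forward(g, x)))
print(worst, bound)
```

Random draws typically sit far below the worst-case bound, which is expected: the bound has to hold uniformly over all weights and inputs.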

We derive the following bound on $\left\lVert W_{L-1}\cdots\textrm{ReLU}(W_{1}\mathbf{x}+\mathbf{b}_{1})\dots+\mathbf{b}_{L-1}\right\rVert_{\infty}$ :

[TABLE]

where $(i)$ is obtained by induction and $\left\lVert\mathbf{x}\right\rVert_{\infty}\leq B$ . The last inequality holds, since $\kappa p>1$ . Substituting back into the bound for $\left\lVert f-f^{\prime}\right\rVert_{\infty}$ , we have

[TABLE]

where $(i)$ is obtained by induction. We choose $h$ satisfying $hL(pB+2)(\kappa p)^{L-1}=\delta$ . Then discretizing each parameter uniformly into $2\kappa/h$ grid points yields a $\delta$ -covering on $\mathcal{F}$ . Note that there are ${Lp^{2}\choose K}\leq(Lp^{2})^{K}$ different choices of $K$ non-zero entries out of $Lp^{2}$ total weight parameters. Therefore, the covering number is upper bounded by

[TABLE]

∎
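The counting argument in this proof multiplies $(Lp^{2})^{K}$ sparsity patterns by $(2\kappa/h)^{K}$ grid choices, with $h=\delta/\left(L(pB+2)(\kappa p)^{L-1}\right)$ . The sketch below evaluates the logarithm of that product and compares it with the simplified form $K\log\left(2L^{2}(pB+2)\kappa^{L}p^{L+1}/\delta\right)$ obtained by substituting $h$ . The simplification is carried out here by direct algebra, and the function names and example values are illustrative assumptions.

```python
import math

def log_covering_bound(L, p, K, kappa, B, delta):
    """log of (L p^2)^K * (2 kappa / h)^K with h = delta / (L (pB+2) (kappa p)^(L-1))."""
    log_h = (math.log(delta) - math.log(L) - math.log(p * B + 2)
             - (L - 1) * math.log(kappa * p))
    log_grid_points = math.log(2 * kappa) - log_h  # log(2 kappa / h) per parameter
    log_supports = math.log(L) + 2 * math.log(p)   # log(L p^2) per non-zero entry
    return K * (log_supports + log_grid_points)

def log_closed_form(L, p, K, kappa, B, delta):
    """K * log(2 L^2 (pB+2) kappa^L p^(L+1) / delta), the same quantity simplified."""
    return K * (math.log(2) + 2 * math.log(L) + math.log(p * B + 2)
                + L * math.log(kappa) + (L + 1) * math.log(p) - math.log(delta))

# Made-up network sizes, chosen only to exercise the formulas.
args = dict(L=10, p=50, K=500, kappa=2.0, B=1.0, delta=1e-3)
assert math.isclose(log_covering_bound(**args), log_closed_form(**args))
print(round(log_covering_bound(**args), 1))
```

The logarithm of the covering number is linear in $K$ and in $L$ (up to logarithmic factors) and depends on $\delta$ only through $\log\frac{1}{\delta}$ .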

C.4 Proof of Theorem 2 — Bias-Variance Trade-off

Proof.

We recall the bias and variance decomposition of $\mathbb{E}\left[\int_{\mathcal{M}}\left(\widehat{f}_{n}(\mathbf{x})-f_{0}(\mathbf{x})\right)^{2}d\mathcal{D}_{x}(\mathbf{x})\right]$ as

[TABLE]

Combining the upper bounds on $T_{1}$ and $T_{2}$ in Lemmas 5 and 6, we can derive

[TABLE]

By our choice of $\mathcal{F}(R,\kappa,L,p,K)$ , there exists a network class which can yield a function $f$ satisfying $\left\lVert f-f_{0}\right\rVert_{\infty}\leq\epsilon$ for $\epsilon\in(0,1)$ . We will choose $\epsilon$ later for the bias-variance trade-off. Such a network consists of $L=\widetilde{O}\left(\log\frac{1}{\epsilon}\right)$ layers and $K=\widetilde{O}\left(\left(\epsilon^{-\frac{d}{s+\alpha}}+D\right)\log\frac{1}{\epsilon}\right)$ weight parameters. Invoking the upper bound of the covering number in Lemma 7, we derive

[TABLE]

Now we choose $\epsilon$ to satisfy $\epsilon^{2}=\frac{1}{n}\epsilon^{-\frac{d}{s+\alpha}}$ , which gives $\epsilon=n^{-\frac{s+\alpha}{d+2(s+\alpha)}}$ . It suffices to pick $\delta=\frac{1}{n}$ . Substituting both $\epsilon$ and $\delta$ into (C.10), we deduce the desired estimation error bound

[TABLE]

where the constant $c$ depends on $\log D$ , $d$ , $s$ , $\tau$ , $B$ , the surface area of $\mathcal{M}$ , and the upper bounds of derivatives of the coordinate systems $\phi_{i}$ ’s and partition of unity $\rho_{i}$ ’s, up to order $s$ . ∎
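The balancing step can be checked numerically. The sketch below solves $\epsilon^{2}=\frac{1}{n}\epsilon^{-\frac{d}{s+\alpha}}$ and reports the exponent of $n$ in $\epsilon^{2}$ , which is the polynomial part of the rate $n^{-\frac{2(s+\alpha)}{2(s+\alpha)+d}}\log^{3}n$ stated in the abstract. The function names and the sample values of $n$ , $s+\alpha$ , and $d$ are illustrative assumptions.

```python
import math

def balanced_epsilon(n, s_plus_alpha, d):
    """Solve eps^2 = eps^(-d/(s+alpha)) / n for eps."""
    return n ** (-s_plus_alpha / (d + 2 * s_plus_alpha))

def rate_exponent(s_plus_alpha, d):
    """Exponent r such that eps^2 = n^(-r) at the balanced choice of eps."""
    return 2 * s_plus_alpha / (2 * s_plus_alpha + d)

n, s_plus_alpha = 10_000, 2.0
for dim in (2, 10, 100):
    eps = balanced_epsilon(n, s_plus_alpha, dim)
    lhs = eps**2
    rhs = eps ** (-dim / s_plus_alpha) / n
    assert math.isclose(lhs, rhs, rel_tol=1e-9)  # both sides of the balance agree
    print(dim, round(rate_exponent(s_plus_alpha, dim), 3))
```

The exponent degrades as the dimension in the formula grows, which is why replacing the ambient dimension $D$ by the intrinsic dimension $d$ matters for the rate.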

Bibliography

Aamari, E., Kim, J., Chazal, F., Michel, B., Rinaldo, A. and Wasserman, L. (2019). Estimating the reach of a manifold. Electron. J. Stat., 13 1359–1399.
Allard, W. K., Chen, G. and Maggioni, M. (2012). Multi-scale geometric methods for data sets II: Geometric multi-resolution analysis. Appl. Comput. Harmon. Anal., 32 435–462.
Altman, N. S. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. Amer. Statist., 46 175–185.
Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G. et al. (2016). Deep speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning. PMLR.
Bahdanau, D., Cho, K. and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Barron, A. R. (1991). Complexity regularization with application to artificial neural networks. In Nonparametric functional estimation and related topics. Springer, 561–576.
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory, 39 930–945.
Bickel, P. J. and Li, B. (2007). Local polynomial regression on unknown manifolds. Lecture Notes-Monograph Series, 54 177–186.
