Nonparametric Regression on Low-Dimensional Manifolds using Deep ReLU Networks : Function Approximation and Statistical Recovery
Minshuo Chen, Haoming Jiang, Wenjing Liao, Tuo Zhao

TL;DR
This paper demonstrates that deep ReLU networks can effectively perform nonparametric regression on data supported on low-dimensional manifolds, achieving fast convergence rates that depend on the intrinsic dimension rather than the ambient space.
Contribution
The paper introduces a deep ReLU network architecture for nonparametric regression on manifolds and proves its convergence rate depends on the intrinsic dimension, showing adaptivity to geometric structures.
Findings
Convergence rate of $n^{-\frac{2(s+\alpha)}{2(s+\alpha)+d}}$, up to logarithmic factors, where $d$ is the intrinsic dimension.
Deep ReLU networks adapt to low-dimensional manifold structures in high-dimensional data.
Theoretical analysis supports the effectiveness of deep networks for geometric data approximation.
Minshuo Chen, Haoming Jiang, Wenjing Liao, Tuo Zhao (alphabetical order). Minshuo Chen, Haoming Jiang, and Tuo Zhao are affiliated with the School of Industrial and Systems Engineering at Georgia Tech; Wenjing Liao is affiliated with the School of Mathematics at Georgia Tech. Email: {mchen393, hmjiang, tourzhao, wliao60}@gatech.edu.
Abstract
Real world data often exhibit low-dimensional geometric structures, and can be viewed as samples near a low-dimensional manifold. This paper studies nonparametric regression of Hölder functions on low-dimensional manifolds using deep ReLU networks. Suppose $n$ training data are sampled from a Hölder function in $\mathcal{H}^{s,\alpha}$ supported on a $d$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^D$, with sub-gaussian noise. A deep ReLU network architecture is designed to estimate the underlying function from the training data. The mean squared error of the empirical estimator is proved to converge in the order of $n^{-\frac{2(s+\alpha)}{2(s+\alpha)+d}}\log^3 n$. This result shows that deep ReLU networks give rise to a fast convergence rate depending on the data intrinsic dimension $d$, which is usually much smaller than the ambient dimension $D$. It therefore demonstrates the adaptivity of deep ReLU networks to low-dimensional geometric structures of data, and partially explains the power of deep ReLU networks in tackling high-dimensional data with low-dimensional geometric structures.
1 Introduction
Deep learning has made astonishing breakthroughs in various real-world applications, such as computer vision (Krizhevsky et al., 2012; Goodfellow et al., 2014; Long et al., 2015), natural language processing (Graves et al., 2013; Bahdanau et al., 2014; Young et al., 2018), healthcare (Miotto et al., 2017; Jiang et al., 2017), robotics (Gu et al., 2017), etc. For example, in image classification, the winner of the ImageNet challenge retained a top- error rate of (Hu et al., 2018), while the data set consists of about million labeled high resolution images in categories. In speech recognition, Amodei et al. (2016) reported that deep neural networks outperformed humans with a word error rate on the LibriSpeech corpus constructed from audio books (Panayotov et al., 2015). Such a data set consists of approximately hours of kHz read English speech from audio books.
The empirical success of deep learning brings new challenges to the conventional wisdom of machine learning. Data sets in these applications are in high-dimensional spaces. In existing literature, a minimax lower bound has been established for the optimal algorithm of learning functions in (Györfi et al., 2006; Tsybakov, 2008). Denote the underlying function by . The minimax lower bound suggests a pessimistic sample complexity: To obtain an estimator for each function with an -error, uniformly for all functions (i.e., with denoting the function norm), the optimal algorithm requires the sample size in the worst scenario (i.e., when is the most difficult for the algorithm to estimate). We instantiate such a sample complexity bound to the ImageNet data set, which consists of RGB images with a resolution of . The theory above suggests that, to achieve an -error, the number of samples has to scale as , where the smoothness parameter is significantly smaller than . Setting already gives rise to a huge number of samples far beyond practical applications, which well exceeds million labeled images in ImageNet.
To bridge the aforementioned gap between theory and practice, we take the low-dimensional geometric structures in data sets into consideration. This is motivated by the fact that real-world data sets often exhibit low-dimensional structures. Many images consist of projections of a three-dimensional object followed by some transformations, such as rotation, translation, and skeleton. This generating mechanism induces a small number of intrinsic parameters (Hinton and Salakhutdinov, 2006; Osher et al., 2017). Speech data are composed of words and sentences following the grammar, and therefore have small degrees of freedom (Djuric et al., 2015). More broadly, visual, acoustic, textual, and many other types of data often have low-dimensional geometric structures due to rich local regularities, global symmetries, repetitive patterns, or redundant sampling (Tenenbaum et al., 2000; Roweis and Saul, 2000; Coifman et al., 2005; Allard et al., 2012). It is therefore reasonable to assume that data lie on a manifold $\mathcal{M}$ of dimension $d \ll D$.
1.1 Summary of main results
In this paper, we study nonparametric regression problems (Wasserman, 2006; Györfi et al., 2006; Tsybakov, 2008) using neural networks in exploitation of low-dimensional geometric structures of data. Specifically, we model data as samples from a probability measure supported on a $d$-dimensional Riemannian manifold $\mathcal{M}$ isometrically embedded in $\mathbb{R}^D$ where $d \ll D$. The goal is to recover the regression function $f_0$ supported on $\mathcal{M}$ using the samples $S_n = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ with $\mathbf{x}_i \in \mathcal{M}$ and $y_i \in \mathbb{R}$. The $\mathbf{x}_i$'s are i.i.d. sampled from a distribution $\mathcal{D}_x$ on $\mathcal{M}$, and the response $y_i$ satisfies
$$y_i = f_0(\mathbf{x}_i) + \xi_i, \tag{1}$$
where the $\xi_i$'s are i.i.d. sub-Gaussian noise independent of the $\mathbf{x}_i$'s.
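We use multi-layer ReLU (Rectified Linear Unit) neural networks to recover $f_0$. ReLU networks are widely used in computer vision, speech recognition, natural language processing, etc. (Nair and Hinton, 2010; Glorot et al., 2011; Maas et al., 2013). These networks can ease the notorious vanishing gradient issue during training, which commonly arises with sigmoid or hyperbolic tangent activations (Glorot et al., 2011; Goodfellow et al., 2016). Given an input $\mathbf{x}$, an $L$-layer ReLU neural network computes an output as
$$f(\mathbf{x}) = W_L \cdot \mathrm{ReLU}\big(W_{L-1} \cdots \mathrm{ReLU}(W_1 \mathbf{x} + \mathbf{b}_1) \cdots + \mathbf{b}_{L-1}\big) + \mathbf{b}_L, \tag{2}$$
where $W_1, \dots, W_L$ and $\mathbf{b}_1, \dots, \mathbf{b}_L$ are weight matrices and vectors of proper sizes, respectively, and $\mathrm{ReLU}(\cdot)$ denotes the entrywise rectified linear unit (i.e., $\mathrm{ReLU}(a) = \max\{0, a\}$). We denote $\mathcal{F}$ as a class of neural networks with bounded weight parameters and bounded output (we refer to $\mathcal{F}$ as a ReLU network structure throughout the rest of the paper):
$$\mathcal{F}(R, \kappa, L, p, K) = \Big\{ f \;\Big|\; f \text{ has the form (2) with } L \text{ layers and width bounded by } p,\ \|f\|_\infty \le R,\ \|W_i\|_{\infty,\infty} \le \kappa,\ \|\mathbf{b}_i\|_\infty \le \kappa \text{ for } i = 1, \dots, L,\ \sum_{i=1}^L \big(\|W_i\|_0 + \|\mathbf{b}_i\|_0\big) \le K \Big\},$$
where $\|\cdot\|_0$ denotes the number of nonzero entries in a vector or a matrix, and $\|\cdot\|_\infty$ denotes the $\ell_\infty$ norm of a function or the entrywise $\ell_\infty$ norm of a vector. For a matrix $M$, we have $\|M\|_{\infty,\infty} = \max_{i,j} |M_{ij}|$.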
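As an illustration of the definitions above, the following minimal sketch evaluates a multi-layer ReLU network (alternating affine maps and entrywise ReLU, with a final affine layer) and reports the quantities that the network class constrains: depth, width, largest parameter magnitude, and number of nonzero parameters. The helper names `forward` and `structure_stats` and the toy weights are illustrative assumptions, not part of the paper.

```python
import numpy as np

def relu(a):
    # Entrywise rectified linear unit, max{0, a}.
    return np.maximum(a, 0.0)

def forward(weights, biases, x):
    """Evaluate W_L ReLU(... ReLU(W_1 x + b_1) ...) + b_L for a single input x."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    return weights[-1] @ h + biases[-1]

def structure_stats(weights, biases):
    """Return (depth, width, largest |parameter|, number of nonzero parameters)."""
    depth = len(weights)
    width = max(W.shape[0] for W in weights)
    kappa = float(max(max(np.abs(W).max(), np.abs(b).max())
                      for W, b in zip(weights, biases)))
    nonzeros = int(sum(np.count_nonzero(W) + np.count_nonzero(b)
                       for W, b in zip(weights, biases)))
    return depth, width, kappa, nonzeros

# Hypothetical toy network: 2 inputs -> 3 hidden units -> 1 output.
weights = [np.array([[1.0, -1.0], [0.0, 2.0], [0.0, 0.0]]),
           np.array([[1.0, 0.5, 0.0]])]
biases = [np.array([0.0, -1.0, 0.0]), np.array([0.25])]

x = np.array([2.0, 1.0])
y = forward(weights, biases, x)
stats = structure_stats(weights, biases)
```

A network belongs to the class with parameters $(R, \kappa, L, p, K)$ when these four statistics, together with the sup norm of its output, stay below the respective bounds; the sparsity count is what distinguishes a sparsely connected architecture from a dense one of the same depth and width.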
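To obtain an estimator $\hat{f}_n$ of $f_0$, we minimize the empirical quadratic risk
$$\hat{f}_n = \arg\min_{f \in \mathcal{F}(R, \kappa, L, p, K)} \frac{1}{n} \sum_{i=1}^n \big(f(\mathbf{x}_i) - y_i\big)^2. \tag{3}$$
The subscript $n$ emphasizes that the estimator is obtained using $n$ pairs of samples. Our theory shows that $\hat{f}_n$ converges to $f_0$ at a fast rate depending on the intrinsic dimension $d$, under some mild regularity conditions. We assume $f_0$ is an $(s+\alpha)$-Hölder function on $\mathcal{M}$, where $s > 0$ is an integer and $\alpha \in (0, 1]$. For the network class $\mathcal{F}(R, \kappa, L, p, K)$, we choose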
[TABLE]
and set as a constant depending on , , and . Here we use to hide factors depending on and logarithmic factors (e.g., ). Then the empirical minimizer of (3) gives rise to
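$$\mathbb{E}\left[\int_{\mathcal{M}} \big(\hat{f}_n(\mathbf{x}) - f_0(\mathbf{x})\big)^2 \, d\mathcal{D}_x(\mathbf{x})\right] \le c\,(R^2 + \sigma^2)\, n^{-\frac{2(s+\alpha)}{2(s+\alpha)+d}} \log^3 n,$$
where the expectation is taken over the training samples $S_n$, $\sigma^2$ is the variance proxy of the sub-Gaussian noise $\xi_i$, and $c$ is a constant that does not depend on $n$ (see a formal statement in Theorem 2).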
Our theory implies that, in order to estimate an $(s+\alpha)$-Hölder function up to an $\epsilon$-error, the sample complexity is $n \gtrsim \epsilon^{-\frac{2(s+\alpha)+d}{s+\alpha}}$ up to a log factor. This sample complexity depends on the intrinsic dimension $d$, and thus largely improves on existing theories of nonparametric regression using neural networks, where the sample complexity scales as $\epsilon^{-\frac{2(s+\alpha)+D}{s+\alpha}}$ (Hamers and Kohler, 2006; Kohler and Krzyżak, 2005, 2016; Kohler and Mehnert, 2011; Schmidt-Hieber, 2017). Our result partially explains the success of deep ReLU neural networks in tackling high-dimensional data with low-dimensional geometric structures.
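The gap between the two sample complexities can be made concrete with a short calculation. The sketch below assumes that an $\epsilon$-error means a root mean squared error of $\epsilon$, so that a rate of $n^{-2b/(2b+m)}$ for the mean squared error (with $b$ the Hölder smoothness and $m$ the relevant dimension) translates into $n \approx \epsilon^{-(2b+m)/b}$; constants and logarithmic factors are dropped. The function names and the numerical values ($b = 2$, intrinsic dimension 10, ambient dimension 1000) are hypothetical.

```python
import math

def sample_exponent(b, dim):
    """Exponent e such that n scales like eps**(-e) when MSE ~ n**(-2b / (2b + dim))."""
    return (2.0 * b + dim) / b

def log10_samples(eps, b, dim):
    """log10 of the sample size needed for error eps (constants and logs ignored)."""
    return sample_exponent(b, dim) * math.log10(1.0 / eps)

# Hypothetical values: smoothness b = s + alpha = 2, intrinsic d = 10, ambient D = 1000.
intrinsic = log10_samples(0.1, 2.0, 10)    # order of magnitude driven by d
ambient = log10_samples(0.1, 2.0, 1000)    # order of magnitude driven by D
```

Working in $\log_{10}$ avoids floating-point overflow: with these hypothetical values the ambient-dimension count has hundreds of digits, while the intrinsic-dimension count stays within a range that real data sets can reach.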
An ingredient in our analysis is an efficient universal approximation theory of deep ReLU networks for $(s+\alpha)$-Hölder functions on $\mathcal{M}$ (Theorem 1). A preliminary version of the approximation theory appeared in Chen et al. (2019). Specifically, we show that, in order to uniformly approximate $(s+\alpha)$-Hölder functions on a $d$-dimensional manifold with an $\epsilon$-error, the network consists of at most $O(\log\frac{1}{\epsilon} + \log D)$ layers and $O(\epsilon^{-\frac{d}{s+\alpha}} \log\frac{1}{\epsilon} + D \log\frac{1}{\epsilon} + D \log D)$ neurons and weight parameters (see Theorem 1). The network size in our approximation theory weakly depends on the data dimension $D$, which significantly improves on existing universal approximation theories of neural networks (Barron, 1993; Mhaskar, 1996; Lu et al., 2017; Hanin, 2017; Yarotsky, 2017), where the network size scales as $\epsilon^{-\frac{D}{s+\alpha}}$. Figure 1 illustrates a huge gap between the network sizes used in practice (Tan and Le, 2019) and the required size predicted by existing theories, e.g., Yarotsky (2017) for the ImageNet data set. Our approximation theory partially bridges this gap by exploiting the data intrinsic geometric structures, and justifies why neural networks of moderate size have achieved a great success in various applications. Meanwhile, our network size also matches its lower bound up to logarithmic factors for a given manifold $\mathcal{M}$ (see Proposition 2).
1.2 Related Work
Nonparametric regression has been widely studied in statistics. A variety of methods has been proposed to estimate the regression function, including kernel methods, wavelets, splines, and local polynomials (Wahba, 1990; Altman, 1992; Fan and Gijbels, 1996; Tsybakov, 2008; Györfi et al., 2006). Nonetheless, there was limited study on regression using deep ReLU networks until recently. The earliest works focused on neural networks with a single hidden layer and smooth activations (e.g., sigmoidal and sinusoidal functions (Barron, 1991; McCaffrey and Gallant, 1994)). Later results achieved the minimax lower bound for the mean squared error in the order of $n^{-\frac{2s}{2s+D}}$ up to a logarithmic factor for $C^s$ functions in $\mathbb{R}^D$ (Hamers and Kohler, 2006; Kohler and Krzyżak, 2005, 2016; Kohler and Mehnert, 2011). Theories for deep ReLU networks were developed in Schmidt-Hieber (2017), where the estimate matches the minimax lower bound up to a logarithmic factor for Hölder functions. Extensions to more general function spaces, such as Besov spaces, can be found in Suzuki (2019), and results for classification problems can be found in Kim et al. (2018); Ohn and Kim (2019).
The rate of convergence in the results above cannot fully explain the success of deep learning due to the curse of the data dimension with a large $D$. Fortunately, many real-world data sets exhibit low-dimensional geometric structures. It has been demonstrated that some classical methods are adaptive to the low-dimensional structures of data sets, and perform as well as if the low-dimensional structures were known. Results in this direction include local linear regression (Bickel and Li, 2007; Cheng and Wu, 2013), multiscale polynomial regression (Liao et al., 2021), $k$-nearest neighbor (Kpotufe, 2011), kernel regression (Kpotufe and Garg, 2013), and Bayesian Gaussian process regression (Yang et al., 2015), where optimal rates depending on the intrinsic dimension were proved for functions having the second order of continuity (Bickel and Li, 2007), globally Lipschitz functions (Kpotufe, 2011), and Hölder functions with Hölder index no more than $1$ (Kpotufe and Garg, 2013).
Recently, several independent works (Schmidt-Hieber, 2019; Nakada and Imaizumi, 2020; Cloninger and Klock, 2020) justified the adaptability of deep neural networks to the low-dimensional data structures. Schmidt-Hieber (2019) considered function approximation and regression of Hölder functions on a low-dimensional manifold, which is similar to the setup in this paper. The proofs in Schmidt-Hieber (2019) and this paper both utilize a collection of charts to map each point on into a local coordinate in , and then approximate functions in . There are two differences in the detailed proof: (1) In exploitation of a positive reach property of , we construct local coordinates on the manifold given by orthogonal projections onto the tangent spaces, while Schmidt-Hieber (2019) assumed the existence of smooth local coordinates; (2) A main novelty of our work is to explicitly construct a chart determination sub-network which assigns each data point to its proper chart. In Schmidt-Hieber (2019), the chart determination is realized by the partition of unity. In order to approximate functions in , Schmidt-Hieber (2019) required a uniform upper bound on the derivatives of each coordinate map and each function in the partition of unity, up to order . Our proof does not rely on such regularity conditions depending on the ambient dimension . To describe the intrinsic dimensionality of data, Nakada and Imaizumi (2020) applied the notion of Minkowski dimension, which can be defined for a broader class of sets without smoothness restrictions. The intrinsic dimension of manifolds and the Minkowski dimension are different notions for low-dimensional sets, and one does not naturally imply the other. Schmidt-Hieber (2019) and Nakada and Imaizumi (2020) established a convergence rate of the mean squared error for learning functions in , where is the manifold dimension in Schmidt-Hieber (2019) and Minkowski dimension in Nakada and Imaizumi (2020), respectively. 
Recently Cloninger and Klock (2020) studied the approximation and regression error of ReLU neural networks for a class of functions in the form of , where is near the low-dimensional manifold , is a projection onto , and is a Hölder function on .
A crucial ingredient in the statistical analysis of neural networks is the universal approximation ability of neural networks. Early works in literature justified the existence of two-layer networks with continuous sigmoidal activations (a function is sigmoidal, if as , and as ) for a universal approximation of continuous functions in a unit hypercube (Irie and Miyake, 1988; Funahashi, 1989; Cybenko, 1989; Hornik, 1991; Chui and Li, 1992; Leshno et al., 1993). In these works, the number of neurons was not explicitly given. Later, Barron (1993); Mhaskar (1996) proved that the number of neurons can grow as where is the uniform approximation error. Recently, Lu et al. (2017); Hanin (2017) and Daubechies et al. (2019) extended the universal approximation theory to networks of bounded width with ReLU activations. The depth of such networks grows exponentially with respect to the dimension of data. Yarotsky (2017) showed that ReLU neural networks can uniformly approximate functions in Sobolev spaces, where the network size scales exponentially with respect to the data dimension and matches the lower bound. Zhou (2019) also developed a universal approximation theory for deep convolutional neural networks (Krizhevsky et al., 2012), where the depth of the network scales exponentially with respect to the data dimension.
The aforementioned results focus on functions on a compact subset (e.g., ) in . Function approximation on manifolds has been well studied using classical methods, such as local polynomials (Bickel and Li, 2007) and wavelets (Coifman and Maggioni, 2006). However, studies using neural networks are limited. Two noticeable works are Chui and Mhaskar (2016) and Shaham et al. (2018). In Chui and Mhaskar (2016), high order differentiable functions on manifolds are approximated by neural networks with smooth activations, e.g., sigmoid activations and rectified quadratic unit functions (). These smooth activations are not commonly used in mainstream applications such as computer vision (Krizhevsky et al., 2012; Long et al., 2015; Hu et al., 2018). In Shaham et al. (2018), a -layer network with ReLU activations was proposed to approximate functions on low-dimensional manifolds. This theory does not cover arbitrarily functions. We are also aware of a concurrent work of ours, Shen et al. (2019), which established an approximation theory of ReLU networks for Hölder functions in terms of a modulus of continuity. When the target function belongs to the Hölder class supported in a neighborhood of a -dimensional manifold embedded in , Shen et al. (2019) constructed a ReLU network which yields an approximation error in the order of where and are the width and depth of the network, and . Their proof utilizes a different approach compared to ours: They first construct a piecewise constant function to approximate the target function, and then implement the piecewise constant function using a ReLU network. The higher order smoothness for functions while is not exploited due to the use of piecewise constant approximations.
1.3 Roadmap and Notations
The rest of the paper is organized as follows: Section 2 presents a brief introduction to manifolds and functions on manifolds. Section 3 presents a statistical estimation theory of functions on low-dimensional manifolds using deep ReLU neural networks, and a universal approximation theory; Section 4 sketches the proof of the approximation theory. Section 5 sketches the proof of the statistical estimation theory in Section 3, and the detailed proofs are deferred to Appendix; Section 6 provides a conclusion of the paper.
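We use bold-faced letters to denote vectors, and normal font letters with a subscript to denote its coordinate, e.g., $\mathbf{x} \in \mathbb{R}^d$ and $x_k$ being the $k$-th coordinate of $\mathbf{x}$. Given a vector $\mathbf{s} = [s_1, \dots, s_d]^\top \in \mathbb{N}^d$, we define $\mathbf{s}! = \prod_{i=1}^d s_i!$ and $|\mathbf{s}| = \sum_{i=1}^d s_i$. We define $\mathbf{x}^{\mathbf{s}} = \prod_{i=1}^d x_i^{s_i}$. Given a function $f: \mathbb{R}^d \to \mathbb{R}$, we denote its derivative as $D^{\mathbf{s}} f = \frac{\partial^{|\mathbf{s}|} f}{\partial x_1^{s_1} \cdots \partial x_d^{s_d}}$, and its $\ell_\infty$ norm as $\|f\|_\infty = \max_{\mathbf{x}} |f(\mathbf{x})|$. We use $\circ$ to denote the composition operator.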
2 Preliminaries
We briefly review manifolds, partition of unity, and function spaces defined on smooth manifolds. Details can be found in Tu (2010) and Lee (2006). Let $\mathcal{M}$ be a $d$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^D$.
Definition 1** (Chart).**
A chart for $\mathcal{M}$ is a pair $(U, \phi)$ such that $U \subset \mathcal{M}$ is open and $\phi: U \to \mathbb{R}^d$, where $\phi$ is a homeomorphism (i.e., bijective, $\phi$ and $\phi^{-1}$ are both continuous).
The open set $U$ is called a coordinate neighborhood, and $\phi$ is called a coordinate system on $U$. A chart essentially defines a local coordinate system on $\mathcal{M}$. Given a suitable coordinate neighborhood $U$ around a point $\mathbf{c}$ on the manifold $\mathcal{M}$, we denote $P_{\mathbf{c}}$ as the orthogonal projection onto the tangent space at $\mathbf{c}$, which gives a particular coordinate system on $U$.
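Example 1** (Projection to Tangent Space).**
Let $T_{\mathbf{c}}(\mathcal{M})$ be the tangent space of $\mathcal{M}$ at the point $\mathbf{c} \in \mathcal{M}$ (see the formal definition in Tu (2010, Section 8.1)). We denote $\{\mathbf{v}_1, \dots, \mathbf{v}_d\}$ as an orthonormal basis of $T_{\mathbf{c}}(\mathcal{M})$. Then the orthogonal projection onto the tangent space is defined as $P_{\mathbf{c}}(\mathbf{x}) = V^\top(\mathbf{x} - \mathbf{c})$ for $\mathbf{x} \in U$ with $V = [\mathbf{v}_1, \dots, \mathbf{v}_d] \in \mathbb{R}^{D \times d}$.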
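A small numerical sketch may help fix ideas. It assumes the projection takes the usual form $P(\mathbf{x}) = V^\top(\mathbf{x} - \mathbf{c})$, where the columns of $V$ are an orthonormal basis of the tangent space at $\mathbf{c}$, and applies it to the unit sphere in $\mathbb{R}^3$ at the north pole. The function name `tangent_projection` and the chosen point are illustrative assumptions.

```python
import numpy as np

def tangent_projection(x, c, V):
    """Local coordinates of x: coefficients of (x - c) in the tangent basis V."""
    return V.T @ (x - c)

# Unit sphere in R^3, base point c = north pole; the tangent plane is the xy-plane.
c = np.array([0.0, 0.0, 1.0])
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])          # columns: orthonormal basis of the tangent space

x = np.array([0.6, 0.0, 0.8])       # a point on the sphere near the pole
coords = tangent_projection(x, c, V)
```

The projection discards the component of $\mathbf{x} - \mathbf{c}$ normal to the manifold at $\mathbf{c}$, which is why it is invertible only on a sufficiently small neighborhood of $\mathbf{c}$.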
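We say two charts $(U, \phi)$ and $(V, \psi)$ on $\mathcal{M}$ are $C^k$ compatible if and only if the transition functions,
$$\phi \circ \psi^{-1}: \psi(U \cap V) \to \phi(U \cap V) \quad \text{and} \quad \psi \circ \phi^{-1}: \phi(U \cap V) \to \psi(U \cap V),$$
are both $C^k$.
Definition 2** ($C^k$ Atlas).**
A $C^k$ atlas for $\mathcal{M}$ is a collection $\{(U_i, \phi_i)\}_{i \in \mathcal{A}}$ of pairwise $C^k$ compatible charts such that $\bigcup_{i \in \mathcal{A}} U_i = \mathcal{M}$.
Definition 3** (Smooth Manifold).**
A smooth manifold is a manifold $\mathcal{M}$ together with a $C^\infty$ atlas.
Classical examples of smooth manifolds are the Euclidean space $\mathbb{R}^D$, the torus, and the unit sphere. We further define a Riemannian manifold as a pair $(\mathcal{M}, g)$, where $\mathcal{M}$ is a smooth manifold and $g$ is a Riemannian metric (Lee, 2018, Chapter 2). To better interpret Definitions 2 and 3, we give an example of a $C^\infty$ atlas on the unit sphere in $\mathbb{R}^3$.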
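Example 2**.**
We denote $S^2$ as the unit sphere in $\mathbb{R}^3$, i.e., $S^2 = \{(x, y, z) : x^2 + y^2 + z^2 = 1\}$. The following atlas of $S^2$ consists of $6$ overlapping charts corresponding to hemispheres:
$$\begin{aligned}
U_1 &= \{(x,y,z) \in S^2 : x > 0\}, & P_1(x,y,z) &= (y,z), & U_2 &= \{(x,y,z) \in S^2 : x < 0\}, & P_2(x,y,z) &= (y,z),\\
U_3 &= \{(x,y,z) \in S^2 : y > 0\}, & P_3(x,y,z) &= (x,z), & U_4 &= \{(x,y,z) \in S^2 : y < 0\}, & P_4(x,y,z) &= (x,z),\\
U_5 &= \{(x,y,z) \in S^2 : z > 0\}, & P_5(x,y,z) &= (x,y), & U_6 &= \{(x,y,z) \in S^2 : z < 0\}, & P_6(x,y,z) &= (x,y).
\end{aligned}$$
Here $P_i$ is the orthogonal projection onto the tangent space at the pole of each hemisphere. Moreover, all the six charts are $C^\infty$ compatible, and therefore, form an atlas of $S^2$.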
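The six-hemisphere atlas can be written down directly. The sketch below assumes the charts are the open hemispheres $\{x > 0\}$, $\{x < 0\}$, $\{y > 0\}$, $\{y < 0\}$, $\{z > 0\}$, $\{z < 0\}$, each mapped to the plane by dropping the corresponding coordinate; the chart ordering and the function names are illustrative choices.

```python
import numpy as np

# (axis, sign) pairs for the hemispheres {x>0}, {x<0}, {y>0}, {y<0}, {z>0}, {z<0}.
CHARTS = [(0, 1), (0, -1), (1, 1), (1, -1), (2, 1), (2, -1)]

def charts_containing(p):
    """Indices of all hemispheres that contain the point p of the sphere."""
    return [k for k, (axis, sign) in enumerate(CHARTS) if sign * p[axis] > 0]

def chart_map(k, p):
    """Coordinate map of chart k: drop the coordinate along the hemisphere's axis."""
    axis, _ = CHARTS[k]
    return np.delete(p, axis)

def chart_inverse(k, u):
    """Inverse map: recover the dropped coordinate from the unit-norm constraint."""
    axis, sign = CHARTS[k]
    missing = sign * np.sqrt(1.0 - np.dot(u, u))
    return np.insert(u, axis, missing)

p = np.array([0.6, 0.0, 0.8])
members = charts_containing(p)            # p lies in {x>0} and {z>0}
u = chart_map(4, p)                       # local coordinates in the chart {z>0}
p_back = chart_inverse(4, u)              # maps back to p
```

A point generally lies in several charts at once, which is the situation the chart determination step of the network construction has to handle.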
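For a general compact smooth manifold $\mathcal{M}$, we can construct an atlas using orthogonal projections to tangent spaces as local coordinate systems. Let $P_{\mathbf{c}}$ be the orthogonal projection to the tangent space $T_{\mathbf{c}}(\mathcal{M})$ for $\mathbf{c} \in \mathcal{M}$. Let $U_{\mathbf{c}}$ be an open coordinate neighborhood containing $\mathbf{c}$ such that $P_{\mathbf{c}}$ is a homeomorphism. Since $\mathcal{M}$ is compact, there exist a finite number of points $\{\mathbf{c}_i\}$ such that the charts $\{(U_{\mathbf{c}_i}, P_{\mathbf{c}_i})\}$ form an atlas of $\mathcal{M}$.
The existence of an atlas on $\mathcal{M}$ allows us to define differentiable functions.
Definition 4** ($C^s$ Functions on $\mathcal{M}$).**
Let $\mathcal{M}$ be a $d$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^D$. A function $f: \mathcal{M} \to \mathbb{R}$ is $C^s$ if for any chart $(U, \phi)$, the composition $f \circ \phi^{-1}: \phi(U) \to \mathbb{R}$ is continuously differentiable up to order $s$.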
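Remark 1**.**
The definition of $C^s$ functions is independent of the choice of the chart $(U, \phi)$. Suppose $(V, \psi)$ is another chart and $U \cap V \neq \emptyset$. Then we have
$$f \circ \psi^{-1} = (f \circ \phi^{-1}) \circ (\phi \circ \psi^{-1}).$$
Since $\mathcal{M}$ is a smooth manifold, $(U, \phi)$ and $(V, \psi)$ are $C^\infty$ compatible. Thus, $f \circ \phi^{-1}$ is $C^s$ and $\phi \circ \psi^{-1}$ is $C^\infty$, and their composition is $C^s$.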
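We next generalize the definition of $C^s$ functions to Hölder functions on the smooth manifold $\mathcal{M}$.
Definition 5** (Hölder Functions on $\mathcal{M}$).**
Let $\mathcal{M}$ be a $d$-dimensional compact Riemannian manifold isometrically embedded in $\mathbb{R}^D$. Let $\{(U_i, P_i)\}_{i \in \mathcal{A}}$ be an atlas of $\mathcal{M}$ where the $P_i$'s are orthogonal projections onto tangent spaces. For a positive integer $s$ and $\alpha \in (0, 1]$, a function $f: \mathcal{M} \to \mathbb{R}$ is $(s+\alpha)$-Hölder continuous if for each chart $(U_i, P_i)$ in the atlas, we have
1. $f \circ P_i^{-1} \in C^s$ with $|D^{\mathbf{s}}(f \circ P_i^{-1})| \le 1$ for any $|\mathbf{s}| \le s$;
2. for any $|\mathbf{s}| = s$ and $\mathbf{x}_1, \mathbf{x}_2 \in U_i$,
$$\left| D^{\mathbf{s}}(f \circ P_i^{-1})\big|_{P_i(\mathbf{x}_1)} - D^{\mathbf{s}}(f \circ P_i^{-1})\big|_{P_i(\mathbf{x}_2)} \right| \le \|P_i(\mathbf{x}_1) - P_i(\mathbf{x}_2)\|_2^{\alpha}.$$
Moreover, we denote the collection of $(s+\alpha)$-Hölder functions on $\mathcal{M}$ as $\mathcal{H}^{s,\alpha}(\mathcal{M})$.
Definition 5 requires that all $s$-th order derivatives of $f \circ P_i^{-1}$ are Hölder continuous. We recover the standard Hölder class on a Euclidean space if $P_i$ is the identity mapping. We next introduce the partition of unity, which plays a crucial role in our construction of neural networks.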
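Definition 6** (Partition of Unity, Definition 13.4 in Tu (2010)).**
A $C^\infty$ partition of unity on a manifold $\mathcal{M}$ is a collection of nonnegative $C^\infty$ functions $\rho_i: \mathcal{M} \to \mathbb{R}_+$ for $i \in \mathcal{A}$ such that
1. the collection of supports, $\{\mathrm{supp}(\rho_i)\}_{i \in \mathcal{A}}$, is locally finite, i.e., every point on $\mathcal{M}$ has a neighborhood that meets only finitely many of the $\mathrm{supp}(\rho_i)$'s;
2. $\sum_{i \in \mathcal{A}} \rho_i = 1$.
For a smooth manifold, a $C^\infty$ partition of unity always exists.
Proposition 1** (Existence of a partition of unity, Theorem 13.7 in Tu (2010)).**
Let $\{U_i\}_{i \in \mathcal{A}}$ be an open cover of a compact smooth manifold $\mathcal{M}$. Then there is a $C^\infty$ partition of unity $\{\rho_i\}_{i=1}^{\infty}$ where every $\rho_i$ has a compact support such that $\mathrm{supp}(\rho_i) \subset U_j$ for some $j \in \mathcal{A}$.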
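Proposition 1 gives rise to the decomposition $f = \sum_{i=1}^{\infty} f_i$ with $f_i = f \rho_i$. Note that the $f_i$'s have the same regularity as $f$, since
$$f_i \circ \phi^{-1} = (f \circ \phi^{-1}) \times (\rho_i \circ \phi^{-1})$$
for a chart $(U, \phi)$. This decomposition implies that we can express $f$ as a sum of the $f_i$'s, where every $f_i$ is only supported in a single chart.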
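The decomposition can be checked numerically in a one-dimensional toy setting. The sketch below is an illustration only: it takes the interval $[0, 1]$ in place of a manifold, builds a partition of unity by normalizing smooth bump functions centered at five hypothetical points, and verifies that the localized pieces $f \rho_i$ sum back to $f$.

```python
import numpy as np

def bump(t):
    """Smooth bump supported on (-1, 1): exp(-1 / (1 - t^2)) inside, 0 outside."""
    out = np.zeros_like(t, dtype=float)
    inside = np.abs(t) < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - t[inside] ** 2))
    return out

def partition_of_unity(x, centers, radius):
    """Rows are rho_i(x): bumps around each center, normalized to sum to one."""
    raw = np.stack([bump((x - c) / radius) for c in centers])
    return raw / raw.sum(axis=0)

x = np.linspace(0.0, 1.0, 201)
centers = np.linspace(0.0, 1.0, 5)      # hypothetical chart centers
radius = 0.3                            # every x is within 0.3 of some center

rho = partition_of_unity(x, centers, radius)
f = np.sin(2.0 * np.pi * x)
pieces = f * rho                        # f_i = f * rho_i, one row per chart
```

Each row of `pieces` vanishes outside a single interval of half-width `radius`, so it can be handled with the local coordinates of one chart, while the rows together reproduce `f` exactly.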
To characterize the curvature of a manifold, we adopt the following geometric concept.
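Definition 7** (Reach (Federer, 1959), Definition 2.1 in Aamari et al. (2019)).**
Denote
$$\mathcal{C}(\mathcal{M}) = \left\{ \mathbf{x} \in \mathbb{R}^D : \exists\, \mathbf{p} \neq \mathbf{q} \in \mathcal{M},\ \|\mathbf{p} - \mathbf{x}\|_2 = \|\mathbf{q} - \mathbf{x}\|_2 = \inf_{\mathbf{y} \in \mathcal{M}} \|\mathbf{y} - \mathbf{x}\|_2 \right\}$$
as the set of points that have at least two nearest neighbors on $\mathcal{M}$. The reach $\tau > 0$ is defined as
$$\tau = \inf_{\mathbf{x} \in \mathcal{M},\, \mathbf{y} \in \mathcal{C}(\mathcal{M})} \|\mathbf{x} - \mathbf{y}\|_2.$$
Reach has a straightforward geometrical interpretation: At each point $\mathbf{x} \in \mathcal{M}$, the radius of the osculating circle is greater than or equal to $\tau$. Intuitively, a large reach for $\mathcal{M}$ requires the manifold not to change “rapidly” as shown in Figure 2.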
In our proof for the universal approximation theory, reach determines a proper choice of an atlas for $\mathcal{M}$. In Section 4, we choose each chart $U_i$ to be contained in a ball whose radius is bounded in terms of $\tau$. For smooth manifolds with a small $\tau$, we need a large number of charts. Therefore, the reach of a smooth manifold reflects the complexity of the neural network for function approximation on $\mathcal{M}$.
3 Main Results
This section contains our main statistical estimation theory for Hölder functions on low-dimensional manifolds using deep neural networks. We begin with some assumptions on the regression model and the manifold $\mathcal{M}$.
Assumption 1**.**
$\mathcal{M}$ is a $d$-dimensional compact Riemannian manifold isometrically embedded in $\mathbb{R}^D$. There exists a constant $B$ such that, for any point $\mathbf{x} \in \mathcal{M}$, we have $|x_j| \le B$ for all $j = 1, \dots, D$.
Assumption 2**.**
The reach of $\mathcal{M}$ is $\tau > 0$.
Assumption 3**.**
The ground truth function $f_0: \mathcal{M} \to \mathbb{R}$ belongs to the Hölder space $\mathcal{H}^{s,\alpha}(\mathcal{M})$ with a positive integer $s$ and $\alpha \in (0, 1]$.
Assumption 4**.**
The noise $\xi_i$'s are i.i.d. sub-Gaussian with $\mathbb{E}[\xi_i] = 0$ and variance proxy $\sigma^2$, which are independent of the $\mathbf{x}_i$'s.
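For readers who want to experiment, the following sketch generates synthetic data consistent with this setup. It is a toy instance chosen for illustration, not taken from the paper: the manifold is the unit circle (intrinsic dimension 1, reach 1) embedded in the first two coordinates of $\mathbb{R}^D$, so every coordinate is bounded by $B = 1$; the regression function is a smooth function of the angle; the noise is Gaussian, which is sub-Gaussian with variance proxy $\sigma^2$. The function name and default values are assumptions.

```python
import numpy as np

def sample_regression_data(n, D=10, sigma=0.1, seed=0):
    """Draw n pairs (x_i, y_i) with x_i on a unit circle embedded in R^D."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)   # intrinsic coordinate
    X = np.zeros((n, D))
    X[:, 0] = np.cos(theta)
    X[:, 1] = np.sin(theta)                         # remaining D - 2 coordinates stay 0
    f0 = np.sin(3.0 * theta)                        # smooth function on the circle
    y = f0 + sigma * rng.normal(size=n)             # zero-mean Gaussian noise
    return X, y, f0

X, y, f0 = sample_regression_data(n=200)
```

Although each input has $D = 10$ coordinates, the data are governed by the single parameter `theta`, which is the situation in which a rate depending on the intrinsic rather than the ambient dimension is meaningful.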
3.1 Universal Approximation Theory
An accurate estimation of the nonparametric regression function necessitates the existence of a good approximation of $f_0$ by our learning models, namely neural networks. To aid the choice of a proper neural network class for learning $f_0$, we first investigate the following questions:
- Given a desired approximation error $\epsilon > 0$, does there exist a ReLU neural network which universally represents Hölder functions supported on $\mathcal{M}$?
- If the answer is yes, what is the corresponding network architecture?
We provide a positive answer in the theorem below and defer the proof to Section 4.
Theorem 1**.**
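Suppose Assumptions 1 and 2 hold. Given any $\epsilon \in (0, 1)$, there exists a ReLU network structure $\mathcal{F}(\cdot, \kappa, L, p, K)$, such that, for any $f: \mathcal{M} \to \mathbb{R}$ satisfying Assumption 3, if the weight parameters of the network are properly chosen, the network yields a function $\hat{f}$ satisfying $\|\hat{f} - f\|_\infty \le \epsilon$. Such a network has
1. no more than $c_1\left(\log\frac{1}{\epsilon} + \log D\right)$ layers, with width bounded by $c_2\left(\epsilon^{-\frac{d}{s+\alpha}} + D\right)$,
2. at most $c_3\left(\epsilon^{-\frac{d}{s+\alpha}} \log\frac{1}{\epsilon} + D \log\frac{1}{\epsilon} + D \log D\right)$ neurons and weight parameters, with the range of weight parameters bounded by $\kappa$,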
where depend on , , , , the surface area of , and the upper bounds on the derivatives of the coordinate systems ’s and the ’s in the partition of unity, up to order , and depends on the upper bound on the derivatives of the ’s, up to order .
This network class will be used later to estimate a regression function in Theorem 2. Our approximation theory does not require the output range to be bounded by $R$ in the network class (or equivalently by setting $R = \infty$). The enforcement of $\|f\|_\infty \le R$ is to be imposed for regression in order to control the variance in statistical estimations.
The network structure identified by Theorem 1 consists of three sub-networks as shown in Figure 3 (the detailed construction of each sub-network is postponed to Section 4):
- Chart determination sub-network, which assigns each input to its corresponding neighborhood;
- Taylor approximation sub-network, which approximates $f$ by polynomials in each neighborhood;
- Pairing sub-network, which yields multiplications of the proper pairs of the outputs from the chart determination and the Taylor approximation sub-networks.
Theorem 1 significantly improves on existing approximation theories (Yarotsky, 2017), where the network size grows exponentially with respect to the ambient dimension $D$, i.e., $\epsilon^{-\frac{D}{s+\alpha}}$. Theorem 1 also improves Shaham et al. (2018) for $C^s$ functions in the case that $s > 2$. When $s > 2$, our network size scales like $\epsilon^{-d/s}$, which is significantly smaller than the one in Shaham et al. (2018) in the order of $\epsilon^{-d/2}$.
Our approximation theory can be directly generalized to the Sobolev space , which is embedded in . The reason is that our proof of Theorem 1 relies on local Taylor polynomial approximations of Hölder functions. For general Sobolev spaces , one needs to consider averaged Taylor polynomials and the Bramble-Hilbert lemma (Brenner and Scott, 2007, Lemma 4.3.8). We refer to Gühring et al. (2020) for readers’ interests.
Moreover, the size of our ReLU network in Theorem 1 matches the lower bound in DeVore et al. (1989) up to a logarithmic factor for the approximation of functions in the Hölder space $\mathcal{H}^{s,\alpha}$ defined on $[0,1]^d$.
Proposition 2**.**
Fix $d$ and $s$. Let $W$ be a positive integer and $T: \mathbb{R}^W \to C([0,1]^d)$ be any mapping. Suppose there is a continuous map $\Theta: \mathcal{H}^{s,\alpha}([0,1]^d) \to \mathbb{R}^W$ such that $\|f - T(\Theta(f))\|_\infty \le \epsilon$ for any $f \in \mathcal{H}^{s,\alpha}([0,1]^d)$. Then $W \ge c\,\epsilon^{-\frac{d}{s+\alpha}}$ with $c$ depending on $s$ only.
We take $\mathbb{R}^W$ as the parameter space of a ReLU network, and $T$ as the transformation given by the ReLU network. Proposition 2 implies that, to approximate any $f \in \mathcal{H}^{s,\alpha}([0,1]^d)$, the ReLU network needs to have at least $c\,\epsilon^{-\frac{d}{s+\alpha}}$ weight parameters. Although Proposition 2 holds for functions defined on $[0,1]^d$, our network size remains in the same order up to a logarithmic factor even when the function is supported on a manifold of dimension $d$.
On the other hand, the lower bound also reveals that the low-dimensional manifold model plays a vital role in reducing the network size. To uniformly approximate functions in $\mathcal{H}^{s,\alpha}([0,1]^D)$ with an accuracy $\epsilon$, the minimal number of weight parameters is of order $\epsilon^{-\frac{D}{s+\alpha}}$. This lower bound cannot be improved without low-dimensional structures of data.
3.2 Statistical Estimation Theory
Based on Theorem 1, we next present our main regression theorem, which characterizes the convergence rate for the estimation of $f_0$ using ReLU neural networks.
Theorem 2**.**
Suppose Assumptions 1 - 4 hold. Let $\hat{f}_n$ be the minimizer of empirical risk (3) with the network class $\mathcal{F}(R, \kappa, L, p, K)$ properly designed such that
[TABLE]
Then we have
[TABLE]
where the expectation is taken over the training samples , and is a constant depending on , , , , , the surface area of , and the upper bounds of derivatives of the coordinate systems ’s and partition of unity ’s, up to order .
Theorem 2 is established by a bias-variance trade-off. We decompose the mean squared error into a squared bias term and a variance term. The bias is quantified by Theorem 1, and the variance term is proportional to the network size. A detailed proof of Theorem 2 is provided in Section 5. Here are some remarks:
1. The network class in Theorem 2 is sparsely connected, i.e., $K = O(Lp)$, while densely connected networks satisfy $K = O(Lp^2)$.
2. The network class $\mathcal{F}(R, \kappa, L, p, K)$ has outputs uniformly bounded by $R$. Such a requirement can be achieved by appending an additional clipping layer to the end of the network structure, i.e.,
$$g(a) = \max\{-R, \min\{a, R\}\} = \mathrm{ReLU}(a + R) - \mathrm{ReLU}(a - R) - R.$$
3. Each weight parameter in our network class is bounded by a constant $\kappa$ only depending on the curvature $\tau$, the range $B$ of the manifold $\mathcal{M}$, and the manifold dimension $d$. Such a boundedness condition is crucial to our theory and can be computationally realized by normalization after each step of the stochastic gradient descent.
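The clipping layer mentioned in the remarks can be realized with two ReLU units. The sketch below uses the identity $\max\{-R, \min\{a, R\}\} = \mathrm{ReLU}(a + R) - \mathrm{ReLU}(a - R) - R$, which can be checked case by case for $a < -R$, $-R \le a \le R$, and $a > R$; the function name `clip_with_relu` and the test values are illustrative assumptions.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def clip_with_relu(a, R):
    """Clip a to [-R, R] using two ReLU units and one constant shift."""
    return relu(a + R) - relu(a - R) - R

a = np.linspace(-3.0, 3.0, 13)
R = 1.5
clipped = clip_with_relu(a, R)
```

Because the clipping is itself a small ReLU network, appending it keeps the estimator inside the ReLU network class while enforcing the output bound.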
4 Proof of Approximation Theory
This section contains a proof sketch of Theorem 1. Before we proceed, we show how to approximate the multiplication operation using ReLU networks. This operation is heavily used in the Taylor approximation sub-network, since Taylor polynomials involve a sum of products. We first show ReLU networks can approximate quadratic functions.
Lemma 1** (Proposition in Yarotsky (2017)).**
The function $f(x) = x^2$ with $x \in [0, 1]$ can be approximated by a ReLU network with any error $\epsilon > 0$. The network has depth and the number of neurons and weight parameters no more than $c \log(1/\epsilon)$ with an absolute constant $c$, and the width of the network is an absolute constant.
This lemma is proved in Appendix A.1. The idea is to approximate quadratic functions using a weighted sum of a series of sawtooth functions. Those sawtooth functions are obtained by composing the triangular function
$$g(x) = 2\,\mathrm{ReLU}(x) - 4\,\mathrm{ReLU}(x - 1/2) + 2\,\mathrm{ReLU}(x - 1) = \begin{cases} 2x, & x \in [0, 1/2), \\ 2(1 - x), & x \in [1/2, 1], \end{cases}$$
which can be implemented by a single layer ReLU network.
We then approximate the multiplication operation by invoking the identity $xy = \frac{1}{4}\left((x + y)^2 - (x - y)^2\right)$, where the two squares can be approximated by ReLU networks in Lemma 1.
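The sawtooth idea can be tried out numerically. The sketch below follows the standard construction of Yarotsky (2017): with $g_k$ denoting the $k$-fold composition of the triangular function, $f_m(x) = x - \sum_{k=1}^{m} g_k(x)/4^k$ is the piecewise linear interpolant of $x^2$ on a uniform grid with $2^m + 1$ nodes, so its error on $[0, 1]$ is at most $2^{-2m-2}$. The function names are illustrative assumptions.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def tooth(x):
    """Triangular function on [0, 1] written with three ReLU units."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)

def approx_square(x, m):
    """f_m(x) = x - sum_{k=1..m} g_k(x) / 4**k, with g_k the k-fold composition of tooth."""
    out = np.array(x, dtype=float)
    g = np.array(x, dtype=float)
    for k in range(1, m + 1):
        g = tooth(g)                 # sawtooth with 2**(k-1) teeth
        out = out - g / 4.0 ** k
    return out

x = np.linspace(0.0, 1.0, 1001)
m = 5
max_err = np.abs(approx_square(x, m) - x ** 2).max()
```

Each additional composition adds a constant number of units and divides the error by four, which is the source of the logarithmic dependence of the depth and size on the target accuracy.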
Corollary 1** (Proposition in Yarotsky (2017)).**
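Given a constant $C > 0$ and $\epsilon \in (0, C^2)$, there is a ReLU network which implements a function $\hat{\times}: \mathbb{R}^2 \to \mathbb{R}$ such that: 1). For all inputs $x$ and $y$ satisfying $|x| \le C$ and $|y| \le C$, we have $|\hat{\times}(x, y) - xy| \le \epsilon$; 2). The depth and the number of weight parameters of the network are no more than $c \log\frac{C^2}{\epsilon}$ with an absolute constant $c$.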
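A minimal sketch of such an approximate multiplication is given below. It assumes the polarization identity $xy = \frac{1}{4}\left((x+y)^2 - (x-y)^2\right)$ and replaces each square by a sawtooth approximation of $t^2$ on $[0, 1]$ after rescaling by $2C$; the absolute value can itself be written as $\mathrm{ReLU}(a) + \mathrm{ReLU}(-a)$. With $m$ sawtooth compositions the error is at most $C^2 2^{-2m-1}$. All function names and the values of $C$ and $m$ are illustrative assumptions.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def tooth(x):
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)

def approx_square(x, m):
    """Sawtooth approximation of x**2 on [0, 1], error at most 2**(-2m-2)."""
    out = np.array(x, dtype=float)
    g = np.array(x, dtype=float)
    for k in range(1, m + 1):
        g = tooth(g)
        out = out - g / 4.0 ** k
    return out

def approx_multiply(x, y, C, m):
    """Approximate x*y for |x|, |y| <= C via ((x+y)^2 - (x-y)^2) / 4."""
    scale = 2.0 * C                                   # |x + y| and |x - y| are at most 2C
    sq_sum = scale ** 2 * approx_square(np.abs(x + y) / scale, m)
    sq_diff = scale ** 2 * approx_square(np.abs(x - y) / scale, m)
    return 0.25 * (sq_sum - sq_diff)

C, m = 2.0, 6
grid = np.linspace(-C, C, 41)
X, Y = np.meshgrid(grid, grid)
max_err = np.abs(approx_multiply(X, Y, C, m) - X * Y).max()
```

The error bound follows from the two square approximations: each contributes at most $(2C)^2 2^{-2m-2}$, and the prefactor $\frac{1}{4}$ applied to their difference gives $C^2 2^{-2m-1}$ in total.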
The ReLU network in Theorem 1 is constructed in the following 5 steps.
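Step 1. Construction of an atlas. Denote the open Euclidean ball with center $\mathbf{c}$ and radius $r$ in $\mathbb{R}^D$ by $B(\mathbf{c}, r)$. For any $r > 0$, the collection $\{B(\mathbf{x}, r)\}_{\mathbf{x} \in \mathcal{M}}$ is an open cover of $\mathcal{M}$. Since $\mathcal{M}$ is compact, there exists a finite collection of points $\mathbf{c}_i$ for $i = 1, \dots, C_{\mathcal{M}}$ such that
$$\mathcal{M} \subset \bigcup_{i=1}^{C_{\mathcal{M}}} B(\mathbf{c}_i, r).$$
The following lemma says that when the radius $r$ is properly chosen, $U_i = \mathcal{M} \cap B(\mathbf{c}_i, r)$ is diffeomorphic to an open subset of $\mathbb{R}^d$.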
Lemma 2.
Suppose Assumptions 1 and 2 hold and let . Then the local neighborhood is diffeomorphic to . In particular, the orthogonal projection onto the tangent space at is a diffeomorphism.
The proof is provided in Appendix B.1, which utilizes the results in Niyogi et al. (2008). Therefore, we pick radius , and let be an atlas on as illustrated in Figure 4, where is to be defined in Step 2. The number of charts is upper bounded by
[TABLE]
where is the surface area of , and is the thickness of the ’s, which is defined as the average number of ’s that contain a point on (See Eq. (1) in Chapter of Conway et al. (1987)).
Remark 2.
The thickness scales approximately linearly in . As shown in Eq. (19) in Chapter of Conway et al. (1987), there exist coverings with .
Step 2. Projection with rescaling and translation. We denote the tangent space at as
[TABLE]
where form an orthonormal basis. We obtain the matrix by concatenating the ’s as column vectors.
Define
[TABLE]
for any , where is a scaling factor and is a translation vector. Since is bounded, we can choose proper and to guarantee . We rescale and translate the projection to ease the notation for the development of local Taylor approximations in Step 4. We also remark that each is a linear function, and can be realized by a single layer linear network.
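To make the projection step concrete, here is a minimal sketch. It assumes the chart map has the affine form `scale * (V^T (x - center) + shift)`, which is consistent with the description above (orthonormal tangent basis, scaling factor, translation vector) but is not copied from the paper's display. All names (`project_to_chart`, `center`, `basis`, `scale`, `shift`) are hypothetical.

```python
def project_to_chart(x, center, basis, scale, shift):
    """Project x onto a tangent space, then translate and rescale.

    x, center : points in the ambient space (lists of length D)
    basis     : list of d orthonormal tangent vectors (each of length D)
    scale     : positive scaling factor (assumed)
    shift     : translation vector of length d (assumed)
    """
    diff = [xj - cj for xj, cj in zip(x, center)]
    # Tangent-space coordinates: inner products with the basis vectors.
    coords = [sum(vj * dj for vj, dj in zip(v, diff)) for v in basis]
    return [scale * (c + u) for c, u in zip(coords, shift)]


# Toy example: D = 3, d = 2, tangent plane spanned by the first two axes.
example = project_to_chart(
    x=[0.2, -0.4, 0.9],
    center=[0.0, 0.0, 1.0],
    basis=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    scale=0.5,
    shift=[1.0, 1.0],
)
```

Because the map is affine in `x`, it needs no nonlinearity; the scale and shift are chosen only so that the projected coordinates fall inside the unit cube.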
Step 3. Chart determination. This step is to assign a given input to the proper charts to which belongs. This avoids projecting using unmatched charts (i.e., for some ) as illustrated in Figure 5.
An input can belong to multiple charts, and the chart determination sub-network determines all these charts. This can be realized by composing an indicator function and the squared Euclidean distance
[TABLE]
for . The squared distance is a sum of univariate quadratic functions, thus, we can apply Lemma 1 to approximate by ReLU networks. Denote as an approximation of the quadratic function on with an approximation error . Then we define
[TABLE]
as an approximation of . The approximation error is , by the triangle inequality. We consider an approximation of the indicator function as in Figure 6:
[TABLE]
where () will be chosen later according to the accuracy .
To implement , we consider a basic step function . It is straightforward to check
[TABLE]
Let . It suffices to choose satisfying , which yields . We use to approximate the indicator function on :
- if , i.e., , we have ;
- if and , we have .
We remark that although the approximate indicator function is a piecewise linear function with two breakpoints, we implement it using a deep neural network to control the range of weight parameters in the network. Otherwise, the parameter upper bound can be as large as due to the steep slope in , which undermines the statistical theory.
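The trade-off between depth and weight magnitude can be seen in a small sketch. It assumes a basic step function of slope 2 with a generic threshold `t` (a hypothetical stand-in for the threshold used in the paper) and composes it `k` times; the names `basic_step` and `approx_indicator` are illustrative only.

```python
def relu(x):
    """ReLU activation."""
    return max(x, 0.0)


def basic_step(x, t):
    """Slope-2 ramp: 0 for x <= t/2, 2x - t on [t/2, t], and t for x >= t."""
    return 2.0 * relu(x - 0.5 * t) - 2.0 * relu(x - t)


def approx_indicator(x, t, k):
    """Approximate the indicator of {x < t} by composing basic_step k times.

    The composition is 0 for x <= (1 - 2**(-k)) * t and t for x >= t, with a
    linear ramp of slope 2**k in between. Every layer only uses weights of
    magnitude 2 and biases of magnitude at most t.
    """
    y = x
    for _ in range(k):
        y = basic_step(y, t)
    return 1.0 - y / t
```

Each extra layer halves the width of the transition region, so a transition of width `delta` needs roughly `log2(t / delta)` layers. A two-breakpoint shallow network would instead need a weight of order `t / delta`.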
Step 4. Taylor approximation. In each chart , we locally approximate using Taylor polynomials of order as shown in Figure 7. Specifically, we decompose as
[TABLE]
where is an element in a partition of unity on which is supported inside . The existence of such a partition of unity is guaranteed by Proposition 1. Since is a compact smooth manifold and is , preserves the regularity (smoothness) of such that for .
Lemma 3.
Suppose Assumption 3 holds. For , the function is Hölder continuous on , in the sense that there exists a Hölder coefficient depending on the upper bounds of derivatives of the partition of unity and coordinate system , up to order , such that for any , we have
[TABLE]
Proof Sketch.
We provide a sketch here. More details are deferred to Appendix B.2. Without loss of generality, suppose Assumption 3 holds with the atlas chosen in Step 1. Denote and . By the Leibniz rule, we have
[TABLE]
Consider each term in the sum: for any ,
[TABLE]
Here and are uniform upper bounds on the derivatives of and with order up to , respectively. The quantities and in the last inequality above are chosen as follows: by the mean value theorem, we have
[TABLE]
where the last inequality is due to the fact that . Then we set and by a similar argument, we set . We complete the proof by taking . ∎
Lemma 3 is crucial for the error estimation in the local approximation of by Taylor polynomials. This error estimate is given in the following theorem, where some of the proof techniques are from Theorem in Yarotsky (2017).
Theorem 3.
Let as in Step 4. For any , there exists a ReLU network structure that, if the weight parameters are properly chosen, the network yields an approximation of uniformly with an error . Such a network has
1. no more than layers, with width bounded by ,
2. at most neurons and weight parameters, with the range of weight parameters bounded by ,
where depend on , , and the upper bound of derivatives of up to order , and depends on the upper bound of the derivatives of ’s up to order .
Proof Sketch.
The detailed proof is provided in Appendix B.3. The proof consists of two steps:
1. Approximate using a weighted sum of Taylor polynomials;
2. Implement the weighted sum of Taylor polynomials using ReLU networks.
Specifically, we set up a uniform grid and divide into small cubes, and then approximate by its -th order Taylor polynomial in each cube. To implement such polynomials by ReLU networks, we recursively apply the multiplication operator in Corollary 1, since these polynomials are sums of the products of different variables. ∎
Step 5. Estimating the total error. We have collected all the ingredients to implement the entire ReLU network to approximate on . Recall that the network structure consists of 3 main sub-networks as demonstrated in Figure 3. Let be an approximation to the multiplication operator in the pairing sub-network with error . Accordingly, the function given by the whole network is
[TABLE]
where is the approximation of using Taylor polynomials in Theorem 3. The total error can be decomposed into three components according to Lemma 4 below. We denote as the indicator function of . Let the approximation errors of the multiplication operation and the local Taylor polynomial in Theorem 3 be and , respectively.
Lemma 4.
For any , we have , where
[TABLE]
Lemma 4 is proved in Appendix B.4. In order to achieve an total approximation error, i.e., , we need to control the errors in the three sub-networks. In other words, we need to decide for , for , for , and for . Note that is the error from the pairing sub-network, is the approximation error in the Taylor approximation sub-network, and is the error from the chart determination sub-network. The error bounds on are straightforward from the constructions of and . The estimate of involves some technical analysis since . Note that we have
[TABLE]
whenever or . Therefore, we only need to prove that is sufficiently small in the shell region
[TABLE]
We bound the maximum of on using a first-order Taylor expansion. Since vanishes at the boundary of due to the partition of unity , we can show that is proportional to the width of . In particular, there exists a constant depending on ’s and ’s such that
[TABLE]
Then (6) immediately implies the upper bound on . The formal statement of (6) and its proof are deferred to Lemma 8 and Appendix B.5.
Given Lemma 4, we choose
[TABLE]
so that the approximation error is bounded by . Moreover, we choose
[TABLE]
to guarantee so that the definition of is valid.
Finally we quantify the size of the ReLU network. Recall that the chart determination sub-network has layers, the Taylor approximation sub-network has layers, and the pairing sub-network has layers. Here depends on , and are absolute constants. Combining these with (7) and (8) yields the depth in Theorem 1. By a similar argument, we can obtain the number of neurons and weight parameters. A detailed analysis is given in Appendix B.6.
5 Proof of the Statistical Estimation Theory
In the proof of Theorem 2, we decompose the mean squared error of the estimator into a squared bias term and a variance term. We bound the bias and variance separately, where the bias is tackled using the approximation theory (Theorem 1), and the variance is bounded using the metric entropy arguments (van der Vaart and Wellner, 1996; Györfi et al., 2006). We begin with an oracle-type decomposition of the risk, in which we introduce the empirical risk as the intermediate term:
[TABLE]
where reflects the squared bias of using neural networks for estimating and is the variance term. We slightly abuse the notation to denote the index of samples.
5.1 Bias Characterization — Bounding
Since is the empirical risk of evaluated on the samples , we relate to the empirical risk (3) by rewriting . Substituting into , we derive the following decomposition,
[TABLE]
Equality is obtained by expanding the square, where the cross term due to the independence between and . Inequality invokes Jensen’s inequality to pass the expectation. To obtain term , we expand , and observe the cancellation of . Note that term is the squared approximation error of neural networks, and we will tackle it later using Theorem 1. We bound term by quantifying the complexity of the network class . A precise upper bound of is given in the following lemma, whose proof follows an argument similar to that in Schmidt-Hieber (2017, Lemma 4).
Lemma 5.
Fix the neural network class . For any constant , we have
[TABLE]
where denotes the -covering number of with respect to the norm, i.e., there exists a discretization of into distinct elements, such that for any , there is in the discretization satisfying .
Proof Sketch.
Given the derivation in (9), we need to bound term . We discretize the neural network class as . By the definition of covering, there exists such that . Denoting , we cast into
[TABLE]
where follows from Hölder’s inequality and is obtained by some algebraic manipulation. To break the dependence between and the samples, we replace by any in the -covering and observe that . Applying the Cauchy-Schwarz inequality, we can show
[TABLE]
where . Given , we note that is a sub-Gaussian random variable with parameter (i.e., its variance is bounded by ). It is well established in the existing literature on empirical processes (van der Vaart and Wellner, 1996) that the maximum of a collection of squared sub-Gaussian random variables satisfies
[TABLE]
Substituting the above inequality into and combining and , we have
[TABLE]
Some manipulation gives rise to the desired result
[TABLE]
See proof details in Appendix C.1. ∎
5.2 Variance Characterization — Bounding
We observe that is the difference between the population risk of and its empirical counterpart. However, bounding such a difference is distinct from conventional concentration results due to the scaling factor before the empirical risk. In particular, we split the empirical risk evenly into two parts, and bound one part using its higher-order moment (fourth moment). Using a Bernstein-type inequality allows us to establish a convergence rate of ; the corresponding upper bound is presented in the following lemma.
Lemma 6.
For any constant , satisfies
[TABLE]
Proof Sketch.
The detailed proof is deferred to Appendix C.2. For notational simplicity, we denote and . Applying the inequality (Barron, 1991), we rewrite as
[TABLE]
We now utilize ghost samples of to bound , which is a common technique in existing literature on nonparametric statistics (van der Vaart and Wellner, 1996; Györfi et al., 2006). Specifically, let ’s be independent replications of ’s. We bound as
[TABLE]
where . We use the shorthand to denote the double integral with respect to the joint distribution of . The last inequality holds due to Jensen’s inequality. Note here contributes as the variance term of , which yields a fast convergence of as grows.
Similar to bounding , we discretize the function space using a -covering denoted by . This allows us to replace the supremum by the maximum over a finite set:
[TABLE]
We can bound the above maximum by Bernstein’s inequality, which yields
[TABLE]
The last step is to relate the covering number of to that of . Specifically, consider any with and , respectively. We can derive
[TABLE]
Therefore, the inequality holds, which implies
[TABLE]
The proof is complete. ∎
5.3 Covering Number of Neural Networks
The upper bounds of and in Lemmas 5 and 6 both depend on the covering number of the network class . In this section, we provide an upper bound on the covering number for a given resolution . Since each weight parameter in the network is bounded by a constant , we construct a covering by partitioning the range of each weight parameter into a uniform grid. By choosing a proper grid size, we show the following lemma.
Lemma 7.
Given , the -covering number of the neural network class satisfies
[TABLE]
Proof Sketch.
Consider with each weight parameter differing by at most . By induction on the number of layers in the network, we show that the norm of the difference scales as
[TABLE]
As a result, to achieve a -covering, it suffices to choose such that . Moreover, there are different choices of non-zero entries out of weight parameters. Therefore, the covering number is bounded by
[TABLE]
The detailed proof is provided in Appendix C.3. ∎
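The counting argument behind a grid-based covering can be sketched as follows. This is only an illustration under stated assumptions, not the bound of Lemma 7: it assumes that perturbing every weight by at most `h` changes the network output by at most `lipschitz * h` in sup norm, and that the number of sparsity patterns is bounded crudely by `num_slots ** num_nonzero`. All names and arguments are hypothetical.

```python
import math


def log_covering_bound(delta, kappa, num_nonzero, num_slots, lipschitz):
    """Log of a grid-based upper bound on the delta-covering number.

    delta       : target covering resolution
    kappa       : bound on the magnitude of each weight
    num_nonzero : number of non-zero weights
    num_slots   : total number of weight positions in the architecture
    lipschitz   : assumed constant L with sup-norm change <= L * h
    """
    h = delta / lipschitz                      # grid size that achieves delta
    per_weight = math.ceil(2.0 * kappa / h)    # grid values in [-kappa, kappa]
    # (sparsity patterns) * (grid values per non-zero weight), on a log scale.
    return num_nonzero * (math.log(num_slots) + math.log(per_weight))
```

The bound grows linearly in the number of non-zero weights and only logarithmically in `1 / delta`, which is why the log covering number enters the risk bounds as a complexity term of moderate size.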
5.4 Bias-Variance Trade-off
We are ready to finish the proof of Theorem 2. Combining the upper bounds of in Lemma 5 and in Lemma 6 together and substituting the covering number (10), we obtain
[TABLE]
It suffices to choose , which gives rise to
[TABLE]
where we also plug in the covering number upper bound in Lemma 7. We further set the approximation error as , i.e., . Theorem 1 suggests that we choose , , and . Substituting , , and into (5.4), we have
[TABLE]
To balance the error terms, we pick satisfying , which gives . The proof of Theorem 2 is complete by plugging in and rearranging the terms.
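The balancing step can be checked numerically. The sketch below ignores logarithmic factors and assumes the balance takes the form `eps**2 = eps**(-d / (s + alpha)) / n`, which reproduces the rate `n^{-2(s+\alpha)/(2(s+\alpha)+d)}` stated for the main result. The function names are hypothetical, and `s + alpha` denotes the Hölder smoothness index.

```python
def rate_exponent(s, alpha, dim):
    """Exponent r such that the squared error decays like n**(-r)."""
    smooth = s + alpha
    return 2.0 * smooth / (2.0 * smooth + dim)


def balanced_epsilon(n, s, alpha, dim):
    """Approximation error eps solving eps**2 = eps**(-dim/(s+alpha)) / n."""
    smooth = s + alpha
    return n ** (-smooth / (2.0 * smooth + dim))


# Same smoothness, intrinsic dimension 2 versus a hypothetical ambient dimension 100.
intrinsic = rate_exponent(1, 1.0, 2)
ambient = rate_exponent(1, 1.0, 100)
```

For this toy setting the exponent is `2/3` with dimension 2 but below `0.04` with dimension 100, which illustrates why a rate governed by the intrinsic dimension is much faster.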
6 Conclusion
We study nonparametric regression of functions supported on a -dimensional Riemannian manifold isometrically embedded in , using deep ReLU neural networks. Our result establishes an efficient statistical estimation theory for general regression functions including and Hölder functions supported on manifolds. We show that the loss for the estimation of converges in the order of . To obtain an -error for the estimation of , the sample complexity scales in the order of . This sample complexity depends on the intrinsic dimension , and demonstrates that deep neural networks are adaptive to low-dimensional geometric structures of data sets. Such results can be viewed as theoretical justifications for the empirical success of deep learning in various real-world applications where the data sets exhibit low-dimensional structures.
Acknowledgment
This work was supported by NSF DMS , NSF DMS 2012652, and NSF IIS-1717916.
Appendix A Proofs of the Preliminary Results in Section 4
A.1 Proof of Lemma 1
Proof.
We partition the interval uniformly into subintervals for . We approximate on these subintervals by a linear interpolation
[TABLE]
It is straightforward to check that meets at the endpoints of .
We evaluate the approximation error of on the interval :
[TABLE]
Note that this approximation error does not depend on . Thus, in order to achieve an approximation error, we only need
[TABLE]
Since , we let and denote . We compute the increment from to for as
[TABLE]
We observe that is a triangular function on . The maximum is independent of attained at . The minimum is [math] attained at the endpoints . To implement , we consider a triangular function representable by a one-layer ReLU network:
[TABLE]
Denote by the composition of totally functions . Observe that is a sawtooth function with peaks at for , and we have for . Then we have . By induction, we have
[TABLE]
Therefore, can be implemented by a ReLU network of depth . Meanwhile, each layer consists of at most 3 neurons. Hence, the total number of neurons and weight parameters is no more than for an absolute constant . ∎
A.2 Proof of Corollary 1
Proof.
Let be an approximation of the quadratic function on with error . We set
[TABLE]
Now we determine . We bound the error of
[TABLE]
Thus, we pick to ensure for any inputs and . As shown in Lemma 1, we can implement using a ReLU network of depth at most with absolute constants . The proof is complete. ∎
Appendix B Proof of Approximation Theory of ReLU Network (Theorem 1)
This section consists of the detailed proofs of Lemma 2, Lemma 3, local approximation theory Theorem 3, error decomposition in Lemma 4 and a technical Lemma 8 for bounding the error, as well as the configuration of the desired ReLU network class for universally approximating Hölder functions.
B.1 Proof of Lemma 2
Proof.
We first show defined on is a homeomorphism, which implies is a chart on the manifold. Then by Proposition 6.10 in Tu (2010), we conclude is a diffeomorphism.
To show is a homeomorphism on , we only need to show has a continuous inverse. By Lemma 5.4 in Niyogi et al. (2008), the derivative of is nonsingular in . The inverse function theorem implies that is locally invertible in an open neighborhood for some constant . In the following, we show by contradiction that the constant . Suppose not, there exist distinct points such that with and . Using the triangle inequality, we obtain . Applying Proposition 6.3 in Niyogi et al. (2008), we derive
[TABLE]
Furthermore, using Proposition 6.2 in Niyogi et al. (2008), we lower bound the angle between the tangent spaces and by
[TABLE]
On the other hand, we consider a unit speed geodesic starting from and ending at , with , , and . Integration by parts yields
[TABLE]
Rearranging terms gives rise to
[TABLE]
where the last inequality follows from Proposition 6.1 in Niyogi et al. (2008). Dividing (B.2) by and plugging in , we have
[TABLE]
For any unit vector , we evaluate the inner product
[TABLE]
where in equality , since by our assertion. Combining (B.1) and (B.1), we obtain
[TABLE]
which is a contradiction. Therefore, we conclude that is injective, and hence invertible on the local neighborhood . The continuity of follows from its definition, and the inverse map of a continuous map is also continuous. Therefore, is a homeomorphism on for .
The last step is to show is also a diffeomorphism. We leverage the following proposition.
Proposition 3 (Proposition 6.10 in Tu (2010)).
If is a chart on a manifold , then the coordinate map is a diffeomorphism.
Since is a homeomorphism, we deduce that is a chart of . Applying Proposition 3, we conclude that is a diffeomorphism. ∎
B.2 Proof of Lemma 3
Proof.
Recall that we choose local coordinate neighborhood in Step 1 in Section 4. Let be the projection onto the tangent space . Then is an atlas of . Without loss of generality, we assume that verifies the Hölder condition in Definition 5. Now we rewrite as
[TABLE]
By the definition of the partition of unity, we know is . This implies that is continuously differentiable. Since is compact, the -th derivative of is uniformly bounded by for any . Let . We have for any and ,
[TABLE]
The last inequality follows from and . Observe that is bounded, hence, we have . Absorbing into , we have the derivative of is Hölder continuous. We denote . Similarly, is by Assumption 3. Then there exists a constant such that the -th derivative of is uniformly bounded by for any . These derivatives are also Hölder continuous with coefficient .
By the Leibniz rule, for any , we expand the -th derivative of as
[TABLE]
Consider each summand in the above right-hand side. For any , we derive
[TABLE]
Observe that there are summands in total on the right-hand side of (B.4). Therefore, for any and , we have
[TABLE]
∎
B.3 Proof of Theorem 3
Proof.
The proof consists of two steps. We first approximate by a Taylor polynomial, and then implement the Taylor polynomial using a ReLU network. To ease the analysis, we extend to the whole cube by assigning for . It is straightforward to check that this extension preserves the regularity of , since vanishes on the complement of the compact set . For notational simplicity, we denote with the extension. Accordingly, Lemma 3 can be extended to the whole cube without changing its proof, i.e., for any and , we have
[TABLE]
Step 1. We define a trapezoid function
[TABLE]
Note that we have . Let be a positive integer. We form a uniform grid on by dividing each coordinate into subintervals. We then consider a partition of unity on this grid defined by
[TABLE]
We can check that as in Figure 8.
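The partition-of-unity property can be verified numerically. The sketch below assumes the trapezoid function of Yarotsky (2017), equal to 1 on `[-1, 1]`, decreasing linearly to 0 on `1 <= |x| <= 2`, and zero beyond, with bumps centered at the grid points `m / N` and rescaled by `3N`. The names `trapezoid`, `bump`, and `partition_sum` are hypothetical.

```python
from itertools import product


def trapezoid(x):
    """Trapezoid function: 1 on [-1, 1], linear on 1 <= |x| <= 2, 0 beyond."""
    ax = abs(x)
    if ax <= 1.0:
        return 1.0
    if ax >= 2.0:
        return 0.0
    return 2.0 - ax


def bump(x, m, n_grid):
    """Product over coordinates of trapezoid(3N * (x_k - m_k / N))."""
    out = 1.0
    for xk, mk in zip(x, m):
        out *= trapezoid(3.0 * n_grid * (xk - mk / n_grid))
    return out


def partition_sum(x, n_grid):
    """Sum of all bumps over the grid {0, ..., N}^d, evaluated at x in [0, 1]^d."""
    d = len(x)
    return sum(bump(x, m, n_grid) for m in product(range(n_grid + 1), repeat=d))
```

Each bump is supported within distance `2 / (3N)` of its grid point in every coordinate, so at most two bumps per coordinate are active at any input, and their values add up to one.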
We also observe that . We use the slightly enlarged support set of length to simplify the constant computation. Now we construct a Taylor polynomial of degree for approximating at :
[TABLE]
Define . We bound the approximation error :
[TABLE]
Here is the linear interpolation of and , determined by the Taylor remainder, and inequality follows from the Taylor expansion of around . Note that only -th order derivative remains in step and there are at most terms. Inequality is obtained by the Hölder continuity in the inequality (B.5).
By setting
[TABLE]
we get . Accordingly, the approximation error is bounded by .
Step 2. We next implement by a ReLU network that approximates up to an error . We denote
[TABLE]
where $a_{\mathbf{m},\mathbf{s}}=\frac{D^{\mathbf{s}}f_{i}^{\phi}}{\mathbf{s}!}\Big|_{\mathbf{x}=\frac{\mathbf{m}}{N}}$. Then we rewrite as
[TABLE]
Note that (B.6) is a linear combination of products . Each product involves at most univariate terms: terms for and terms for . We recursively apply Corollary 1 to implement the product. Specifically, let be the approximation of the product operator in Corollary 1 with error , which will be chosen later. Consider the following chain application of :
[TABLE]
Now we estimate the error of the above approximation. Note that we have and for all and . We then have
[TABLE]
Moreover, we have , if . Now we define
[TABLE]
The approximation error is bounded by
[TABLE]
We choose , so that . Thus, we eventually have . Now we compute the depth and computational units for implementing . can be implemented by a collection of parallel sub-networks that compute each . The total number of parallel sub-networks is bounded by . For each sub-network, we observe that can be exactly implemented by a single layer ReLU network, i.e., . Corollary 1 shows that can be implemented by a depth ReLU network. Therefore, the whole network for implementing has no more than layers with width bounded by and neurons and weight parameters. With and $N=\Big\lceil\big(\frac{\mu_{i}\lambda_{i}(2r)^{1-\alpha}2^{d+s+2}d^{s+\alpha/2}}{\delta s!}\big)^{\frac{1}{s+\alpha}}\Big\rceil$, we obtain that the whole network has no more than layers, with width bounded by , and at most neurons and weight parameters, for constants depending on , and the upper bound of derivatives of , up to order . Lastly, from (B.6), we see each parameter has a range bounded by the upper bound of derivatives of up to order , which scales as as in (B.5). ∎
B.4 Proof of Lemma 4
Proof.
We expand the estimation error as
[TABLE]
The first two terms are straightforward to handle, since by the construction we have
[TABLE]
By Lemma 8, we have for a constant depending on . Then we bound as
[TABLE]
∎
B.5 Helper Lemma for Bounding and Its Proof
Lemma 8.
For any , denote
[TABLE]
Then there exists a constant depending on the upper bounds of the first derivatives of the partition of unity ’s and coordinate system ’s such that
[TABLE]
Proof.
We extend to the whole cube as in the proof of Theorem 3. We also have for . By the first order Taylor expansion, for any , we have
[TABLE]
where is a linear interpolation of and satisfying the mean value theorem. Since is in , the first derivative is uniformly bounded, i.e., for any . Let satisfying . In order to bound the function value for any , we only need to bound the Euclidean distance between and . More specifically, for any , we need to show that there exists satisfying , such that is sufficiently small.
Before continuing with the proof, we introduce some notations. Let be a geodesic on parameterized by the arc length. In the following context, we use and to denote the first and second derivatives of with respect to . By the definition of geodesic, we have (unit speed) and .
Without loss of generality, we shift to . We consider a geodesic starting from with initial “velocity” in the tangent space of at . To utilize polar coordinates, we define two auxiliary quantities: and . As can be seen in Figure 9, and have clear geometrical interpretations: is the radial distance from the center , and is the angle between the velocity and .
Suppose , we need to upper bound . Note that . Moreover, observe that the derivative of is , since has unit speed. It suffices to find a lower bound on so that .
We immediately have the second derivative of as . Meanwhile, using the equation , we also have
[TABLE]
Note that by definition, we have and . Plugging into (B.7), we can derive
[TABLE]
Now we find a lower bound on . Specifically, by Cauchy-Schwarz inequality, we have
[TABLE]
The last inequality follows from (Niyogi et al., 2008) and . We now need to bound , given and . Consider the following optimization problem,
[TABLE]
By assigning and , the optimal objective value is exactly the minimum of . Additionally, we can find the maximum of by replacing the minimization in (B.9) by maximization. We solve (B.9) by the Lagrangian method. More precisely, let
[TABLE]
We have the optimal solution satisfying , which implies with and being the optimal dual variable. By the primal feasibility, we have and . Therefore, the optimal objective value is . Similarly, the maximum is . Note that , we then get
[TABLE]
Substituting into (B.8), we have the following lower bound
[TABLE]
Now combining with , we can derive
[TABLE]
Inequality (B.10) has an important implication: When , as increases, is monotone decreasing until for some . Thus, we distinguish two cases depending on the value of . Indeed, we only need to consider . The reason is that if , we only need to set the initial velocity in the opposite direction.
Case 1: . We claim that for all . In fact, suppose there exists some such that . By the continuity of , there exists , such that and for . This already gives us a contradiction:
[TABLE]
Therefore, we have , and thus .
Case 2: $\theta(0)\in\big(\arcsin\frac{r}{\tau},\pi/2\big]$. It is enough to show that can be bounded sufficiently away from . Let be a geodesic from to . We analogously define and as for the geodesic from to . Let , and denote . We must have and , otherwise there exists satisfying . Denote satisfying . We bound as follows,
[TABLE]
If there exists some such that , by the previous reasoning, we have . Thus, we only need to handle the case when for all . In this case, is monotone decreasing, hence we further have
[TABLE]
The last inequality follows from . Using the fact, , we can derive
[TABLE]
We can then set , and thus
[TABLE]
Therefore, we have . By the choice of , we immediately have . Hence, combining case 1 and case 2, we conclude
[TABLE]
Therefore, the function value on is bounded by . It suffices to set , and we complete the proof. ∎
B.6 Characterization of the Size of the ReLU Network
Proof.
We evenly split the error into parts for , and , respectively. We pick so that . The same argument yields . Analogously, we can choose . Finally, we pick so that .
Now we compute the number of layers, width, the number of neurons and weight parameters, and the range of each weight parameter in the ReLU network identified by Theorem 1.
1. For the chart determination sub-network, can be implemented by a ReLU network with layers and neurons in each layer. The weight parameters in the network are bounded by . The approximation of the distance function can be implemented by a network of depth , width bounded by a constant, and the number of neurons and weight parameters is at most . Each weight parameter is bounded by . Plugging in our choice of and , we have the depth is no greater than with depending on , and the surface area of . The number of neurons and weight parameters is also except for a different constant. Note that there are parallel networks computing for . Hence, the total number of neurons and weight parameters is with depending on , and the surface area of . As can be seen, the width of the chart-determination network is bounded by , and the weight parameter is bounded by .
2. For the Taylor polynomial sub-network, can be implemented by a linear network with at most weight parameters. To implement each , we need a ReLU network of depth . The number of neurons and weight parameters is , and the width is bounded by . Here depend on . In addition, all the weight parameters are bounded by the upper bound of the derivatives of up to order (which scales as as in Lemma 3). Substituting , we get the depth is and the number of neurons and weight parameters is . There are in total parallel ’s, hence the width is further bounded by . Meanwhile, the total number of neurons and weight parameters is . Here constants and depend on , and the surface area of .
3. For the product sub-network, the analysis is similar to the chart determination sub-network. The depth is , the width is bounded by a constant, the number of neurons and weight parameters is , and all the weight parameters are bounded by a constant. The choice of yields that the depth is , and the number of neurons and weight parameters is . There are parallel pairs of outputs from the chart determination and the Taylor polynomial sub-networks. Hence, the total number of weight parameters is with depending on , and the surface area of .
Combining these 3 sub-networks, and redefining the constants , , and in the sequel, we obtain that the depth of the full network is for some constant depending on , and the surface area of . The width of the neural network is bounded by with depending on , the surface area of , and the upper bounds on derivatives of ’s and ’s, up to order . The total number of neurons and weight parameters is for some constant depending on , and the surface area of . Lastly, all the weight parameters in the network are bounded by with depending on the upper bound of derivatives of ’s up to order . ∎
Appendix C Proof of Statistical Recovery of ReLU Network (Theorem 2)
This section consists of the detailed proofs, in Sections C.1, C.2, and C.3, respectively, for upper bounding bias in Lemma 5, upper bounding variance in Lemma 6, and upper bounding covering number in Lemma 7. Lastly, the statistical bound in Theorem 2 is established in Section C.4 by choosing a proper approximation error and covering accuracy via the bias-variance trade-off argument.
C.1 Proof of Lemma 5
Proof.
essentially reflects the bias of estimating :
[TABLE]
where follows from due to the independence between and , and follows from Jensen’s inequality. Now we need to bound . We discretize the class into , where denotes the -covering number with respect to the norm. Accordingly, there exists such that . Denote . Then we have
[TABLE]
Here is obtained by applying Hölder’s inequality to and invoking Jensen’s inequality:
[TABLE]
Step holds, since by invoking the inequality , we have
[TABLE]
To bound the expectation term in (C.2), we first break the dependence between and the samples . In detail, we replace by any in the -covering, and observe that . For notational simplicity, we denote . Applying Cauchy-Schwarz inequality, we cast the expectation term in (C.2) as
[TABLE]
For given , each term is sub-Gaussian with parameter . Consequently, the last inequality (C.3) involves the maximum of a collection of squared sub-Gaussian random variables . Indeed, is sub-exponential for each . We can bound it using the moment generating function: For any , we have
[TABLE]
Since is -sub-Gaussian given , we derive
[TABLE]
Taking and substituting into (C.4), we deduce is bounded by
[TABLE]
Combining (C.5), (C.3), (C.2), and substituting back into (C.1), we obtain the following implicit error estimation on :
[TABLE]
We denote . Then the above implicit bound on implies
[TABLE]
Rearranging (C.6) for , we deduce . Some manipulation then yields , which implies
[TABLE]
The proof is complete. ∎
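As a numerical sanity check on the moment-generating-function step in the proof above, the following minimal sketch compares a Monte Carlo estimate of the expected maximum of squared Gaussian variables with the bound obtained from Jensen's inequality and a sum over moment generating functions. It assumes exactly Gaussian variables $z_j \sim N(0, \sigma^2)$, a special case of sub-Gaussian ones, so that $\mathbb{E}\exp(t z_j^2) = (1 - 2t\sigma^2)^{-1/2}$ is available in closed form for $t < 1/(2\sigma^2)$. The function names, the choice $t = 1/(4\sigma^2)$, and the resulting constants are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def mgf_bound_max_sq_gaussian(num_vars, sigma, t=None):
    """Upper bound on E[max_j z_j^2] for z_j ~ N(0, sigma^2).

    Uses E[max_j z_j^2] <= (1/t) * log(sum_j E[exp(t * z_j^2)]),
    which follows from Jensen's inequality and max <= sum.
    Independence of the z_j is not required.
    """
    if t is None:
        t = 1.0 / (4.0 * sigma ** 2)  # illustrative choice, not the paper's
    if not 0.0 < t < 1.0 / (2.0 * sigma ** 2):
        raise ValueError("t must lie in (0, 1 / (2 sigma^2))")
    # log E[exp(t z^2)] for a centered Gaussian with variance sigma^2
    log_mgf = -0.5 * math.log(1.0 - 2.0 * t * sigma ** 2)
    return (math.log(num_vars) + log_mgf) / t


def simulate_max_sq_gaussian(num_vars, sigma, trials=2000, seed=0):
    """Monte Carlo estimate of E[max_j z_j^2] over independent Gaussians."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, sigma) ** 2 for _ in range(num_vars))
    return total / trials


if __name__ == "__main__":
    for num_vars in (10, 100, 1000):
        empirical = simulate_max_sq_gaussian(num_vars, sigma=1.0, trials=500)
        bound = mgf_bound_max_sq_gaussian(num_vars, sigma=1.0)
        print(num_vars, round(empirical, 3), round(bound, 3))
```

The bound grows like the logarithm of the number of variables, which is why the logarithm of the covering number, rather than the covering number itself, enters the final estimate.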
C.2 Proof of Lemma 6
Proof.
Recall that we denote . We rewrite as
[TABLE]
We lower bound by its second moment:
[TABLE]
The last inequality follows from . Now we cast into
[TABLE]
Introducing the second moment allows us to establish a fast convergence of . Specifically, we denote ’s as independent copies of ’s following the same distribution. We also denote
[TABLE]
as the function class induced by . Then we upper bound (C.7) as
[TABLE]
where follows from Jensen’s inequality and shorthand denotes the expectation (double integral ) with respect to the joint distribution of .
We discretize with respect to the norm. The -covering number is denoted as and the elements of the covering are denoted as , that is, for any , there exists a satisfying .
We replace by in bounding , which then boils down to deriving concentration results on a finite concept class. Specifically, for satisfying , we have
[TABLE]
We also have
[TABLE]
Plugging the above two items into (C.8), we upper bound as
[TABLE]
Denote . By symmetry, it is straightforward to see . The variance of is computed as
[TABLE]
The last inequality utilizes the identity . Therefore, we derive the following upper bound for ,
[TABLE]
We invoke the moment generating function to bound . Note that we have . Then by Taylor expansion, for and any , we have
[TABLE]
Step follows from the fact for . Given (C.2), we proceed to bound . To ease the presentation, we temporarily neglect term and denote . Then for , we have
[TABLE]
Step follows from Jensen’s inequality, and step invokes (C.2) for each . We now choose so that , which yields . Substituting our choice of into , we have
[TABLE]
To complete the proof, we relate the covering number of to that of . Consider any with and , respectively, for . We can derive
[TABLE]
The above characterization immediately implies . Therefore, we derive the desired upper bound on :
[TABLE]
∎
C.3 Proof of Lemma 7
Proof.
To construct a covering for , we discretize each weight parameter by a uniform grid with grid size . Recall we write as . Let with all the weight parameters at most from each other. Denoting the weight matrices in as and , respectively, we bound the difference as
[TABLE]
We derive the following bound on :
[TABLE]
where is obtained by induction and . The last inequality holds, since . Substituting back into the bound for , we have
[TABLE]
where is obtained by induction. We choose satisfying . Then discretizing each parameter uniformly into grid points yields a -covering on . Note that there are different choices of non-zero entries out of total weight parameters. Therefore, the covering number is upper bounded by
[TABLE]
∎
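The counting argument in the proof above can be sketched numerically as follows. The sketch assumes a generic bound of the form (number of ways to place the non-zero entries) times (number of grid points per non-zero weight) raised to the number of non-zero weights. The function names, the grid convention for weights in a symmetric interval, and the example values are hypothetical and do not reproduce the exact constants of Lemma 7.

```python
import math


def log_binomial(total, chosen):
    """log of the binomial coefficient C(total, chosen), computed via lgamma."""
    return (math.lgamma(total + 1)
            - math.lgamma(chosen + 1)
            - math.lgamma(total - chosen + 1))


def grid_points_for(weight_bound, grid_size):
    """Number of grid points covering [-weight_bound, weight_bound] at spacing grid_size.

    Hypothetical convention: both endpoints are included.
    """
    return math.ceil(2.0 * weight_bound / grid_size) + 1


def log_covering_bound(num_nonzero, total_params, grid_points):
    """log of C(total_params, num_nonzero) * grid_points ** num_nonzero."""
    return (log_binomial(total_params, num_nonzero)
            + num_nonzero * math.log(grid_points))


if __name__ == "__main__":
    # Hypothetical sizes: 50 non-zero weights among 1000, weights in [-1, 1].
    points = grid_points_for(weight_bound=1.0, grid_size=1e-3)
    print(log_covering_bound(num_nonzero=50, total_params=1000, grid_points=points))
```

Working with the logarithm keeps the computation stable for large networks and matches how the covering number enters the variance bound.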
C.4 Proof of Theorem 2 — Bias-Variance Trade-off
Proof.
We recall the bias and variance decomposition of as
[TABLE]
Combining the upper bounds on and in Lemmas 5 and 6, we can derive
[TABLE]
By our choice of , there exists a network class which can yield a function satisfying for . We will choose later for the bias-variance trade-off. Such a network consists of layers and weight parameters. Invoking the upper bound of the covering number in Lemma 7, we derive
[TABLE]
Now we choose to satisfy , which gives . It suffices to pick . Substituting both and into (C.10), we deduce the desired estimation error bound
[TABLE]
where the constant depends on , , , , , the surface area of , and the upper bounds on the derivatives of the coordinate systems ’s and the partition of unity ’s, up to order . ∎
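The trade-off above can be illustrated with a short calculation. The sketch below assumes, with logarithmic factors ignored, that the squared approximation error scales like $\varepsilon^2$ and the variance term like $\varepsilon^{-d/(s+\beta)}/n$; balancing the two gives $\varepsilon = n^{-(s+\beta)/(2(s+\beta)+d)}$ and the rate $n^{-2(s+\beta)/(2(s+\beta)+d)}$ stated in the paper's summary. The function names and the example values are illustrative assumptions, and `smoothness` stands for $s+\beta$.

```python
def balanced_accuracy(n, smoothness, dim):
    """Accuracy eps solving eps^2 = eps^(-dim/smoothness) / n (log factors ignored)."""
    return n ** (-smoothness / (2.0 * smoothness + dim))


def rate_exponent(smoothness, dim):
    """Exponent r such that the mean squared error scales like n^(-r)."""
    return 2.0 * smoothness / (2.0 * smoothness + dim)


def samples_needed(target_mse, smoothness, dim):
    """Sample size n at which n^(-r) equals target_mse (constants ignored)."""
    return target_mse ** (-1.0 / rate_exponent(smoothness, dim))


if __name__ == "__main__":
    smoothness = 2.0  # plays the role of s + beta
    for dim in (4, 100):  # an intrinsic dimension versus a large ambient dimension
        print(dim, rate_exponent(smoothness, dim),
              samples_needed(0.01, smoothness, dim))
```

With these hypothetical values the exponent is 0.5 for dimension 4 and about 0.04 for dimension 100, which shows why a rate governed by the intrinsic dimension rather than the ambient one matters in practice.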
