Better Approximations of High Dimensional Smooth Functions by Deep Neural Networks with Rectified Power Units

Bo Li; Shanshan Tang; Haijun Yu

arXiv:1903.05858·math.NA·February 28, 2020


TL;DR

This paper demonstrates that deep neural networks with rectified power units (RePU) can approximate smooth functions more efficiently than ReLU networks, requiring smaller network sizes and offering better stability and approximation properties.

Contribution

The paper introduces a novel approach using RePU activations for better approximation of smooth functions, with constructive algorithms and theoretical analysis showing improved efficiency over ReLU networks.

Findings

01

RePU networks are $\mathcal{O}(\log(1/\varepsilon))$ times smaller than ReLU networks for the same accuracy.

02

RePU networks are numerically more stable and use fewer activation functions than classical methods.

03

RePU networks naturally fit smooth functions involving derivatives, enhancing their application in derivative-based loss functions.

Abstract

Deep neural networks with rectified linear units (ReLU) are becoming increasingly popular due to their universal representation power and successful applications. Some theoretical progress regarding the approximation power of deep ReLU networks for functions in Sobolev space and Korobov space has recently been made by [D. Yarotsky, Neural Networks, 94:103-114, 2017] and [H. Montanelli and Q. Du, SIAM J Math. Data Sci., 1:78-92, 2019], etc. In this paper, we show that deep networks with rectified power units (RePU) can give better approximations for smooth functions than deep ReLU networks. Our analysis is based on classical polynomial approximation theory and some efficient algorithms proposed in this paper to convert polynomials into deep RePU networks of optimal size with no approximation error. Compared to the results on ReLU networks, the sizes of RePU networks required to approximate…


Tables

Table 1: Representation of monomials $x^{n}$ .

Degree $n$	$L$	# Weights	# Nodes	$L^{\infty}$ error
3	3	38	10	4.44e-16
7	4	64	15	2.22e-16
15	5	89	20	9.99e-16
31	6	114	25	7.77e-16
63	7	139	30	6.11e-16
127	8	164	35	2.22e-16

Table 2: Representation of univariate polynomials of degree $n$ .

Degree $n$	$L$	# Weights	# Nodes	$L^{\infty}$ error
3	3	66	14	1.78e-15
7	4	188	31	1.78e-15
15	5	429	64	4.44e-15
31	6	910	129	5.33e-15
63	7	1871	258	5.33e-15
127	8	3792	515	5.33e-15

Table 3: Representation of polynomials in tensor-product space $Q_{N}^{2}$ .

Degree $N$	$L$	# Weights	# Nodes	$L^{\infty}$ error
3	5	378	64	1.11e-15
7	7	1570	246	8.88e-15
15	9	6376	988	1.60e-14
31	11	25758	4002	7.11e-14
63	13	103668	16168	8.88e-14

Table 4: Representation of polynomials in hyperbolic cross polynomial space.

Degree $N$	$L$	# Weights	# Nodes	$L^{\infty}$ error
7	7	1254	217	3.55e-15
15	9	3277	554	1.24e-14
31	11	8022	1351	5.32e-14
63	13	19039	3196	2.24e-14
127	15	44052	7393	4.26e-14

Equations

$$\sigma_{s}(x)=\begin{cases}x^{s}, & x\geq 0,\\ 0, & x<0,\end{cases}\qquad s\in\mathbb{N}_{0},$$

$$\Phi=\big((A_{1},b_{1}),\cdots,(A_{L},b_{L})\big),$$

$$R_{\rho}(\Phi):\mathbb{R}^{d}\to\mathbb{R}^{N_{L}},\qquad R_{\rho}(\Phi)(\bm{x})=\bm{x}_{L},$$

$$\begin{cases}\bm{x}_{0}:=\bm{x},\\ \bm{x}_{k}:=\rho(A_{k}\bm{x}_{k-1}+b_{k}), & k=1,2,\dots,L-1,\\ \bm{x}_{L}:=A_{L}\bm{x}_{L-1}+b_{L},\end{cases}$$

$$\rho(\bm{y}):=(\rho(y^{1}),\dots,\rho(y^{m})),\qquad\forall\,\bm{y}=(y^{1},\dots,y^{m})\in\mathbb{R}^{m}.$$

$$x^{2}=\beta_{2}^{T}\sigma_{2}(\omega_{2}x),\qquad x=\beta_{1}^{T}\sigma_{2}(\omega_{1}x+\gamma_{1}),\qquad xy=\beta_{1}^{T}\sigma_{2}(\omega_{1}x+\gamma_{1}y),$$

$$\beta_{2}=[1,1]^{T},\quad\omega_{2}=[1,-1]^{T},\quad\beta_{1}=\tfrac{1}{4}[1,1,-1,-1]^{T},\quad\omega_{1}=[1,-1,1,-1]^{T},\quad\gamma_{1}=[1,-1,-1,1]^{T}.$$

$$x^{2}=\sigma_{2}(x),\qquad xy=\beta_{3}^{T}\sigma_{2}(\omega_{3}x+\gamma_{2}y),$$

$$\beta_{3}=\tfrac{1}{4}[1,-1,-1]^{T},\quad\omega_{3}=[1,1,-1]^{T},\quad\gamma_{2}=[1,-1,1]^{T}.$$

$$x=(x+1/2)^{2}-x^{2}-1/4=\beta_{2}^{T}\sigma_{2}(\omega_{2}(x+1/2))-\beta_{2}^{T}\sigma_{2}(\omega_{2}x)-1/4,$$

$$x=(x+1/2)^{2}-x^{2}-1/4=\sigma_{2}(x+1/2)-\sigma_{2}(x)-1/4,$$

$$f_{N}(x)=\sum_{k=1}^{N}c_{k}\sigma_{2}(a_{k}x+b_{k})+d,$$

$$n=a_{m}\cdot 2^{m}+a_{m-1}\cdot 2^{m-1}+\dots+a_{1}\cdot 2+a_{0},$$

$$x^{n}=x^{2^{m}}\cdot x^{\sum_{j=0}^{m-1}a_{j}2^{j}}.$$

$$\xi_{k}^{(1)}:=x^{2^{k}},\qquad\xi_{k}^{(2)}:=x^{\sum_{j=0}^{k-1}a_{j}2^{j}},\qquad\text{for } 1\leq k\leq m,$$

$$x^{n}=\xi_{m}^{(1)}\xi_{m}^{(2)}.$$

$$\begin{cases}\xi_{1}^{(1)}=x^{2}, & \xi_{1}^{(2)}=x^{a_{0}},\\ \xi_{k}^{(1)}=\big(\xi_{k-1}^{(1)}\big)^{2}, & \xi_{k}^{(2)}=\big(\xi_{k-1}^{(1)}\big)^{a_{k-1}}\xi_{k-1}^{(2)},\quad 2\leq k\leq m,\end{cases}$$

$$x^{a_{j}}y=\Big(\dfrac{1+(-1)^{a_{j}}}{2}+\dfrac{1-(-1)^{a_{j}}}{2}x\Big)y=\beta_{1}^{T}\sigma_{2}\left(\omega_{1}(c_{j}^{+}+c_{j}^{-}x)+\gamma_{1}y\right),$$

$$\xi_{1}^{(1)}=x^{2}=\beta_{2}^{T}\sigma_{2}(\omega_{2}x)\geq 0,$$

$$\xi_{1}^{(2)}=x^{a_{0}}=c_{0}^{+}+c_{0}^{-}x=c_{0}^{+}+c_{0}^{-}\beta_{1}^{T}\sigma_{2}(\omega_{1}x+\gamma_{1}),$$

$$\bm{x}_{1}=\sigma_{2}(A_{1}x+b_{1}),\quad\text{where}\quad A_{1}=\begin{bmatrix}\omega_{2}\\ \omega_{1}\end{bmatrix}_{6\times 1},\quad b_{1}=\begin{bmatrix}\bm{0}\\ \gamma_{1}\end{bmatrix}_{6\times 1},$$

$$\begin{bmatrix}\xi_{1}^{(1)}\\ \xi_{1}^{(2)}\end{bmatrix}=A_{20}\bm{x}_{1}+b_{20},\quad\text{where}\quad A_{20}=\begin{bmatrix}\beta_{2}^{T}&\bm{0}\\ \bm{0}&c_{0}^{-}\beta_{1}^{T}\end{bmatrix}_{2\times 6},\quad b_{20}=\begin{bmatrix}0\\ c_{0}^{+}\end{bmatrix}_{2\times 1}.$$

$$\xi_{j}^{(1)}=\big(\xi_{j-1}^{(1)}\big)^{2}=\sigma_{2}\big(\xi_{j-1}^{(1)}\big),\qquad\xi_{j}^{(2)}=\big(\xi_{j-1}^{(1)}\big)^{a_{j-1}}\xi_{j-1}^{(2)}=\beta_{1}^{T}\sigma_{2}\big(\omega_{1}(c_{j-1}^{+}+c_{j-1}^{-}\xi_{j-1}^{(1)})+\gamma_{1}\xi_{j-1}^{(2)}\big),$$

$$\bm{x}_{j}=\sigma_{2}\left(A_{j1}\begin{bmatrix}\xi_{j-1}^{(1)}\\ \xi_{j-1}^{(2)}\end{bmatrix}+b_{j1}\right),\quad A_{j1}=\begin{bmatrix}1&0\\ c_{j-1}^{-}\omega_{1}&\gamma_{1}\end{bmatrix}_{5\times 2},\quad b_{j1}=\begin{bmatrix}0\\ c_{j-1}^{+}\omega_{1}\end{bmatrix}_{5\times 1},$$

$$\begin{bmatrix}\xi_{j}^{(1)}\\ \xi_{j}^{(2)}\end{bmatrix}=A_{j+1,0}\bm{x}_{j}+b_{j+1,0},\quad\text{where}\quad A_{j+1,0}=\begin{bmatrix}1&\bm{0}\\ 0&\beta_{1}^{T}\end{bmatrix}_{2\times 5},\quad b_{j+1,0}=\bm{0}.$$

$$A_{j}=A_{j1}A_{j0},\qquad b_{j}=A_{j1}b_{j0}+b_{j1},\qquad j=2,\dots,m.$$

$$x^{n}=\xi_{m}^{(1)}\xi_{m}^{(2)}=\beta_{1}^{T}\sigma_{2}(\omega_{1}\xi_{m}^{(1)}+\gamma_{1}\xi_{m}^{(2)}),$$

$$\bm{x}_{m+1}=\sigma_{2}\left(A_{m+1,1}\begin{bmatrix}\xi_{m}^{(1)}\\ \xi_{m}^{(2)}\end{bmatrix}\right),\quad\text{where}\quad A_{m+1,1}=\begin{bmatrix}\omega_{1}&\gamma_{1}\end{bmatrix}_{4\times 2}.$$


Full text

† The first two authors contributed equally. Author list is alphabetical.

‡ The work of this author was partially done during her Ph.D. study at the Academy of Mathematics and Systems Science, Chinese Academy of Sciences.

Better Approximations of High Dimensional Smooth Functions by Deep Neural Networks with Rectified Power Units

Bo Li (1,2,†), Shanshan Tang (3,†,‡) and Haijun Yu (1,2,*)

(1) NCMIS & LSEC, Institute of Computational Mathematics and Scientific/Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China

(2) School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China

(3) China Justice Big Data Institute, Beijing 100043, China

(*) Corresponding author. Emails: [email protected] (B. Li), [email protected] (S. Tang), [email protected] (H. Yu)

Abstract

Deep neural networks with rectified linear units (ReLU) are becoming increasingly popular due to their universal representation power and successful applications. Some theoretical progress regarding the approximation power of deep ReLU networks for functions in Sobolev space and Korobov space has recently been made by [D. Yarotsky, Neural Networks, 94:103-114, 2017] and [H. Montanelli and Q. Du, SIAM J Math. Data Sci., 1:78-92, 2019], etc. In this paper, we show that deep networks with rectified power units (RePU) can give better approximations for smooth functions than deep ReLU networks. Our analysis is based on classical polynomial approximation theory and some efficient algorithms proposed in this paper to convert polynomials into deep RePU networks of optimal size with no approximation error. Compared to the results on ReLU networks, the sizes of RePU networks required to approximate functions in Sobolev space and Korobov space with an error tolerance $\varepsilon$ , by our constructive proofs, are in general $\mathcal{O}(\log\frac{1}{\varepsilon})$ times smaller than the sizes of the corresponding ReLU networks constructed in most of the existing literature. Compared to the classical results of Mhaskar [Mhaskar, Adv. Comput. Math. 1:61-80, 1993], our constructions use fewer activation functions and are numerically more stable; they can serve as good initializations of deep RePU networks and be further trained to break the limit of linear approximation theory. The functions represented by RePU networks are smooth functions, so they naturally fit in the places where derivatives are involved in the loss function.

Keywords: deep neural network, high dimensional approximation, sparse grids, rectified linear unit, rectified power unit, rectified quadratic unit

AMS subject classifications: 65D15, 65M12, 65M15

1 Introduction

The artificial neural network (ANN), whose origin may date back to the 1940s [1], is one of the most powerful tools in the field of machine learning. In particular, it became dominant in many applications after the seminal works by Hinton et al. [2] and Bengio et al. [3] on efficient training of deep neural networks (DNNs), which stack multiple layers of units with some nonlinear activation function. Since then, DNNs have greatly boosted developments in different areas including image classification, speech recognition, computational chemistry, and numerical solutions of high-dimensional partial differential equations and scientific problems; see e.g. [4][5][6][7][8][9][10][11][12], to name a few.

The success of DNNs relies on two facts: 1) a DNN is a powerful tool for general function approximation; 2) efficient training methods are available to find minimizers with good generalization ability. In this paper, we focus on the first fact. It is known that artificial neural networks can approximate any $C^{0}$ and $L^{1}$ functions with any given error tolerance, using only one hidden layer (see e.g. [13][14]). However, it was realized recently that deep networks have better representation power (see e.g. [15][16][17]) than shallow networks. One of the commonly used activation functions with DNNs is the so-called rectified linear unit (ReLU) [18], which is defined as $\sigma(x)=\max(0,x)$ . Telgarsky [16] gave a simple and elegant construction showing that for any $k$ , there exist $k$ -layer, $\mathcal{O}(1)$ wide ReLU networks on one-dimensional data, which can express a sawtooth function on $[0,1]$ with $\mathcal{O}(2^{k})$ oscillations. Moreover, such a rapidly oscillating function cannot be approximated by poly $(k)$ -wide ReLU networks with $o(k/\log(k))$ depth. Following this approach, several other works proved that deep ReLU networks have better approximation power than shallow ReLU networks [19][20][21][22]. In particular, for $C^{\beta}$ -differentiable $d$ -dimensional functions, Yarotsky [21] proved that the number of parameters needed to achieve an error tolerance of $\varepsilon$ is $\mathcal{O}(\varepsilon^{-\frac{d}{\beta}}\log\frac{1}{\varepsilon})$ . Petersen and Voigtlaender [22] proved that for a class of $d$ -dimensional piecewise $C^{\beta}$ continuous functions whose discontinuity interfaces are also $C^{\beta}$ continuous, one can construct a ReLU neural network with $\mathcal{O}((1+\frac{\beta}{d})\log_{2}(2+\beta))$ layers and $\mathcal{O}(\varepsilon^{-\frac{2(d-1)}{\beta}})$ nonzero weights to achieve $\varepsilon$ -approximation. The complexity bound is sharp.
For analytic functions, E and Wang [23] proved that using ReLU networks with fixed width $d+4$ , to achieve an error tolerance of $\varepsilon$ , the depth of the network depends on $\log\frac{1}{\varepsilon}$ instead of $\varepsilon$ itself. We also want to mention that the detailed relations between ReLU networks and linear finite elements have been studied by He et al. [24]. Recent work by Opschoor, Petersen and Schwab [25] reveals the connection between ReLU DNNs and high-order finite element methods.

One basic fact on deep ReLU networks is that the function $x^{2}$ can be approximated within any error $\varepsilon>0$ by a ReLU network having the depth, the number of weights and the number of computation units all of order $\mathcal{O}(\log\frac{1}{\varepsilon})$ . This fact has been used by several groups (see e.g. [19][21]) to analyze the approximation property of general smooth functions using ReLU networks. In this paper, we extend the analysis to deep neural networks using rectified power units (RePUs), which are defined as

$$\sigma_{s}(x)=\begin{cases}x^{s}, & x\geq 0,\\ 0, & x<0,\end{cases}\qquad s\in\mathbb{N}_{0},\tag{1}$$

where $\mathbb{N}_{0}$ denotes the set of non-negative integers. Note that $\sigma_{1}$ is the commonly used ReLU function and $\sigma_{0}$ is the binary step function. We call $\sigma_{2}$ and $\sigma_{3}$ the rectified quadratic unit (ReQU) and the rectified cubic unit, respectively. We show that deep neural networks using RePUs ( $s\geq 2$ ) as activation functions have better approximation properties for smooth functions than those using ReLUs. By replacing ReLU with RePU ( $s\geq 2$ ), the functions $x$ , $x^{2}$ and $xy$ can be exactly represented with no approximation error using networks having just a few nodes and nonzero weights. Based on this, we build efficient algorithms to explicitly convert functions from a polynomial space into RePU networks having approximately the same number of coefficients. This allows us to obtain a better upper bound of the best neural network approximation for general smooth functions using classical polynomial approximation theories. Note that $\sigma_{s}$ networks have been used in the classic works by Mhaskar and his coworkers (see e.g. [26][27][28]), where by converting spline approximations into $\sigma_{s}$ DNNs, quasi-optimal theoretical upper bounds of function approximation are obtained. However, their constructions of neural networks are not optimal for very smooth functions (the case $k\gg s$ ); the error bound obtained is quasi-optimal due to an extra $\log_{s}(k)$ factor, where $k$ is related to the smoothness of the underlying functions. Moreover, no numerically efficient and stable algorithm is presented there. In this paper, we present numerically stable and efficient constructions of RePU network representations of polynomials, which result in RePU networks of a different structure and remove the extra $\log_{s}(k)$ factor in the approximation bounds.
After this paper was posted on arXiv, RePU networks and our optimal network constructions were adopted by other authors; e.g., by using RePU networks instead of ReLU networks, a sharper bound for approximating holomorphic maps in high dimension was obtained by Opschoor, Schwab and Zech [29].
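The definition of $\sigma_{s}$ in (1) can be sketched in a few lines of NumPy. This is only an illustration; the function name `repu` is an assumption and does not come from the paper.

```python
import numpy as np

def repu(x, s):
    """Rectified power unit sigma_s: x**s for x >= 0 and 0 for x < 0."""
    x = np.asarray(x, dtype=float)
    # Clip before taking the power so that negative inputs never reach it.
    return np.where(x >= 0, np.maximum(x, 0.0) ** s, 0.0)

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
step = repu(xs, 0)   # sigma_0: binary step function
relu = repu(xs, 1)   # sigma_1: ReLU
requ = repu(xs, 2)   # sigma_2: rectified quadratic unit (ReQU)
```

Note that with this definition $\sigma_{0}(0)=1$, since the branch $x\geq 0$ includes the origin.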

For high-dimensional problems to be tractable, the intrinsic dimension usually does not grow as fast as the observation dimension. In other words, the problems have low-dimensional structure. A particular example is the class of high-dimensional smooth functions with bounded mixed derivatives, for which sparse grid (or hyperbolic cross) approximation is a very popular approximation tool [30][31][32][33][34]. In the past few decades, the sparse grid method and hyperbolic cross approximations have found many applications, such as numerical integration and interpolation [30][35][36][37], solving partial differential equations (PDEs) [38][39][40][41][42][43], computational chemistry [32][44][45][46], uncertainty quantification [47][48][49], etc. For high-dimensional problems, we derive upper bounds of the RePU DNN approximation error by converting sparse grid and hyperbolic cross spectral approximations into RePU networks. Our work is inspired by the recent work of Montanelli and Du [50], where the connection between linear finite element sparse grids and deep ReLU neural networks is established. In this paper, we approximate multivariate functions in high order Korobov space using sparse grid Chebyshev interpolation [36] for the interpolation problem, and using hyperbolic cross spectral approximation for the projection problem [33][40]. Then, we convert the high-dimensional polynomial approximations into ReQU networks, instead of ReLU networks, to avoid adding an extra factor $\log\frac{1}{\varepsilon}$ in the size of the neural network.

In summary, we find that RePU networks have the following good properties:

• RePU neural networks provide better approximations for sufficiently smooth functions than ReLU neural networks. To achieve the same accuracy, the RePU network approximation we construct needs fewer layers and a smaller network size than existing ReLU neural network approximations. For example, for a function whose partial derivatives are all bounded uniformly, independent of the derivative order, we can construct a ReQU network with no more than $\mathcal{O}\left(\log_{2}\left(\log\frac{1}{\varepsilon}\right)\right)$ layers and no more than $\mathcal{O}\big(\frac{\log\left(1/\varepsilon\right)}{\log(\log 1/\varepsilon)}\big)$ nonzero weights to approximate it with error $\varepsilon$ . More results are given in Theorems 2.12, 3.4 and 4.4.

• The functions represented by RePU ( $s\geq 2$ ) networks are smooth functions, so they naturally fit in the places where derivatives are involved in the loss function.

• Compared to other high-order differentiable activation functions, such as logistic, $\tanh$ , softplus, sinc, etc., RePUs are more efficient in terms of the number of arithmetic operations needed for evaluation, especially the rectified quadratic unit.

Based on the facts above, we advocate the use of deep RePU networks in places where the functions to be approximated are smooth.

The remaining part of this paper is organized as follows. In Section 2, we first show how to approximate univariate smooth functions using RePU networks by converting best polynomial approximations into RePU networks. Then we use a similar approach to analyze the ReQU network approximation for multivariate functions in weighted Sobolev space in Section 3. After that, we show how high-dimensional functions with sparse polynomial approximations can be well approximated by ReQU networks in Section 4. Some preliminary numerical results are given in Section 5. We end the paper by a short summary in Section 6.

2 Approximation of univariate functions by deep RePU networks

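We first introduce some notation related to neural networks. Denote by $\mathbb{N}$ the set of all positive integers, $\mathbb{N}_{0}:=\{0\}\cup\mathbb{N}$ . Given $d,L\in\mathbb{N}$ , we denote a neural network $\Phi$ with input of dimension $d$ and number of layers $L$ by a matrix-vector sequence

$$\Phi=\big((A_{1},b_{1}),\cdots,(A_{L},b_{L})\big),\tag{2}$$

where $N_{0}=d$ , $N_{1},\cdots,N_{L}\in\mathbb{N}$ , $A_{k}$ are $N_{k}\times N_{k-1}$ matrices, and $b_{k}\in\mathbb{R}^{N_{k}}$ . If $\Phi$ is a neural network, and $\rho:\mathbb{R}\to\mathbb{R}$ is an arbitrary activation function, then define

$$R_{\rho}(\Phi):\mathbb{R}^{d}\to\mathbb{R}^{N_{L}},\qquad R_{\rho}(\Phi)(\bm{x})=\bm{x}_{L},$$

where $R_{\rho}(\Phi)(\bm{x})$ is given as

$$\begin{cases}\bm{x}_{0}:=\bm{x},\\ \bm{x}_{k}:=\rho(A_{k}\bm{x}_{k-1}+b_{k}), & k=1,2,\dots,L-1,\\ \bm{x}_{L}:=A_{L}\bm{x}_{L-1}+b_{L},\end{cases}$$

and

$$\rho(\bm{y}):=(\rho(y^{1}),\dots,\rho(y^{m})),\qquad\forall\,\bm{y}=(y^{1},\dots,y^{m})\in\mathbb{R}^{m}.$$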

We use three quantities to measure the complexity of the neural network: number of hidden layers, number of nodes (i.e. activation units), and number of nonzero weights, which are $L-1$ , $\sum_{k=1}^{L-1}N_{k}$ and number of non-zeros in $\{(A_{k},b_{k}),k=1,\ldots,L\}$ , respectively, for the neural network defined in (2). For convenience, we denote by $\#A$ the number of nonzero components in $A$ for a given matrix or vector $A$ . For the neural network $\Phi$ defined in (2), we also denote its number of nonzero weights as $\#\Phi:=\sum_{k=1}^{L}(\#A_{k}+\#b_{k})$ .
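As a minimal sketch of these definitions, the realization $R_{\rho}(\Phi)$ and the three complexity measures can be written as follows. The names `realize` and `complexity`, and the representation of $\Phi$ as a Python list of `(A, b)` pairs, are assumptions made for illustration.

```python
import numpy as np

def realize(layers, rho, x):
    """Evaluate R_rho(Phi)(x) for Phi = [(A_1, b_1), ..., (A_L, b_L)]."""
    h = np.asarray(x, dtype=float)
    for A, b in layers[:-1]:
        h = rho(A @ h + b)          # hidden layers apply rho componentwise
    A, b = layers[-1]
    return A @ h + b                # the last layer is affine, no activation

def complexity(layers):
    """Return (hidden layers, nodes, nonzero weights) as defined in the text."""
    hidden = len(layers) - 1
    nodes = sum(A.shape[0] for A, _ in layers[:-1])
    weights = sum(np.count_nonzero(A) + np.count_nonzero(b) for A, b in layers)
    return hidden, nodes, weights

# Example network: one hidden layer of two ReQU units computing x^2.
requ = lambda z: np.maximum(z, 0.0) ** 2
phi = [(np.array([[1.0], [-1.0]]), np.zeros(2)),
       (np.array([[1.0, 1.0]]), np.zeros(1))]
value = realize(phi, requ, [3.0])
counts = complexity(phi)
```

Here `counts` reports one hidden layer, two nodes and four nonzero weights for the example network.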

In this paper we study the approximation property of smooth functions by deep neural networks with RePUs as activation units. It seems that $\sigma_{s}$ networks were first used in the classic works by Mhaskar and his coworkers (see e.g. [26], [27]) to obtain high-order convergence of neural network approximation. $\sigma_{s}$ is also a special case of piece-wise polynomial activation function, which has been studied in [51] for shallow network approximation. We also note that $\sigma_{3}$ has been used in a deep Ritz method proposed recently to solve PDEs using variational form [52].

The construction of RePU networks adopted by Mhaskar is based on the fact that a polynomial of degree $n$ in $d$ dimensions can be represented by a linear combination of $\binom{n+d}{d}$ monomials of the form $\big(Ax+b\big)^{n}$ , each one using a different affine transform. To represent a polynomial of degree $n$ using a $\sigma_{s}$ neural network, they first compose $\sigma_{s}(x)$ with itself $k=\lceil\log_{s}n\rceil$ times, which results in $\sigma_{s^{k}}(x)$ . Then a neural network with one layer of $\binom{n+d}{d}$ units of type $\sigma_{s^{k}}(x)$ is capable of exactly representing any polynomial of degree $n$ . This kind of construction gives an optimal linear approximation result for neural networks using high-order (the order is $s^{k}$ ) sigmoidal activation functions. However, if we regard the constructed neural network as a $\sigma_{s}$ neural network, it has $k$ hidden layers. The corresponding linear approximation bound is quasi-optimal due to this factor $k$ . Moreover, to find the network coefficients that represent a given polynomial, one needs to solve a linear system with a Vandermonde-like matrix, whose condition number is known to grow geometrically (see e.g. [53]). In this paper, we propose a different approach which does not involve any Vandermonde matrix of large size.
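The conditioning issue can be illustrated numerically. The sketch below uses a plain Vandermonde matrix on equispaced nodes in $[-1,1]$, which is an assumption chosen for illustration and not the specific Vandermonde-like matrix of Mhaskar's construction.

```python
import numpy as np

def vandermonde_condition(n):
    """2-norm condition number of the (n+1) x (n+1) Vandermonde matrix
    built on equispaced nodes in [-1, 1] (an illustrative choice)."""
    nodes = np.linspace(-1.0, 1.0, n + 1)
    return np.linalg.cond(np.vander(nodes))

conds = {n: vandermonde_condition(n) for n in (4, 8, 12, 16)}
for n, c in conds.items():
    print(f"n = {n:2d}, cond = {c:.3e}")
```

The printed condition numbers increase rapidly with $n$, which is why solving such systems becomes unreliable for polynomials of high degree.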

2.1 Approximation by deep ReQU networks

Our analysis relies upon the fact that $x$ , $x^{2}$ , $\ldots$ , $x^{s}$ , and $xy$ can all be realized by $\sigma_{s}$ neural networks with a small number of coefficients. We first give the result for the case $s=2$ .

Lemma 2.1.

For any $x,y\in\mathbb{R}$ , the following identities hold:

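$$x^{2}=\beta_{2}^{T}\sigma_{2}(\omega_{2}x),\tag{5}$$

$$x=\beta_{1}^{T}\sigma_{2}(\omega_{1}x+\gamma_{1}),\tag{6}$$

$$xy=\beta_{1}^{T}\sigma_{2}(\omega_{1}x+\gamma_{1}y),\tag{7}$$

where

$$\beta_{2}=[1,1]^{T},\quad\omega_{2}=[1,-1]^{T},\quad\beta_{1}=\tfrac{1}{4}[1,1,-1,-1]^{T},\quad\omega_{1}=[1,-1,1,-1]^{T},\quad\gamma_{1}=[1,-1,-1,1]^{T}.\tag{8}$$

If both $x$ and $y$ are non-negative, the formulas for $x^{2}$ and $xy$ can be simplified to the following form

$$x^{2}=\sigma_{2}(x),\tag{9}$$

$$xy=\beta_{3}^{T}\sigma_{2}(\omega_{3}x+\gamma_{2}y),\tag{10}$$

where

$$\beta_{3}=\tfrac{1}{4}[1,-1,-1]^{T},\quad\omega_{3}=[1,1,-1]^{T},\quad\gamma_{2}=[1,-1,1]^{T}.\tag{11}$$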

Proof 2.2.

All the identities can be obtained by straightforward calculations.
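The straightforward calculation can also be checked numerically. The following sketch evaluates the realizations of $x^{2}$, $x$ and $xy$ with the coefficient vectors $\beta_{1},\beta_{2},\beta_{3},\omega_{1},\omega_{2},\omega_{3},\gamma_{1},\gamma_{2}$ listed for Lemma 2.1; the Python function names are assumptions.

```python
import numpy as np

def requ(z):
    return np.maximum(z, 0.0) ** 2

BETA2, OMEGA2 = np.array([1.0, 1.0]), np.array([1.0, -1.0])
BETA1 = np.array([1.0, 1.0, -1.0, -1.0]) / 4.0
OMEGA1 = np.array([1.0, -1.0, 1.0, -1.0])
GAMMA1 = np.array([1.0, -1.0, -1.0, 1.0])
BETA3 = np.array([1.0, -1.0, -1.0]) / 4.0
OMEGA3 = np.array([1.0, 1.0, -1.0])
GAMMA2 = np.array([1.0, -1.0, 1.0])

def square(x):                 # x^2 = sigma_2(x) + sigma_2(-x)
    return BETA2 @ requ(OMEGA2 * x)

def identity(x):               # the product formula with y = 1
    return BETA1 @ requ(OMEGA1 * x + GAMMA1)

def product(x, y):             # xy = ((x + y)^2 - (x - y)^2) / 4
    return BETA1 @ requ(OMEGA1 * x + GAMMA1 * y)

def product_nonneg(x, y):      # three units suffice when x, y >= 0
    return BETA3 @ requ(OMEGA3 * x + GAMMA2 * y)

rng = np.random.default_rng(0)
samples = rng.uniform(-3.0, 3.0, size=(100, 2))
```

Each realization uses a single hidden layer: two units for $x^{2}$, four units for $x$ and $xy$, and three units for $xy$ when both arguments are non-negative.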

Note that the realizations given in Lemma 2.1 are not unique. For example, to realize $id_{\mathbb{R}}(x)=x$ , we may use

$$x=(x+1/2)^{2}-x^{2}-1/4=\beta_{2}^{T}\sigma_{2}(\omega_{2}(x+1/2))-\beta_{2}^{T}\sigma_{2}(\omega_{2}x)-1/4,$$

for general $x\in\mathbb{R}$ , and use

$$x=(x+1/2)^{2}-x^{2}-1/4=\sigma_{2}(x+1/2)-\sigma_{2}(x)-1/4,$$

for non-negative $x$ . To have a neat presentation, we will use (5)-(11) throughout this paper even though simpler realizations may exist for some special cases. We notice that the realization of the identity map $id_{\mathbb{R}}(x)$ given in (6) is a special case of (7) with $y=1$ . Furthermore, the constant function $1$ can be represented by a trivial network with $L=1$ and $A_{1}=0,b_{1}=1$ .
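The two alternative realizations of the identity map can be checked in the same way. This is a small sketch with assumed function names.

```python
import numpy as np

def requ(z):
    return np.maximum(z, 0.0) ** 2

BETA2, OMEGA2 = np.array([1.0, 1.0]), np.array([1.0, -1.0])

def identity_general(x):
    """x = (x + 1/2)^2 - x^2 - 1/4, each square realized with two ReQU units."""
    return BETA2 @ requ(OMEGA2 * (x + 0.5)) - BETA2 @ requ(OMEGA2 * x) - 0.25

def identity_nonneg(x):
    """For x >= 0 both squares reduce to a single ReQU unit each."""
    return requ(x + 0.5) - requ(x) - 0.25

grid = np.linspace(-4.0, 4.0, 33)
```

The general version uses four ReQU units, while the non-negative version uses two.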

Remark 2.3.

Notice that in [21, 22, 50], all the analyses rely on the fact that $x^{2}$ can be approximated to an error tolerance $\varepsilon$ by a deep ReLU network of complexity $\mathcal{O}(\log\frac{1}{\varepsilon})$ . In our approach, by replacing ReLU with ReQU, $x^{2}$ is represented with no error using a ReQU network with only one hidden layer and 2 hidden neurons. This simple replacement greatly simplifies the proofs of some existing deep neural network approximation bounds, improves the approximation rate and at the same time reduces the network complexity.

2.1.1 Optimal realizations of polynomials by deep ReQU networks with no error

The basic property of $\sigma_{2}$ given in Lemma 2.1 can be used to construct deep neural network representations of monomials and polynomials. We first show that the monomial $x^{n},n>2$ can be represented exactly by deep ReQU networks of finite size but not shallow ReQU networks.

Theorem 2.4.

A) The monomial $x^{n},n\in\mathbb{N}$ defined on $\mathbb{R}$ can be represented exactly by a $\sigma_{2}$ network. The number of network layers, number of hidden nodes and number of nonzero weights required to realize $x^{n}$ are at most $\lfloor\log_{2}n\rfloor+2$ , $5\lfloor\log_{2}n\rfloor+5$ and $25\lfloor\log_{2}n\rfloor+14$ , respectively. Here $\lfloor x\rfloor$ represents the largest integer not exceeding $x$ for $x\in\mathbb{R}$ .

B) For any $n>2$ , $x^{n}$ cannot be represented exactly by any ReQU network with fewer than $\lceil\log_{2}n\rceil$ hidden layers.

Proof 2.5.

1) We first prove part B. For a ReQU network with one hidden layer of $N$ activation units, one input and one output, the function represented by the network can be written as

$$f_{N}(x)=\sum_{k=1}^{N}c_{k}\sigma_{2}(a_{k}x+b_{k})+d,$$

where $d$ and $a_{k},b_{k},c_{k}$ , $k=1,\ldots,N$ are the parameters of the network. Obviously, $f_{N}$ is a piecewise polynomial with at most $N+1$ pieces in the intervals divided by the distinct points $x_{k}=-b_{k}/a_{k}$ , $k=1,\ldots,N$ (suppose the points are in ascending order). In each piece, $f_{N}$ is a polynomial of degree at most 2. Since a polynomial of degree at most 2 composed with another polynomial of degree at most 2 produces a polynomial of degree at most 4, a ReQU network with two hidden layers can only represent piecewise polynomials of degree at most 4. By induction, a ReQU network with $m$ hidden layers can only represent piecewise polynomials of degree at most $2^{m}$ . So, with $m<\lceil\log_{2}n\rceil$ , i.e. $2^{m}<n$ , a ReQU network with $m$ hidden layers cannot exactly represent $x^{n}$ .
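The lower bound of part B can be compared with the depth used in part A, where a network with at most $\lfloor\log_{2}n\rfloor+2$ layers has at most $\lfloor\log_{2}n\rfloor+1$ hidden layers. The sketch below computes both quantities with exact integer arithmetic; the function names are assumptions.

```python
def min_hidden_layers(n):
    """Lower bound of part B: ceil(log2(n)), computed without floating point."""
    return (n - 1).bit_length()

def constructed_hidden_layers(n):
    """Hidden layers allowed by part A: floor(log2(n)) + 1."""
    return n.bit_length()

gaps = {n: constructed_hidden_layers(n) - min_hidden_layers(n)
        for n in range(3, 130)}
```

Under these formulas the gap is zero unless $n$ is a power of two, in which case it is one, so the depth in part A is within one layer of the lower bound.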

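2) Now we give a constructive proof for part A. We first express $n$ in the binary system as follows:

$$n=a_{m}\cdot 2^{m}+a_{m-1}\cdot 2^{m-1}+\dots+a_{1}\cdot 2+a_{0},$$

where $a_{j}\in\{0,1\}$ for $j=0,1,\ldots,m-1$ , $a_{m}=1$ , and $m=\lfloor\log_{2}n\rfloor$ . Then

$$x^{n}=x^{2^{m}}\cdot x^{\sum_{j=0}^{m-1}a_{j}2^{j}}.$$

Introducing intermediate variables

$$\xi_{k}^{(1)}:=x^{2^{k}},\qquad\xi_{k}^{(2)}:=x^{\sum_{j=0}^{k-1}a_{j}2^{j}},\qquad\text{for } 1\leq k\leq m,$$

then

$$x^{n}=\xi_{m}^{(1)}\xi_{m}^{(2)}.\tag{12}$$

We use the iteration scheme

$$\begin{cases}\xi_{1}^{(1)}=x^{2}, & \xi_{1}^{(2)}=x^{a_{0}},\\ \xi_{k}^{(1)}=\big(\xi_{k-1}^{(1)}\big)^{2}, & \xi_{k}^{(2)}=\big(\xi_{k-1}^{(1)}\big)^{a_{k-1}}\xi_{k-1}^{(2)},\quad 2\leq k\leq m,\end{cases}\tag{13}$$

and (12) to realize $x^{n}$ . The outline of the realization is demonstrated in Fig. 1. In each iteration step, we need to realize two basic operations: $(x)^{2}$ and $(x)^{a_{k-1}}y$ , where $x,y$ stand for $\xi_{k-1}^{(1)},\xi_{k-1}^{(2)}$ respectively. Note that $(x)^{2}$ can be realized by Eqs. (5) and (9) in Lemma 2.1. For the operation $(x)^{a_{j}}y$ , since $a_{j}\in\{0,1\}$ , by (7), we have

$$x^{a_{j}}y=\Big(\dfrac{1+(-1)^{a_{j}}}{2}+\dfrac{1-(-1)^{a_{j}}}{2}x\Big)y=\beta_{1}^{T}\sigma_{2}\left(\omega_{1}(c_{j}^{+}+c_{j}^{-}x)+\gamma_{1}y\right),$$

where $c^{\pm}_{j}:=\frac{1\pm(-1)^{a_{j}}}{2}$ .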
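Before turning to the network itself, the binary iteration can be sketched in plain arithmetic. The function name and the use of Python integers are assumptions; the recursion follows the definitions of $\xi_{k}^{(1)}$ and $\xi_{k}^{(2)}$ above.

```python
def power_by_binary_scheme(x, n):
    """Compute x**n from the binary digits a_0, ..., a_m of n (n >= 2)."""
    assert n >= 2
    bits = [(n >> j) & 1 for j in range(n.bit_length())]   # a_0, ..., a_m
    m = len(bits) - 1
    xi1 = x * x                  # xi_1^(1) = x^2
    xi2 = x ** bits[0]           # xi_1^(2) = x^{a_0}
    for k in range(2, m + 1):
        # Both updates read the values of step k-1 (tuple assignment).
        xi1, xi2 = xi1 * xi1, (xi1 ** bits[k - 1]) * xi2
    return xi1 * xi2             # x^n = xi_m^(1) * xi_m^(2)

check = all(power_by_binary_scheme(3, n) == 3 ** n for n in range(2, 200))
```

Each pass of the loop corresponds to one hidden layer of the ReQU network: one squaring and one product with either $1$ or $\xi_{k-1}^{(1)}$.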

Now we describe the procedure in detail. For $n\geq 3$ , we follow the idea given in Eq. (13) and Fig. 1. The function $x^{n}$ is realized in $m+1$ steps, which are discussed below.

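In Step $1$ , we calculate

$$\xi_{1}^{(1)}=x^{2}=\beta_{2}^{T}\sigma_{2}(\omega_{2}x)\geq 0,\qquad\xi_{1}^{(2)}=x^{a_{0}}=c_{0}^{+}+c_{0}^{-}x=c_{0}^{+}+c_{0}^{-}\beta_{1}^{T}\sigma_{2}(\omega_{1}x+\gamma_{1}),$$

which implies the first layer output of the neural network is:

$$\bm{x}_{1}=\sigma_{2}(A_{1}x+b_{1}),\quad\text{where}\quad A_{1}=\begin{bmatrix}\omega_{2}\\ \omega_{1}\end{bmatrix}_{6\times 1},\quad b_{1}=\begin{bmatrix}\bm{0}\\ \gamma_{1}\end{bmatrix}_{6\times 1},$$

and

$$\begin{bmatrix}\xi_{1}^{(1)}\\ \xi_{1}^{(2)}\end{bmatrix}=A_{20}\bm{x}_{1}+b_{20},\quad\text{where}\quad A_{20}=\begin{bmatrix}\beta_{2}^{T}&\bm{0}\\ \bm{0}&c_{0}^{-}\beta_{1}^{T}\end{bmatrix}_{2\times 6},\quad b_{20}=\begin{bmatrix}0\\ c_{0}^{+}\end{bmatrix}_{2\times 1}.$$

Since $\#\omega_{1}=4$ , $\#\omega_{2}=2$ , $\#\gamma_{1}=4$ , it is easy to see that the number of nodes in the first hidden layer is $6$ , and the number of non-zeros is: $\#{A}_{1}+\#{b}_{1}=10$ .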

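In Step $j$ , $2\leq j\leq m$ , we calculate

$$\xi_{j}^{(1)}=\big(\xi_{j-1}^{(1)}\big)^{2}=\sigma_{2}\big(\xi_{j-1}^{(1)}\big),\qquad\xi_{j}^{(2)}=\big(\xi_{j-1}^{(1)}\big)^{a_{j-1}}\xi_{j-1}^{(2)}=\beta_{1}^{T}\sigma_{2}\big(\omega_{1}(c_{j-1}^{+}+c_{j-1}^{-}\xi_{j-1}^{(1)})+\gamma_{1}\xi_{j-1}^{(2)}\big),$$

which suggests the $j$ -th layer output of the neural network is:

$$\bm{x}_{j}=\sigma_{2}\left(A_{j1}\begin{bmatrix}\xi_{j-1}^{(1)}\\ \xi_{j-1}^{(2)}\end{bmatrix}+b_{j1}\right),\quad A_{j1}=\begin{bmatrix}1&0\\ c_{j-1}^{-}\omega_{1}&\gamma_{1}\end{bmatrix}_{5\times 2},\quad b_{j1}=\begin{bmatrix}0\\ c_{j-1}^{+}\omega_{1}\end{bmatrix}_{5\times 1},$$

and

$$\begin{bmatrix}\xi_{j}^{(1)}\\ \xi_{j}^{(2)}\end{bmatrix}=A_{j+1,0}\bm{x}_{j}+b_{j+1,0},\quad\text{where}\quad A_{j+1,0}=\begin{bmatrix}1&\bm{0}\\ 0&\beta_{1}^{T}\end{bmatrix}_{2\times 5},\quad b_{j+1,0}=\bm{0}.$$

We have

$$A_{j}=A_{j1}A_{j0},\qquad b_{j}=A_{j1}b_{j0}+b_{j1},\qquad j=2,\dots,m.$$

By a direct calculation, we find that the number of nodes in Layer $j$ is $5\ (j=2,\ldots,m)$ , and the number of non-zeros in Layer $j$ , $j=3,\ldots,m$ is $\#{A}_{j}+\#{b}_{j}\leq 21+4=25$ . For $j=2$ , $\#{A}_{2}+\#{b}_{2}=26+4=30$ .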

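In Step $m+1$ , we calculate

$$x^{n}=\xi_{m}^{(1)}\xi_{m}^{(2)}=\beta_{1}^{T}\sigma_{2}(\omega_{1}\xi_{m}^{(1)}+\gamma_{1}\xi_{m}^{(2)}),$$

which implies

$$\bm{x}_{m+1}=\sigma_{2}\left(A_{m+1,1}\begin{bmatrix}\xi_{m}^{(1)}\\ \xi_{m}^{(2)}\end{bmatrix}\right),\quad\text{where}\quad A_{m+1,1}=\begin{bmatrix}\omega_{1}&\gamma_{1}\end{bmatrix}_{4\times 2}.$$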

So we get $\bm{x}_{m+1}=\sigma_{2}(A_{m+1}\bm{x}_{m}+b_{m+1})$ , with

[TABLE]

and

[TABLE]

By a direct calculation, we get the number of nodes in Layer $m+1$ is $4$ , the number of non-zero weights is $\#{A}_{m+1}=20$ .

For Layer $m+2$ , which is the output layer of the overall network, ${A}_{m+2}=\beta^{T}_{1}$ , and ${b}_{m+2}=0$ . There are no activation units and the number of nonzero weights is $\#A_{m+2}=4$ .

The ReQU network we just built has $m+2$ layers. The total number of nodes is $6+5(m-1)+4=5m+5$ . The total number of nonzero weights is $10+30+25(m-2)+20+4=25m+14$ . Combining this with the cases $n=1,2$ , we reach the desired conclusion.
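The whole construction can be assembled and checked numerically. The sketch below builds the layer matrices from Steps $1$ to $m+1$ by folding each affine read-out into the next layer, as in $A_{j}=A_{j1}A_{j0}$; the function names and the list-of-pairs representation are assumptions. It counts the nonzeros actually present, so the weight count can fall below $25m+14$ when some $c_{j}^{\pm}$ vanish.

```python
import numpy as np

BETA1 = np.array([1.0, 1.0, -1.0, -1.0]) / 4.0
OMEGA1 = np.array([1.0, -1.0, 1.0, -1.0])
GAMMA1 = np.array([1.0, -1.0, -1.0, 1.0])
BETA2 = np.array([1.0, 1.0])
OMEGA2 = np.array([1.0, -1.0])

def requ(z):
    return np.maximum(z, 0.0) ** 2

def monomial_network(n):
    """Return [(A_1, b_1), ..., (A_{m+2}, b_{m+2})] realizing x**n, n >= 3."""
    assert n >= 3
    bits = [(n >> j) & 1 for j in range(n.bit_length())]   # a_0, ..., a_m
    m = len(bits) - 1
    c_plus = [(1 + (-1) ** a) / 2 for a in bits]
    c_minus = [(1 - (-1) ** a) / 2 for a in bits]

    # Step 1: hidden layer x_1 and the affine read-out (A_20, b_20).
    layers = [(np.concatenate([OMEGA2, OMEGA1]).reshape(6, 1),
               np.concatenate([np.zeros(2), GAMMA1]))]
    read_A = np.zeros((2, 6))
    read_A[0, :2] = BETA2
    read_A[1, 2:] = c_minus[0] * BETA1
    read_b = np.array([0.0, c_plus[0]])

    # Steps j = 2, ..., m: A_j = A_j1 A_j0 and b_j = A_j1 b_j0 + b_j1.
    for j in range(2, m + 1):
        in_A = np.zeros((5, 2))
        in_A[0, 0] = 1.0
        in_A[1:, 0] = c_minus[j - 1] * OMEGA1
        in_A[1:, 1] = GAMMA1
        in_b = np.concatenate([[0.0], c_plus[j - 1] * OMEGA1])
        layers.append((in_A @ read_A, in_A @ read_b + in_b))
        read_A = np.zeros((2, 5))
        read_A[0, 0] = 1.0
        read_A[1, 1:] = BETA1
        read_b = np.zeros(2)

    # Step m+1: the product xi_m^(1) * xi_m^(2), then the linear output layer.
    in_A = np.column_stack([OMEGA1, GAMMA1])
    layers.append((in_A @ read_A, in_A @ read_b))
    layers.append((BETA1.reshape(1, 4), np.zeros(1)))
    return layers

def evaluate(layers, x):
    h = np.array([x], dtype=float)
    for A, b in layers[:-1]:
        h = requ(A @ h + b)
    A, b = layers[-1]
    return (A @ h + b)[0]

def size(layers):
    """Return (number of layers, number of nodes, number of nonzero weights)."""
    nodes = sum(A.shape[0] for A, _ in layers[:-1])
    weights = sum(np.count_nonzero(A) + np.count_nonzero(b) for A, b in layers)
    return len(layers), nodes, weights
```

For example, `size(monomial_network(7))` reports $4$ layers and $15$ nodes, in agreement with the row for $n=7$ in Table 1.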

Now we consider how to convert univariate polynomials into $\sigma_{2}$ networks. If we directly apply Theorem 2.4 to each monomial term in a polynomial and then combine them together, we obtain a network of depth $\mathcal{O}(\log_{2}n)$ and size $\mathcal{O}(n\log_{2}n)$ , which is not optimal. We provide here two algorithms to convert a polynomial into a ReQU network of the same scale, i.e. without the extra $\log_{2}n$ factor. The first algorithm is a direct implementation of Horner’s method (also known as Qin Jiushao’s algorithm in China):

[TABLE]

To describe the algorithm iteratively, we introduce the following intermediate variables

[TABLE]

Then we have $y_{1}=f(x)$ . By implementing $y_{k}$ for each $k$ , using the realization formulas given in Lemma 2.1, and stacking the implementations of the $n$ steps, we obtain a $\sigma_{2}$ neural network with $\mathcal{O}(n)$ layers, where each layer has a constant width independent of $n$ .
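A minimal sketch of the Horner-based idea is given below, assuming the polynomial is stored as a coefficient list `coeffs` with `coeffs[j]` multiplying $x^{j}$ (this indexing and the function names are assumptions, not the paper's notation). Each multiplication by $x$ is done with the four-unit ReQU product of Lemma 2.1. In an actual network, $x$ would also have to be carried from layer to layer, for instance with the identity realization; the loop below simply reuses the Python variable.

```python
import numpy as np

BETA1 = np.array([1.0, 1.0, -1.0, -1.0]) / 4.0
OMEGA1 = np.array([1.0, -1.0, 1.0, -1.0])
GAMMA1 = np.array([1.0, -1.0, -1.0, 1.0])

def requ(z):
    return np.maximum(z, 0.0) ** 2

def product(x, y):
    """xy realized by one hidden layer of four ReQU units."""
    return BETA1 @ requ(OMEGA1 * x + GAMMA1 * y)

def horner_requ(coeffs, x):
    """Evaluate sum_j coeffs[j] * x**j with one ReQU product per Horner step."""
    y = coeffs[-1]
    for a in reversed(coeffs[:-1]):
        y = a + product(x, y)    # one layer of constant width per step
    return y

coeffs = [1.0, -2.0, 0.5, 3.0, -1.0]     # example polynomial of degree 4
value = horner_requ(coeffs, 0.7)
```

The depth of this evaluation grows linearly with the degree, which is the drawback that the second construction avoids.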

The second construction, given in the following theorem, achieves the same representation power with the same order of weights but far fewer layers.

Theorem 2.6.

If $f(x)$ is a polynomial of degree $n$ on $\mathbb{R}$ , then it can be represented exactly by a $\sigma_{2}$ neural network with $\lfloor\log_{2}n\rfloor+1$ hidden layers, and the numbers of nodes and nonzero weights are both of order $\mathcal{O}(n)$ . To be more precise, the number of nodes is bounded by $9n$ , and number of nonzero weights is bounded by $61n$ .

Proof 2.7.

Assume $f(x)=\sum_{j=0}^{n}a_{j}x^{j}$ , $a_{n}\neq 0$ . We first use an example with $n=15$ to demonstrate the process of network construction as follows:

[TABLE]

Here $\xi_{1,j_{1}},j_{1}=0,1,2,\cdots,8$ , $\xi_{2,j_{2}},j_{2}=0,1,2,\cdots,4$ , and $\xi_{3,j_{3}}$ , $j_{3}=0,1,2$ are the intermediate variable output of Layer $1$ , $2$ , $3$ , respectively. The final output is $f(x)=\xi_{3,0}\xi_{3,2}+\xi_{3,1}$ .

We first describe the construction for the case $n\geq 4$ here.

Denote $m=\lfloor\log_{2}n\rfloor$ . We first extend $f(x)$ to include monomials up to degree $2^{m+1}-1$ by zero padding:

[TABLE]

The process of building a $\sigma_{2}$ network to represent $f(x)$ is similar to the case $n=15$ . We give details below.

1) The intermediate variables output by Layer $1$ are:

[TABLE]

which suggest

[TABLE]

and

[TABLE]

with $\bm{\xi}_{1}=[\xi_{1,1},\xi_{1,2},\ldots,\xi_{1,2^{m}},\xi_{1,0}]^{T}$ , $a_{21}=[a_{1},a_{3},\ldots,a_{2^{m+1}-1}]^{T}$ , $a_{22}=[a_{0},a_{2},\ldots,a_{2^{m+1}-2}]^{T}$ .

2) The intermediate variables output by Layer $2$ are:

[TABLE]

which imply

[TABLE]

and most elements in $A_{21},b_{21}$ are zeros. The nonzero elements are given below using a Matlab subscript style as:

[TABLE]

for $j=1,2,\ldots,2^{m-1}$ , and the last element of $A_{21}$ is $1$ . According to the result (33) of Layer $1$ , we get

[TABLE]

We also have

[TABLE]

Here $\bm{\xi}_{2}=[\xi_{2,1},\xi_{2,2},\ldots,\xi_{2,2^{m-1}},\xi_{2,0}]^{T}$ , $I_{2^{m-1}}$ is the identity matrix in $\mathbb{R}^{2^{m-1}\times 2^{m-1}}$ , and $\otimes$ stands for the Kronecker product.

3) The intermediate variables output by Layer $k\ (3\leq k\leq m)$ are:

[TABLE]

Denote $\bm{\xi}_{k}=[\xi_{k,1},\xi_{k,2},\ldots,\xi_{k,2^{m-k+1}},\xi_{k,0}]^{T}$ . We have

[TABLE]

where $A_{k1},b_{k1}$ have the same formula as $A_{21},b_{21}$ given in (37) except that the maximum value of $j$ is $2^{m-k+1}$ rather than $2^{m-1}$ , and $A_{k+1,0}$ has the same formula as $A_{30}$ given in (39) with $\bm{1}_{2^{m-1}\times 1}$ replaced by $\bm{1}_{2^{m-k+1}\times 1}$ and $\bm{1}_{n}=[1,\ldots,1]^{T}\in\mathbb{R}^{n\times 1}$ . Combining (42) and (39), we get

[TABLE]

4) The intermediate variables output by Layer $m+1$ are:

[TABLE]

Rewriting this in the following form

[TABLE]

we have

[TABLE]

and

[TABLE]

The iteration formula for $\bm{x}_{m+1}$ is $\bm{x}_{m+1}=\sigma_{2}(A_{m+1}\bm{x}_{m}+b_{m+1})$ , where

[TABLE]

5) Since $\bm{\xi}_{m+1}=f(x)$ , the network ends at Layer $m+2$ , with $\bm{x}_{m+2}=\bm{\xi}_{m+1}$ . So we get ${A}_{m+2}=A_{m+2,0}$ , and $b_{m+2}=0$ from Eq. (45).

For $n<4$ , the procedure can be obtained by removing some sub-steps from the case $n\geq 4$ . From the construction process, we see that the number of layers is $m+2$ ; the numbers of nodes in Layer $1$ , Layer $k$ $(2\leq k\leq m)$ and Layer $m+1$ are $6$ , $8\times 2^{m-k+1}+1$ and $8$ , respectively; and the numbers of nonzero weights in $\bm{A}_{j}$ , $\bm{b}_{j}$ $(1\leq j\leq m+2)$ are no larger than $10$ for $j=1$ , $(40\times 2^{m-1}+2)+8\times 2^{m-1}$ for $j=2$ , $(68\times 2^{m-j+1}+1)+4\times 2^{m-j+1}$ for $3\leq j\leq m$ , $72$ for $j=m+1$ , and $8$ for $j=m+2$ . Summing up these numbers, we reach the desired bound.
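The final summation can be checked numerically. The short Python sketch below tallies the per-layer counts quoted in the proof for $m=\lfloor\log_{2}n\rfloor\geq 2$ and compares them with $9\cdot 2^{m}$ and $61\cdot 2^{m}$ ; since $n\geq 2^{m}$ , this implies the bounds $9n$ and $61n$ . The function name `requ_network_counts` is introduced here for illustration only.

```python
def requ_network_counts(m):
    """Tally nodes and nonzero weights from the per-layer counts in the proof.

    m = floor(log2 n) is assumed to be at least 2 (i.e. n >= 4).
    """
    # Layer 1: 6 nodes; Layer k (2 <= k <= m): 8*2^(m-k+1)+1; Layer m+1: 8 nodes.
    nodes = 6 + sum(8 * 2 ** (m - k + 1) + 1 for k in range(2, m + 1)) + 8
    weights = (
        10                                                  # j = 1
        + (40 * 2 ** (m - 1) + 2) + 8 * 2 ** (m - 1)        # j = 2
        + sum((68 * 2 ** (m - j + 1) + 1) + 4 * 2 ** (m - j + 1)
              for j in range(3, m + 1))                     # 3 <= j <= m
        + 72                                                # j = m + 1
        + 8                                                 # j = m + 2
    )
    return nodes, weights
```

Summing the geometric series by hand gives $8\cdot 2^{m}+m-3$ nodes and $60\cdot 2^{m}+m-54$ nonzero weights, which the sketch reproduces.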

Remark 2.8.

Theorem 2.4 says we can use a $\sigma_{2}$ network of scale $\mathcal{O}(\log_{2}n)$ to represent $x^{n}$ exactly. Theorem 2.6 says that any polynomial of degree $n$ can be represented exactly by a $\sigma_{2}$ neural network with $\lfloor\log_{2}n\rfloor+1$ hidden layers and $\mathcal{O}(n)$ nonzero weights. Such results are not available for ReLU networks or for neural networks using other non-polynomial activation functions, such as $\operatorname{logistic}$ , $\tanh$ , $\operatorname{softplus}$ , $\operatorname{sinc}$ , etc. We note that the constants in the two theorems may not be optimal, but the orders of the number of layers and the number of nonzero weights are optimal.

2.1.2 Error bounds of approximating smooth functions by deep ReQU networks

Now we analyze the error of approximating general smooth functions using ReQU networks. We first introduce some notations and give a brief review of some classical results of polynomial approximation.

Let $\Omega\subseteq\mathbb{R}^{d}$ be the domain on which the function to be approximated is defined. For the 1-dimensional case in this section, we focus on $\Omega=I:=[-1,1]$ . Similar discussions and results can be extended to $\Omega=[0,\infty)$ and $(-\infty,\infty)$ as well. We denote the set of polynomials with degree up to $N$ defined on $\Omega$ by ${P}_{N}(\Omega)$ , or simply ${P}_{N}$ . Let $J^{\alpha,\beta}_{n}(x)$ be the Jacobi polynomial of degree $n$ , $n\in\mathbb{N}_{0}$ ; the family of all these polynomials forms a complete orthogonal basis of the weighted space $L^{2}_{\omega^{\alpha,\beta}}(I)$ with respect to the weight $\omega^{\alpha,\beta}(x)=(1-x)^{\alpha}(1+x)^{\beta}$ for $\alpha,\beta>-1$ . To describe functions with high order regularity, we define the Jacobi-weighted Sobolev space $B_{\alpha,\beta}^{m}(I)$ as (see e.g. [54]):

[TABLE]

with norm

[TABLE]

Define the $L^{2}_{\omega^{\alpha,\beta}}$ -orthogonal projection $\pi^{\alpha,\beta}_{N}$ : $L^{2}_{\omega^{\alpha,\beta}}(I)\rightarrow P_{N}$ by requiring

[TABLE]

A detailed error estimate on the projection error $\pi_{N}^{\alpha,\beta}u-u$ is given in Theorem 3.35 of [54], by which we have the following theorem on the approximation error of ReQU networks.

Theorem 2.9.

Let $\alpha,\beta>-1$ , $N\geq 1$ . For any $u\in B^{m}_{\alpha,\beta}(I)$ , there exists a ReQU network $\Phi^{u}_{N}$ with $\lfloor\log_{2}N\rfloor+1$ hidden layers, $\mathcal{O}(N)$ nodes, and $\mathcal{O}(N)$ nonzero weights, satisfying the following estimates.

1) If $0\leq l\leq m\leq N+1$ , we have

[TABLE]

2) If $m>N+1\geq l$ , we have

[TABLE]

Here $c\approx 1$ for $N\gg 1$ .

Proof 2.10.

For any given $u\in B^{m}_{\alpha,\beta}(I)$ , the polynomial $f=\pi^{\alpha,\beta}_{N}u\in P_{N}$ . The projection error $\pi^{\alpha,\beta}_{N}u-u$ is estimated by Theorem 3.35 in [54], which is (52) and (53) with $R_{\sigma_{2}}(\Phi^{u}_{N})$ replaced by $\pi^{\alpha,\beta}_{N}u$ . By Theorem 2.6, $f$ can be represented exactly by a ReQU network $\Phi^{u}_{N}$ with $\lfloor\log_{2}N\rfloor+1$ hidden layers, $\mathcal{O}(N)$ nodes, and $\mathcal{O}(N)$ nonzero weights, i.e. $R_{\sigma_{2}}(\Phi^{u}_{N})=\pi^{\alpha,\beta}_{N}u$ . We thus obtain estimation (52) and (53).

Remark 2.11.

In (52) and (53), we allow the error to be measured in high-order derivatives, i.e. $l\geq 3$ , because $R_{\sigma_{2}}(\Phi^{u}_{N})$ is an exact realization of a polynomial, which is infinitely differentiable. In practice, if $\Phi^{u}_{N}$ is a trained network with numerical error, we cannot measure the error in derivatives of order $\geq 3$ , since $\partial_{x}^{3}\sigma_{2}(x)$ is not in $L^{2}$ space.

Based on Theorem 2.9, we can analyze the network complexity of $\varepsilon$ -approximation of a given function with certain smoothness. For simplicity, we only consider the case with $l=0$ . The result is given in the following theorem.

Theorem 2.12.

For any given function $f(x)\in B^{m}_{\alpha,\beta}(I)$ with norm less than $1$ , where $m$ is either a fixed positive integer or infinity, and for $\varepsilon\in(0,1)$ small enough, there exists a ReQU network $\Phi^{f}_{\varepsilon}$ with number of layers $L$ , number of nonzero weights $N$ satisfying

• if $m$ is a fixed positive integer, then $L=\mathcal{O}\left(\frac{1}{m}\log_{2}\frac{1}{\varepsilon}\right)$ , and $N=\mathcal{O}\big{(}{\varepsilon}^{-\frac{1}{m}}\big{)}$ ;

• if $m=\infty$ , i.e. $f(x)\in B^{\infty}_{\alpha,\beta}(I)$ , then $L=\mathcal{O}\left(\log_{2}\left(\log\frac{1}{\varepsilon}\right)\right)$ , and $N=\mathcal{O}\left(\frac{\log(1/\varepsilon)}{\log_{2}(\log(1/\varepsilon))}\right)$ ,

that approximates $f$ within an error tolerance $\varepsilon$ , i.e.

[TABLE]

Proof 2.13.

For a fixed $m$ , or $N\gg m$ , we obtain from (52) that

[TABLE]

By the above estimate, to approximate a function with $B^{m}_{\alpha,\beta}(I)$ norm less than $1$ within an error tolerance $\varepsilon$ , it suffices to take $N=\left(\frac{c}{\varepsilon}\right)^{\frac{1}{m}}$ . For fixed $m$ , we have $N=\mathcal{O}\big{(}{\varepsilon}^{-\frac{1}{m}}\big{)}$ , and the depth of the corresponding ReQU network is $L=\mathcal{O}\left(\frac{1}{m}\log_{2}\frac{1}{\varepsilon}\right)$ .

For $f\in B^{\infty}_{\alpha,\beta}$ , by taking $m=\infty$ in Theorem 2.9, we have

[TABLE]

where $c^{\prime}$ is a general constant, and $\gamma\approx\mathcal{O}(\log N)$ can be larger than any fixed positive number for sufficiently large $N$ . To approximate a function with $B^{\infty}_{\alpha,\beta}(I)$ norm less than $1$ with error $\varepsilon=c^{\prime}e^{-\gamma N}$ , it suffices to take $N=\frac{1}{\gamma}\log\left(\frac{c^{\prime}}{\varepsilon}\right)$ , which means $N=\mathcal{O}\left(\frac{\log(1/\varepsilon)}{\log_{2}(\log(1/\varepsilon))}\right)$ . The depth of the corresponding ReQU network is $L=\mathcal{O}\left(\log_{2}\left(\log\frac{1}{\varepsilon}\right)\right)$ . Here $\varepsilon$ is assumed to be small enough such that $\log_{2}\big{(}\log\frac{c^{\prime}}{\varepsilon}\big{)}$ is no less than 1.
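For the fixed-$m$ case, the choice of $N$ in this proof can be turned into a small calculator. The Python sketch below picks the smallest integer degree $N$ with $cN^{-m}\leq\varepsilon$ and reports the depth $\lfloor\log_{2}N\rfloor+1$ and the weight bound $61N$ from Theorem 2.6. The default $c=1$ is an assumption made only for illustration (the proof merely states $c\approx 1$ for large $N$ ), and the function name is hypothetical.

```python
import math


def requ_size_for_tolerance(eps, m, c=1.0):
    """Degree, hidden layers and weight bound for tolerance eps (fixed m).

    Assumes the error model c * N**(-m) <= eps; c = 1.0 is illustrative only.
    """
    degree = max(1, math.ceil((c / eps) ** (1.0 / m)))
    hidden_layers = degree.bit_length()      # equals floor(log2 N) + 1 for N >= 1
    weight_bound = 61 * degree               # bound from Theorem 2.6
    return degree, hidden_layers, weight_bound
```

Halving $\varepsilon$ increases the degree by a factor of about $2^{1/m}$ , so smoother functions (larger $m$ ) need markedly smaller networks for the same tolerance.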

2.2 Approximation by deep networks using general rectified power units

The results on approximating monomials, polynomials and general smooth functions by ReQU networks discussed in Subsection 2.1 can be extended to general RePU networks.

To keep the paper short, we only present the results on approximating monomials with RePU in Theorem 2.14. Other results similar to those for ReQU networks can be obtained, but the details are quite lengthy; we report them in a separate paper [55].

Theorem 2.14.

Regarding the problem of using $\sigma_{s}(x)\;(2\leq s\in\mathbb{N})$ neural networks to exactly represent monomial $x^{n}$ , $n\in\mathbb{N}$ , we have the following results:

(1)

If $s=n$ , the monomial $x^{n}$ can be realized exactly using a $\sigma_{s}$ network having only 1 hidden layer with two nodes.

(2)

If $1\leq n<s$ , the monomial $x^{n}$ can be realized exactly using a $\sigma_{s}$ network having only 1 hidden layer with no more than $2s$ nodes.

(3)

If $n>s\geq 2$ , the monomial $x^{n}$ can be realized exactly using a $\sigma_{s}$ network having $\lfloor\log_{s}n\rfloor+2$ hidden layers with no more than $(6s+2)(\lfloor\log_{s}n\rfloor+2)$ nodes and no more than $\mathcal{O}(25s^{2}\lfloor\log_{s}n\rfloor)$ nonzero weights.

Proof 2.15.

(1) It is easy to check that $x^{s}$ has an exact $\sigma_{s}$ realization given by

[TABLE]

(2) For the case of $1\leq n<s$ , we consider the following linear combination

[TABLE]

where $a_{0},a_{k},b_{k}$ , $k=1,\ldots,s$ are parameters to be determined. $C^{s}_{j}$ are binomial coefficients. The above expression is equal to $x^{n}$ , provided that the parameters solve the following linear system:

[TABLE]

where the top-left $s\times s$ submatrix of $D_{s+1}$ is a Vandermonde matrix, which is invertible as long as $b_{k}$ , $k=1,\ldots,s$ are distinct. For simplicity, we choose $b_{k}$ , $k=1,\ldots,s$ to be equidistant points, so that (59) is uniquely solvable. Solving for $a_{0},\ldots,a_{s}$ , we obtain an exact representation of $x^{n}$ using (58), which corresponds to a neural network having one hidden layer with no more than $2s$ $\sigma_{s}$ units.

For example, when $s=2$ , we may take $b_{1}=-1$ , $b_{2}=1$ . Solving Eq. (59) with $n=1$ , we get $a_{1}=-\frac{1}{4}$ , $a_{2}=\frac{1}{4}$ , and $a_{0}=0$ . Thus

[TABLE]

When $s=3$ , take $b_{1}=-1$ , $b_{2}=0$ , $b_{3}=1$ , we obtain

[TABLE]
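The coefficients in case (2) can be computed mechanically. The Python sketch below assumes the ansatz $x^{n}=a_{0}+\sum_{k=1}^{s}a_{k}\rho_{s}(x+b_{k})$ with $\rho_{s}(t)=\sigma_{s}(t)+(-1)^{s}\sigma_{s}(-t)=t^{s}$ , matches the coefficients of $x^{j}$ , $j=1,\ldots,s$ , to obtain a Vandermonde-type system for $a_{1},\ldots,a_{s}$ , and then fixes $a_{0}$ from the constant term. The exact layout of the linear system and all function names are assumptions of this sketch; exact rational arithmetic is used so that the identities hold without rounding.

```python
from fractions import Fraction
from math import comb


def sigma(s, t):
    """Rectified power unit: max(0, t)**s."""
    return max(t, 0) ** s


def rho(s, t):
    """rho_s(t) = sigma_s(t) + (-1)^s sigma_s(-t), which equals t**s."""
    return sigma(s, t) + (-1) ** s * sigma(s, -t)


def solve_linear(A, rhs):
    """Gauss-Jordan elimination over exact rationals (A must be invertible)."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def monomial_coeffs(s, n, shifts):
    """Return (a0, [a1..as]) with x**n = a0 + sum_k a_k * rho_s(x + b_k), 1 <= n < s."""
    b = [Fraction(v) for v in shifts]
    # Coefficient of x^j: C(s, j) * sum_k a_k b_k^(s-j) = 1 if j == n else 0.
    A = [[bk ** (s - j) for bk in b] for j in range(1, s + 1)]
    rhs = [Fraction(1, comb(s, n)) if j == n else Fraction(0) for j in range(1, s + 1)]
    a = solve_linear(A, rhs)
    a0 = -sum(ak * bk ** s for ak, bk in zip(a, b))   # cancels the constant term
    return a0, a


def realize(s, n, shifts, x):
    """Evaluate the one-hidden-layer realization of x**n at x."""
    a0, a = monomial_coeffs(s, n, shifts)
    return a0 + sum(ak * rho(s, x + Fraction(bk)) for ak, bk in zip(a, shifts))
```

With $s=2$ , $n=1$ and shifts $-1,1$ this reproduces $a_{1}=-\frac{1}{4}$ , $a_{2}=\frac{1}{4}$ , $a_{0}=0$ from the example above. Each $\rho_{s}$ costs two $\sigma_{s}$ units, which is where the bound of $2s$ nodes comes from.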

(3) Now, we consider the case $n>s\geq 2$ , $n\in\mathbb{N}$ . For any given numbers $y,z\in\mathbb{R}$ , using the identity

[TABLE]

and the fact that $(y+z)^{2}$ , $(y-z)^{2}$ can both be realized exactly by a one-layer $\sigma_{s}$ network with no more than $2s$ nodes, we conclude that the product $yz$ can be realized by a one-layer $\sigma_{s}$ network with no more than $4s$ nodes. To realize $x^{n}$ by $\sigma_{s}$ , we rewrite $n$ in the following form

[TABLE]

where $a_{j}\in\{0,1,\ldots,s-1\}$ for $j=0,1,...,m-1$ and $a_{m}=1$ . So we have

[TABLE]

Define $\xi_{k}=x^{s^{k}}$ , $z_{k+1}=(\xi_{k})^{a_{k}}$ , $k=0,1,\ldots,m$ , and

[TABLE]

we have $y_{m+2}=x^{n}$ . Eq. (63) can be regarded as an iteration scheme, with iteration variables $\xi_{k},y_{k},z_{k}$ , where the subscript $k$ stands for the iteration step. A schematic diagram for this iteration is given in Fig. 2. In contrast to Theorem 2.4, for $s>2$ we need a deep $\sigma_{s}$ neural network with $m+2$ hidden layers to realize $x^{n},n>s$ , due to the introduction of the intermediate variables $z_{k}$ . In each layer, we need no more than $2+2s+4s$ activation nodes to calculate $\xi_{k+1}=\rho_{s}(\xi_{k})$ , $z_{k+1}=(\xi_{k})^{a_{k}}$ , and $y_{k+1}=z_{k}y_{k}$ . So in total we need no more than $(6s+2)(m+2)=\mathcal{O}(6s\log_{s}n)$ nodes. A direct calculation shows that the number of nonzero weights in the network is no more than $\mathcal{O}(25s^{2}\log_{s}n)$ . The theorem is proved.
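The iteration can be mimicked in a few lines of Python. The sketch below assumes the standard base-$s$ expansion $n=\sum_{j=0}^{m}a_{j}s^{j}$ with digits in $\{0,\ldots,s-1\}$ and a nonzero leading digit, and uses ordinary integer powers where the network would use the one-layer realizations of $\xi^{s}$ , $\xi^{a}$ and the product. It is a sequential illustration of the arithmetic, not of the layer-parallel schedule, and the function names are hypothetical.

```python
def base_s_digits(n, s):
    """Digits of n in base s, least significant first."""
    digits = []
    while n > 0:
        digits.append(n % s)
        n //= s
    return digits


def power_by_digits(x, n, s):
    """Compute x**n from the quantities xi_k = x**(s**k) and z = xi_k**a_k.

    Returns (x**n, m) with m = floor(log_s n).
    """
    digits = base_s_digits(n, s)
    xi = x            # xi_0 = x
    y = 1             # running product of the z factors
    for a in digits:
        z = xi ** a   # a < s: one-layer realization, cases (1)-(2)
        y = y * z     # product of two numbers, realizable with 4s nodes
        xi = xi ** s  # xi_{k+1} = rho_s(xi_k)
    return y, len(digits) - 1
```

For instance, $47=2+4\cdot 5+1\cdot 5^{2}$ , so with $s=5$ the loop runs over three digits and $m=2$ .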

3 Approximation of multivariate functions

In this section, we discuss how to approximate multivariate smooth functions by ReQU networks. Similar to the univariate case, we first study the representation of polynomials then discuss the approximation error of general smooth functions.

3.1 Deep ReQU network representations of multivariate polynomials

Theorem 3.1.

If $f(x)$ is a multivariate polynomial with total degree $n$ on $\mathbb{R}^{d}$ , then there exists a $\sigma_{2}$ neural network having $d\lfloor\log_{2}n\rfloor+d$ hidden layers with no more than $\mathcal{O}(C^{n+d}_{d})$ activation functions and nonzero weights, that can represent $f$ with no error. We note that, here the constant behind the big $\mathcal{O}$ can be bounded independent of $d$ .

Proof 3.2.

1) We first consider the 2-dimensional case. Suppose $f(x,y)=\sum_{i+j=0}^{n}a_{ij}x^{i}y^{j}$ and $n\geq 4$ (the results for $n\leq 3$ are similar but easier, so skipped here). To represent $f(x,y)$ exactly with a $\sigma_{2}$ neural network based on the results for the 1-dimensional case given in Theorem 2.6, we first rewrite $f(x,y)$ as

[TABLE]

So to realize $f(x,y)$ , we can first realize $a^{y}_{i}$ , $i=0,\ldots,n-1$ using $n$ small $\sigma_{2}$ networks $\Phi_{i}$ , $i=0,\ldots,n-1$ , i.e. $R_{\sigma_{2}}(\Phi_{i})(y)=a^{y}_{i}$ for given input $y$ ; then use a $\sigma_{2}$ network $\Phi_{n}$ to realize the 1-dimensional polynomial $f(x,y)=\sum_{i=0}^{n}a^{y}_{i}x^{i}$ . Two places need some technical treatment; the details are given below.

(1)

The network $\Phi_{n}$ takes $a^{y}_{i}$ , $i=0,\ldots,n$ and $x$ as input. So these quantities must be present at the same layer of the overall neural network, because we do not want connections over non-adjacent layers. By Theorem 2.6, the largest depth of the networks $\Phi_{i}$ , $i=0,\ldots,n-1$ is $\lfloor\log_{2}n\rfloor+2$ , so we can lift $x$ to layer $\lfloor\log_{2}n\rfloor+2$ using multiple $id_{\mathbb{R}}(\cdot)$ operations. Similarly, we also keep a record of the input $y$ in each layer using multiple $id_{\mathbb{R}}(\cdot)$ operations, such that $\Phi_{i}$ , $i=1,\ldots,n-1$ can start from an appropriate layer and generate its output exactly at layer $\lfloor\log_{2}n\rfloor+2$ . The overall cost for recording $x,y$ in layers $1,\ldots,\lfloor\log_{2}n\rfloor+2$ is $\mathcal{O}(\lfloor\log_{2}n\rfloor+2)$ , which is small compared to the number of coefficients $C^{n+2}_{2}$ .

(2)

While realizing $\sum_{i=0}^{n}a^{y}_{i}x^{i}$ , the coefficients $a^{y}_{i},i=0,\ldots n$ are network input instead of fixed parameters. So when applying the network construction given in Theorem 2.6, we need to modify the structure of the first layer of the network. More precisely, Eq. (30) in Theorem 2.6 should be changed to

[TABLE]

So the number of nodes in the first layer changes from $6$ to $2+8\cdot 2^{m}$ , and the number of nonzero weights in the first layer changes from $10$ to $16\cdot 2^{m}+2$ . Consequently, the numbers of hidden layers, nodes and nonzero weights of $\Phi_{n}$ can be bounded by $\lfloor\log_{2}n\rfloor+1$ , $17n$ , and $77n$ respectively.

Assembling $\Phi_{0},\ldots,\Phi_{n}$ , the overall network to represent $f(x,y)$ has $2\lfloor\log_{2}n\rfloor+3$ layers with number of nodes no more than

[TABLE]

and number of weights no more than

[TABLE]

Thus, we proved that the theorem is true for the case $d=2$ .
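The two-stage structure of the $d=2$ argument (first the $y$ -dependent coefficients $a^{y}_{i}$ , then a univariate polynomial in $x$ ) can be illustrated by the following Python sketch. It uses plain Horner evaluation in place of the networks $\Phi_{i}$ , and the triangular storage `a[i][j]` for the coefficient of $x^{i}y^{j}$ with $i+j\leq n$ is a convention chosen for this sketch.

```python
from math import comb


def horner(coeffs, t):
    """Evaluate sum_j coeffs[j] * t**j."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * t + c
    return acc


def eval_bivariate(a, x, y):
    """Evaluate f(x, y) = sum_{i+j<=n} a[i][j] x^i y^j in two stages.

    Stage 1 plays the role of the small networks Phi_i (one polynomial in y per i);
    stage 2 plays the role of Phi_n (a polynomial in x with input-dependent coefficients).
    """
    coeffs_in_y = [horner(row, y) for row in a]   # a_i^y, i = 0..n
    return horner(coeffs_in_y, x)


def num_coefficients(n, d=2):
    """Number of monomials of total degree <= n in d variables: C(n+d, d)."""
    return comb(n + d, d)
```

The total work is proportional to the number of stored coefficients, $C^{n+2}_{2}$ , which is the quantity that controls the network size in the theorem.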

2) The case $d>2$ can be proved by mathematical induction using a procedure similar to the one for the $d=2$ case. Note that we pad in some zeros in each direction in the iteration. Since the number of degrees of freedom is geometrically reduced after each dimension iteration, a straightforward calculation shows that the constant behind the big $\mathcal{O}$ can be made independent of the dimension $d$ . An improved algorithm using fewer padding zeros is proposed in another paper [55].

Using a similar approach as in Theorem 3.1, one can easily prove the following theorem.

Theorem 3.3.

For a polynomial $f_{N}$ in a tensor product space $Q_{N}^{d}(I_{1}\times\cdots\times I_{d}):=P_{N}(I_{1})\otimes\cdots\otimes P_{N}(I_{d})$ , there exists a $\sigma_{2}$ network having $d\lfloor\log_{2}N\rfloor+d$ hidden layers with no more than $\mathcal{O}(N^{d})$ activation functions and nonzero weights that can represent $f_{N}$ with no error.

3.2 Error bounds of approximating multivariate functions by ReQU networks

Now we analyze the error of approximating general multivariate smooth functions using ReQU networks.

For a vector $\bm{x}=(x_{1},\ldots,x_{d})\in\mathbb{R}^{d}$ , we define $|\bm{x}|_{1}:=|x_{1}|+\cdots+|x_{d}|$ , $|\bm{x}|_{\infty}:=\max_{i=1}^{d}|x_{i}|$ . Define the high dimensional Jacobi weight as $\omega^{\bm{\alpha},\bm{\beta}}(\bm{x}):=\omega^{\alpha_{1},\beta_{1}}(x_{1})\cdots\omega^{\alpha_{d},\beta_{d}}(x_{d})$ . We define the multidimensional Jacobi-weighted Sobolev space $B_{\alpha,\beta}^{m}(I^{d})$ as [54]:

[TABLE]

with norm and semi-norm

[TABLE]

Define the $L^{2}_{\omega^{\bm{\alpha},\bm{\beta}}}$ -orthogonal projection $\pi^{\bm{\alpha},\bm{\beta}}_{N}$ : $L^{2}_{\omega^{\bm{\alpha},\bm{\beta}}}(I^{d})\rightarrow Q_{N}^{d}(I^{d})$ by the property

[TABLE]

Then for $u\in{B^{m}_{\bm{\alpha},\bm{\beta}}}(I^{d})$ , we have the following error estimate (see Theorem 8.1 and Remark 8.13 in [54]):

[TABLE]

where $c$ is an absolute constant. Combining (67) and Theorem 3.3, we obtain the following upper bound for the $\varepsilon$ -approximation of functions in ${B^{m}_{\bm{\alpha},\bm{\beta}}}(I^{d})$ space.

Theorem 3.4.

For any $u\!\in\!{B^{m}_{\bm{\alpha},\bm{\beta}}}(I^{d})$ , with $|u|_{{B^{m}_{\bm{\alpha},\bm{\beta}}}(I^{d})}\leq 1,\ \bm{\alpha,\beta}\!\in\!(-1,\infty)^{d}$ , and any $\varepsilon\in(0,1)$ there exists a $\sigma_{2}$ neural network $\Phi_{\varepsilon}^{u}$ having $\mathcal{O}\left(\frac{d}{m}\log_{2}\frac{1}{\varepsilon}+d\right)$ layers with no more than $\mathcal{O}\left(\varepsilon^{-d/m}\right)$ nodes and nonzero weights, that approximates $u$ with ${L^{2}_{\omega^{\bm{\alpha},\bm{\beta}}}(I^{d})}$ -error less than $\varepsilon$ , i.e.

[TABLE]

Remark 3.5.

According to the classic nonlinear approximation theory by DeVore, Howard and Micchelli [56], the results of Theorem 2.12 (first part) and Theorem 3.4 are optimal in the case that the approximation depends on the function to be approximated continuously.

Remark 3.6.

Note that the results for approximating functions in weighted Sobolev space given in Theorem 3.4 can be extended to $C^{k}$ if $k$ is sufficiently large, similar to the second part of Theorem 2.12. Comparing this result with Theorem 1 in [21], we see that the number of computational units and nonzero weights needed by a ReQU network to approximate a function $u\in{B^{m}_{\bm{\alpha},\bm{\beta}}}(I^{d})$ for $m$ sufficiently large, with an error tolerance $\varepsilon$ , is less than that needed by a ReLU network: the ReLU network is $\mathcal{O}(\log\frac{1}{\varepsilon})$ times larger than the corresponding ReQU network. For low accuracy approximation, the factor $\mathcal{O}(\log\frac{1}{\varepsilon})$ is not very big, but for high accuracy approximations, this factor can be as large as several dozen, which could make a big difference in large scale computations.

Note that, for functions with fixed lower-order continuity, ReLU networks can give good approximations using fewer layers, or one can use very deep ReLU networks to break the bounds given in Theorem 3.4. We refer interested readers to the recent works by Voigtlaender and Petersen [57], and Yarotsky [58].

4 High-dimensional functions with sparse polynomial approximations

In the last section, we showed that a $d$ -dimensional function with partial derivatives up to order $m$ in $L^{2}(I^{d})$ can be approximated within error $\varepsilon$ by a ReQU neural network with complexity $\mathcal{O}(\varepsilon^{-d/m})$ . When $m$ is fixed or much smaller than $d$ , the network complexity has an exponential dependence on $d$ . However, in a lot of applications, high-dimensional problems may have low intrinsic dimension (see e.g. [59][60]). One particular example is high-dimensional tensor product functions (or linear combinations of finitely many tensor product functions), which can be well approximated by a hyperbolic cross or sparse grid truncated series.

4.1 A brief review of hyperbolic cross approximations and sparse grids

Sparse grids were originally introduced by S. A. Smolyak [30] to integrate or interpolate high dimensional functions. Hyperbolic cross approximation is a technique similar to sparse grids but without the concept of grids. We introduce hyperbolic cross approximation by considering a tensor product function: $f(\bm{x})=f_{1}(x_{1})\cdots f_{d}(x_{d})$ . Suppose that $f_{1},\ldots,f_{d}$ have similar regularity and can be well approximated using an orthonormal basis $\{\phi_{k},\;k=0,1,\ldots\}$ ; that is,

[TABLE]

where $c$ is a general constant, $r\geq 1$ is a constant depending on the regularity of $f_{i}$ , $\bar{k}:=\max\{1,k\}$ . So we have an expansion for $f$ as

[TABLE]

where

[TABLE]

Thus, to have a best approximation of $f(\bm{x})$ using finite terms, one should take

[TABLE]

where

[TABLE]

is the hyperbolic cross index set. We call $f_{N}$ defined by (69) a hyperbolic cross approximation of $f$ .
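To get a feeling for how much smaller a hyperbolic cross is than a full tensor grid, the index set can be enumerated directly. The Python sketch below assumes the usual definition $\chi_{N}^{d}=\{\bm{k}\in\mathbb{N}_{0}^{d}:\ \prod_{i=1}^{d}\bar{k}_{i}\leq N\}$ with $\bar{k}=\max\{1,k\}$ ; this form should be checked against the displayed definition, and the function name is introduced here only for illustration.

```python
def hyperbolic_cross(N, d):
    """List all k in N_0^d with prod_i max(1, k_i) <= N (assumed definition)."""
    if d == 0:
        return [()]
    indices = []
    for k in range(N + 1):
        kbar = max(1, k)
        if kbar > N:
            break
        # The remaining coordinates must satisfy prod <= floor(N / kbar).
        for rest in hyperbolic_cross(N // kbar, d - 1):
            indices.append((k,) + rest)
    return indices


def full_grid_size(N, d):
    """Number of indices with max_i k_i <= N."""
    return (N + 1) ** d
```

For example, with $N=4$ and $d=2$ the hyperbolic cross contains 17 indices, compared with 25 for the full grid; the gap widens quickly as $d$ grows.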

For general functions defined on $I^{d}$ , we choose $\phi_{\bm{k}}$ to be multivariate Jacobi polynomials $J_{\bm{n}}^{\bm{\alpha},\bm{\beta}}$ , and define the hyperbolic cross polynomial space as

[TABLE]

Note that the definition of $X_{N}^{d}$ does not depend on $\bm{\alpha}$ and $\bm{\beta}$ ; $\{J_{\bm{n}}^{\bm{\alpha},\bm{\beta}}\,\}$ merely serves as a basis for $X_{N}^{d}$ . To study the error of hyperbolic cross approximation, we define the Jacobi-weighted Korobov-type space

[TABLE]

with norm and semi-norm

[TABLE]

For any given $u\in\mathcal{K}^{0}_{\bm{\alpha},\bm{\beta}}(=B^{0}_{\bm{\alpha},\bm{\beta}})$ , the hyperbolic cross approximation $\pi^{\bm{\alpha,\beta}}_{N,H}u\in X^{d}_{N}$ can be defined as a projection by requiring

[TABLE]

Then we have the following error estimate about the hyperbolic cross approximation (see Theorem 2.2 in [33]):

[TABLE]

where $D_{1}$ is a constant independent of $N$ . It is known that the cardinality of $\chi_{N}^{d}$ is of order $\mathcal{O}(N(\log N)^{d-1})$ (see [33]). The above error estimate says that to approximate a function $u\in\mathcal{K}^{m}_{\bm{\alpha},\bm{\beta}}$ with an error tolerance $\varepsilon$ , one only needs a space of Jacobi polynomials of dimension at most $\mathcal{O}\left(\varepsilon^{-1/m}(\frac{1}{m}\log\frac{1}{\varepsilon})^{d-1}\right)$ , so the exponential dependence on $d$ is weakened (cf. Theorem 3.4). To remove the exponential term $(\log\frac{1}{\varepsilon})^{d-1}$ , one may consider a more general sparse polynomial space [33]:

[TABLE]

In particular, $X^{d}_{N,0}=X^{d}_{N}$ is the hyperbolic cross space defined in (71), and $X^{d}_{N,-\infty}:=\text{span}\big{\{}\,J_{\bm{n}}^{\bm{\alpha},\bm{\beta}},\ |\bm{n}|_{\infty}\leq N\,\big{\}}$ is the standard full grid. For $0<\gamma<1$ , it is known that (see Lemma 3 in [32]):

[TABLE]

where $C(\gamma,d)$ is a constant that depends on $\gamma$ and $d$ but is independent of $N$ . We call $X_{N,\gamma}^{d},0<\gamma<1$ optimized hyperbolic cross polynomial space. It is proved by Shen and Wang that the $L^{2}_{\omega^{\bm{\alpha},\bm{\beta}}}$ -orthogonal projection $\pi_{N,\gamma}^{\bm{\alpha},\bm{\beta}}$ from Korobov space to $X_{N,\gamma}^{d}$ satisfies the following estimate (see Theorem 2.3 in [33]):

[TABLE]

where $D_{2}$ is a constant independent of $N$ . From (77) and (78), we get that to approximate a function $u\in\mathcal{K}^{m}_{\bm{\alpha},\bm{\beta}}$ with an error tolerance $\varepsilon$ , one only needs a space of Jacobi polynomials of dimension at most $\mathcal{O}\big{(}\varepsilon^{-1/[m(1-\gamma(1-\frac{1}{d}))]}\big{)}$ . We will later use this estimate to derive another upper bound for approximating functions in ${\mathcal{K}^{m}_{\bm{\alpha},\bm{\beta}}}$ using deep ReQU networks.

In practice, the exact hyperbolic cross projection is not easy to calculate. An alternate approach is the sparse grid, which uses hierarchical interpolation schemes to build a hyperbolic cross-like approximation of high dimensional functions. To define sparse grids for $I^{d}$ , we first define the underlying 1-dimensional interpolations. Given a series of interpolation point sets $\mathcal{X}^{i}=\{x^{i}_{1},\cdots,x^{i}_{m_{i}}\}\subseteq[-1,1]$ , $m_{i}=\text{Card}(\mathcal{X}^{i})$ , $i=1,2,\ldots$ , with $0<m_{1}<m_{2}<\cdots$ , the interpolation on $\mathcal{X}^{i}$ for $f\in C^{0}(I)$ is defined as

[TABLE]

where $\ell^{i}_{j}(x)\in P_{m_{i}-1}([-1,1])\ (j=1,2,\ldots,m_{i})$ are the Lagrange interpolation polynomials for the interpolation points $\mathcal{X}^{i}$ . The sparse grid interpolation for high-dimension function $f\in C^{0}(I^{d})$ is defined as [30]:

[TABLE]

where $\Delta^{i}=\mathcal{U}^{i}-\mathcal{U}^{i-1}$ , $i\!\in\!\mathbb{N}$ . For convenience, we define $\mathcal{U}^{0}:=0$ , $m_{0}=0$ , $\mathcal{X}^{0}=\emptyset$ . Formally, (80) can be defined on any grids $\{\,\mathcal{X}^{i},\,i=1,2,\ldots,q-d+1\,\}$ . However, to have a one-to-one transform between the values on interpolation points and the coefficients of linearly independent bases in the interpolation space, we need $\{\,\mathcal{X}^{i},\,i=1,2,\ldots,q-d+1\,\}$ to be nested, i.e. $\mathcal{X}^{1}\subset\mathcal{X}^{2}\subset\cdots\subset\mathcal{X}^{q-d+1}$ . Fast transforms between physical values and interpolation coefficients always exist for sparse grid interpolations using nested grids [40, 41]. Define sparse grid index set as

[TABLE]

Then the set of the sparse grid interpolation points and the corresponding interpolation space are given as

[TABLE]

where $\tilde{\phi}_{\bm{k}}$ can be chosen as the hierarchical interpolation basis defined in [40], or the Lagrange-type $d$ -dimensional interpolation polynomial on points $\mathcal{X}^{q}_{d}$ , which takes value $1$ on the $\bm{k}$ -th interpolation point and $0$ on the other points.

A commonly used 1-dimensional scheme is the Chebyshev-Gauss-Lobatto scheme, which uses the extrema of the Chebyshev polynomials as interpolation points:

[TABLE]

In order to obtain nested sets of points, $m_{i}$ are chosen as

[TABLE]

with $x^{1}_{1}:=0$ . Define

[TABLE]

Then for any function $f\in F^{k}_{d}$ , with $\|f\|_{F^{k}_{d}}:=\max_{|\bm{\alpha}|_{\infty}\leq k}\|\partial^{\bm{\alpha}}f\|_{L^{\infty}}\leq 1$ , the interpolation error on the above Chebyshev sparse grids is bounded as follows (see Theorem 8 in [36]):

[TABLE]

where $n=\text{Card}(\mathcal{X}^{q}_{d})=\text{Card}(\mathcal{I}^{q}_{d})=\mathcal{O}(2^{q}q^{d-1})$ is the number of points in the sparse grids, and $c_{d,k}$ is a constant that depends on $d,k$ only. Note that if a different norm is used instead of the $L^{\infty}$ norm, one can improve the result slightly, but no result with an error bound smaller than $\mathcal{O}(n^{-k})$ is known.
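The nestedness requirement on the 1-dimensional point sets is easy to check numerically. The Python sketch below assumes the commonly used Chebyshev-Gauss-Lobatto convention $m_{1}=1$ , $m_{i}=2^{i-1}+1$ for $i>1$ and $x^{i}_{j}=-\cos\frac{\pi(j-1)}{m_{i}-1}$ , with $x^{1}_{1}=0$ ; these formulas are an assumption and should be compared with the displayed definitions. The function names are hypothetical.

```python
import math


def cgl_points(i):
    """Chebyshev-Gauss-Lobatto points of level i (assumed nested convention)."""
    if i == 1:
        return [0.0]
    m = 2 ** (i - 1) + 1
    return [-math.cos(math.pi * j / (m - 1)) for j in range(m)]


def is_nested(coarse, fine, tol=1e-12):
    """True if every coarse point also appears (up to tol) in the fine set."""
    return all(any(abs(p - q) <= tol for q in fine) for p in coarse)
```

With this convention each level roughly doubles the number of points while reusing all points of the previous level, which is what allows a one-to-one transform between point values and hierarchical coefficients.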

4.2 Error bounds of deep ReQU network approximation for multivariate functions with sparse structures

Now we discuss the ReQU network approximation of high-dimensional smooth functions with sparse polynomial expansions, taking hyperbolic cross and sparse grid polynomial expansions as examples. We first introduce the concept of a downward closed polynomial space. A linear polynomial space $P_{C}$ is said to be downward closed if it satisfies the following: if a $d$ -dimensional polynomial $p(\bm{x})\in P_{C}$ , then $\partial^{\bm{k}}_{\bm{x}}p(\bm{x})\in P_{C}$ for any $\bm{k}\in\mathbb{N}_{0}^{d}$ , and, at the same time, there exists a basis of $P_{C}$ that is composed of monomials only. It is easy to verify that the hyperbolic cross polynomial space $X^{d}_{N}$ , the sparse grid polynomial interpolation space $V^{q}_{d}$ , and the optimized hyperbolic cross space $X^{d}_{N,\gamma}$ are all downward closed. For a downward closed polynomial space, we have the following ReQU network representation results.

Theorem 4.1.

Let $P_{C}$ be a downward closed linear space of $d$ -dimensional polynomials with dimension $n$ . Then for any function $f\in P_{C}$ , there exists a $\sigma_{2}$ neural network having no more than $\sum_{i=1}^{d}\lfloor\log_{2}N_{i}\rfloor+d$ hidden layers and no more than $\mathcal{O}(n)$ activation functions and nonzero weights that can represent $f$ exactly. Here $N_{i}$ is the maximum polynomial degree with respect to the $i$ -th coordinate.

Proof 4.2.

The proof is similar to Theorem 3.1. First, $f$ can be written as a linear combination of monomials.

[TABLE]

where $\chi_{C}$ is the index set of $P_{C}$ with cardinality $n$ . Then we rearrange the summation as

[TABLE]

where $\chi_{C}^{k_{d}}$ are $d-1$ dimensional downward closed index sets that depend on the index $k_{d}$ . If each $a_{k_{d}}^{x_{1},\ldots,x_{d-1}}$ , $k_{d}=0,1,\ldots,N_{d}$ can be exactly represented by a $\sigma_{2}$ network with no more than $\sum_{i=1}^{d-1}\lfloor\log_{2}N_{i}\rfloor+(d-1)$ hidden layers, no more than $\mathcal{O}(\text{Card}(\chi_{C}^{k_{d}}))$ nodes and nonzero weights, then $f$ can be exactly represented by a $\sigma_{2}$ neural network with no more than $\sum_{i=1}^{d}\lfloor\log_{2}N_{i}\rfloor+d$ hidden layers, no more than $\mathcal{O}(n)$ nodes and nonzero weights, since the operation $\sum_{k_{d}=0}^{N_{d}}a_{k_{d}}^{x_{1},\ldots,x_{d-1}}x_{d}^{k_{d}}$ can be realized exactly by a $\sigma_{2}$ network with $\lfloor\log_{2}N_{d}\rfloor+1$ hidden layers and no more than $\mathcal{O}(N_{d})$ nodes and nonzero weights. So, by mathematical induction, we only need to prove that the theorem holds when $d=1$ , which is true by Theorem 2.6.
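The induction step relies on the fact that slicing a downward closed index set along the last coordinate again yields downward closed sets. The Python sketch below illustrates this on index sets of monomial exponents. It uses the equivalent index-set formulation (if $\bm{k}$ is in the set and $k_{i}>0$ , then $\bm{k}-\bm{e}_{i}$ is in the set), which follows from $\partial_{x_{i}}\bm{x}^{\bm{k}}=k_{i}\bm{x}^{\bm{k}-\bm{e}_{i}}$ ; the function names are chosen for this sketch only.

```python
def is_downward_closed(index_set):
    """Check that lowering any positive coordinate by one stays in the set."""
    idx = set(index_set)
    for k in idx:
        for i, ki in enumerate(k):
            if ki > 0:
                lower = k[:i] + (ki - 1,) + k[i + 1:]
                if lower not in idx:
                    return False
    return True


def slices_by_last(index_set):
    """Group indices by their last coordinate k_d, keeping the first d-1 entries."""
    slices = {}
    for k in index_set:
        slices.setdefault(k[-1], set()).add(k[:-1])
    return slices
```

The sizes of the slices add up to the size of the original set, which is why the node and weight counts of the sub-networks sum to $\mathcal{O}(n)$ .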

Remark 4.3.

According to Theorem 4.1, we have that:

1) For any $f\in X^{d}_{N}$ , there exists a ReQU network with no more than $d\lfloor\log_{2}N\rfloor+d$ hidden layers, no more than $\mathcal{O}(N(\log N)^{d-1})$ neurons and nonzero weights, that can represent $f$ with no error.

2) For any $f\in X^{d}_{N,\gamma}$ with $0<\gamma<1$ , there exists a ReQU network having no more than $d\lfloor\log_{2}N\rfloor+d$ hidden layers, no more than $\mathcal{O}(N)$ neurons and nonzero weights, that can represent $f$ with no error.

3) For any $f\in V^{q}_{d}$ , there exists a ReQU network having no more than $d(q-d+2)$ hidden layers, no more than $\mathcal{O}(2^{q}q^{d-1})$ neurons and nonzero weights, that can represent $f$ with no error.

Combining the results in Remark 4.3 with (75), (78) and (87), we obtain the following theorem.

Theorem 4.4.

We have the following results for ReQU network approximation of functions in $\mathcal{K}^{m}_{\bm{\alpha},\bm{\beta}}(I^{d})$ , $\bm{\alpha,\beta}\in(-1,\infty)^{d}$ , $m\geq 1$ and $F^{k}_{d}(I^{d})$ , $k\geq 1$ :

1) For any function $u\in\mathcal{K}^{m}_{\bm{\alpha},\bm{\beta}}(I^{d})$ , $m\geq 1$ with $|u|_{\mathcal{K}^{m}_{\bm{\alpha},\bm{\beta}}}\leq 1/D_{1}$ , any $\varepsilon>0$ , there exists a ReQU network $\Phi_{\varepsilon}^{u}$ with no more than $\frac{d}{m}\log_{2}\frac{1}{\varepsilon}+d$ hidden layers, no more than $\mathcal{O}\big{(}\varepsilon^{-1/m}(\frac{1}{m}\log\frac{1}{\varepsilon})^{d-1}\big{)}$ nodes and nonzero weights, such that

[TABLE]

2) For any function $u\in\mathcal{K}^{m}_{\bm{\alpha},\bm{\beta}}(I^{d})$ , $m\geq 1$ with $|u|_{\mathcal{K}^{m}_{\bm{\alpha},\bm{\beta}}}\leq 1/D_{2}$ , any $\varepsilon>0$ , $0<\gamma<1$ , there exists a ReQU network $\Phi_{\varepsilon}^{u}$ with no more than $\frac{d}{m(1-\gamma(1-\frac{1}{d}))}\log_{2}\frac{1}{\varepsilon}+d$ hidden layers, no more than $\mathcal{O}\big{(}\varepsilon^{-1/[m(1-\gamma(1-\frac{1}{d}))]}\big{)}$ nodes and nonzero weights, such that

[TABLE]

3) For any function $f\in F^{k}_{d}(I^{d})$ , $k\geq 1$ with $\|f\|_{F^{k}_{d}}\leq 1$ , any $\varepsilon>0$ , there exists a ReQU network $\Psi_{\varepsilon}^{f}$ with no more than $\mathcal{O}\left(\frac{d}{k}\log_{2}\frac{1}{\varepsilon}+d\right)$ hidden layers, no more than $\mathcal{O}\big{(}\varepsilon^{-\frac{1+\delta}{k}}(\frac{1+\delta}{k}\log_{2}\frac{1}{\varepsilon})^{d-1}\big{)}$ nodes and nonzero weights, such that

[TABLE]

where $\delta>0$ can be taken very close to $0$ for small enough $\varepsilon$ .

Remark 4.5.

Taking $m=2$ in Theorem 4.4, we obtain the following result: for any function $u\in\mathcal{K}^{2}_{\bm{\alpha},\bm{\beta}}(I^{d})$ with $|u|_{\mathcal{K}^{2}_{\bm{\alpha},\bm{\beta}}}\leq 1/D_{1}$ and any $\varepsilon>0$ , there exists a ReQU network $\Phi_{\varepsilon}^{u}$ with no more than $\frac{d}{2}\log_{2}\frac{1}{\varepsilon}+d$ hidden layers and no more than $\mathcal{O}\big{(}\varepsilon^{-1/2}(\frac{1}{2}\log\frac{1}{\varepsilon})^{d-1}\big{)}$ nodes and nonzero weights that approximates $u$ with tolerance $\varepsilon$ . A result on approximating similar functions with ReLU networks was recently given by Montanelli and Du [50]. To approximate a function in $\mathcal{K}^{2}_{\bm{\alpha},\bm{\beta}}(I^{d})$ with tolerance $\varepsilon$ , they constructed a ReLU network with $\mathcal{O}(|\log_{2}\varepsilon|\log_{2}d)$ layers and $\mathcal{O}(\varepsilon^{-\frac{1}{2}}|\log_{2}\varepsilon|^{\frac{3}{2}(d-1)+1}\log_{2}d)$ nonzero weights. Comparing the two results, we find that, while the number of layers required by ReQU networks might be larger than that of ReLU networks, the overall complexity of the ReQU network is $|\log_{2}\varepsilon|^{d}$ times smaller than that of the ReLU network.
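The two sets of bounds quoted in this remark can be tabulated side by side. The sketch below is purely illustrative: it sets every constant hidden in the $\mathcal{O}$ notation to 1, reads the unspecified $\log$ in the ReQU size bound as the natural logarithm, and uses hypothetical function names, so only the growth trends are meaningful, not the absolute numbers.

```python
import math


def requ_bounds(eps, d, m=2):
    """Depth and size bounds for the ReQU network (constants set to 1)."""
    depth = d / m * math.log2(1.0 / eps) + d
    size = eps ** (-1.0 / m) * (math.log(1.0 / eps) / m) ** (d - 1)
    return depth, size


def relu_bounds(eps, d):
    """Depth and size bounds quoted for the ReLU network (constants set to 1)."""
    log_eps = abs(math.log2(eps))
    depth = log_eps * math.log2(d)
    size = eps ** -0.5 * log_eps ** (1.5 * (d - 1) + 1) * math.log2(d)
    return depth, size


if __name__ == "__main__":
    for d in (2, 4, 8):
        for eps in (1e-2, 1e-4, 1e-6):
            rq_depth, rq_size = requ_bounds(eps, d)
            rl_depth, rl_size = relu_bounds(eps, d)
            print(f"d={d}, eps={eps:g}: ReQU depth {rq_depth:.1f}, size {rq_size:.3g}; "
                  f"ReLU depth {rl_depth:.1f}, size {rl_size:.3g}")
```

The ReLU expressions contain the factor $\log_{2}d$ , so the comparison is informative only for $d\geq 2$ .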

Remark 4.6.

When one uses the optimized hyperbolic cross polynomial approximation for functions in $\mathcal{K}^{m}_{\bm{\alpha},\bm{\beta}}(I^{d})$ with $|u|_{\mathcal{K}^{m}_{\bm{\alpha},\bm{\beta}}}\leq 1/D_{2}$ , the exponential growth in $d$ with a base related to $1/\varepsilon$ in the required ReQU network size is removed. Thus, in this case it seems that the curse of dimensionality no longer exists. Note, however, that the constant $D_{2}$ and the implicit constant hidden in the big $\mathcal{O}$ notation still depend on $d$ . In practice, the error bound given by the second case may not be better than that of the first case.

5 Some preliminary numerical results

In this section, we present some numerical results to verify that the proposed construction algorithms are numerically stable and efficient. We first present the results of representing univariate monomials in Table 1. The maximum norm error in this table is calculated by taking the maximum difference on 100 randomly chosen points in $[-1,1]$ . The results show that the ReQU network we constructed can achieve machine accuracy, which means our approach is numerically stable.
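A test of this kind can be sketched as follows. The code relies on the exact identities $x^{2}=\sigma_{2}(x)+\sigma_{2}(-x)$ and $xy=\frac{1}{4}[(x+y)^{2}-(x-y)^{2}]$ and builds $x^{n}$ by plain square-and-multiply. This is an assumption made for brevity, not the optimal construction of Theorem 2.4: it uses up to two ReQU layers per binary digit of $n$ . All function names are hypothetical.

```python
import numpy as np


def requ(x):
    """Rectified quadratic unit: sigma_2(x) = max(0, x)^2."""
    return np.maximum(0.0, x) ** 2


def square(x):
    """x^2 = sigma_2(x) + sigma_2(-x), exact with two ReQU nodes."""
    return requ(x) + requ(-x)


def multiply(x, y):
    """x*y = [(x+y)^2 - (x-y)^2] / 4, exact with four ReQU nodes."""
    return 0.25 * (requ(x + y) + requ(-x - y) - requ(x - y) - requ(y - x))


def monomial(x, n):
    """Evaluate x**n (n >= 1) using only ReQU nodes and affine maps."""
    result = x  # the leading binary digit of n is always 1
    for bit in bin(n)[3:]:  # remaining binary digits of n, most significant first
        result = square(result)
        if bit == "1":
            result = multiply(result, x)
    return result


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, 100)  # 100 random points in [-1, 1]
    for n in (2, 3, 7, 16, 31):
        err = np.max(np.abs(monomial(x, n) - x ** n))
        print(f"n={n}: max error {err:.2e}")
```

In a strictly layered network the input $x$ would also have to be carried forward through each hidden layer, for example by `multiply(x, 1.0)`; the sketch skips this bookkeeping.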

Similar results for representing univariate polynomials are given in Table 2. Here, the coefficients of the power series are drawn randomly from the standard normal distribution. These results also verify that our approach is stable and efficient.

Numerical tests for 2-dimensional polynomials in the tensor-product space and the hyperbolic cross space are presented in Tables 3 and 4, respectively. The coefficients of the corresponding power series are all drawn randomly from the standard normal distribution. The results verify the stability and efficiency of our method.

Next, we present some results on approximating 1-dimensional and 2-dimensional smooth functions using our approach, and compare them with trained ReLU network approximations. We first show the results of approximating $\sin(x)$ using a ReQU network built by our approach and a ReLU network with randomly initialized coefficients. The ReQU network is constructed using the proposed method based on a polynomial approximation of degree $8$ and then trained by the gradient descent method. The result is shown in the left plot of Fig. 3. For the ReLU network approximation, 5 ReLU networks with the same structure (8 fully connected hidden layers with 64 ReLU nodes each) are trained using the mini-batch stochastic gradient descent method. The best result among the 5 ReLU networks is shown in the right plot of Fig. 3. Note that the number of hidden nodes used by the ReQU network is less than $64$ , and it gives much better results than the trained ReLU network. By training the constructed ReQU network, the approximation error can be further reduced. Similar results for approximating the 2-dimensional function $\sin(x)\sin(y)$ are presented in Fig. 4.
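The accuracy available from a degree-8 polynomial before any training can be estimated with a few lines of NumPy. The sketch below assumes Chebyshev interpolation on $[-1,1]$ ; the text does not state which polynomial approximation or which interval was used for Fig. 3, so both choices are assumptions made here for illustration.

```python
import numpy as np
from numpy.polynomial import Chebyshev

# Degree-8 Chebyshev interpolant of sin on [-1, 1] (interval and method assumed).
p = Chebyshev.interpolate(np.sin, 8, domain=[-1, 1])

# Maximum error on a fine uniform grid.
xs = np.linspace(-1.0, 1.0, 2001)
max_err = np.max(np.abs(p(xs) - np.sin(xs)))

if __name__ == "__main__":
    print(f"degree {p.degree()} interpolant, max error {max_err:.2e}")
```

Since the constructed ReQU network represents such a polynomial with no additional error, the interpolation error is also the error of the network before training.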

6 Conclusion and future work

In this paper, we gave constructive proofs of some error bounds for approximating smooth functions by deep neural networks using RePU functions as the activation functions. The proofs rely on the fact that polynomials can be represented by RePU networks with no approximation error. We construct several optimal algorithms for such representations, in which polynomials of degree no more than $n$ are converted into a ReQU network with $\mathcal{O}(\log_{2}n)$ layers, and the size of the network is of the same scale as the dimension of the polynomial space to be approximated. Then, by using the classical polynomial approximation theory, we obtain upper error bounds for ReQU networks approximating smooth functions, which show clear advantages of using the ReQU activation function compared to the existing results for ReLU networks. In general, the ReLU network required to approximate a sufficiently smooth function is $\mathcal{O}(\log\frac{1}{\varepsilon})$ times larger than the corresponding ReQU network, where $\varepsilon$ is the approximation error. To achieve an $\varepsilon$ -approximation for $f\in B^{\infty}_{\alpha,\beta}$ , the number of layers of the ReQU network required is $\mathcal{O}(\log_{2}\log\frac{1}{\varepsilon})$ , while the corresponding best known result for ReLU networks is $\mathcal{O}(\log\frac{1}{\varepsilon})$ . For high dimensional functions with bounded mixed derivatives, we give error bounds that have a weaker exponential dependence on $d$ by using hyperbolic cross/sparse grid spectral approximation. In particular, if optimized hyperbolic cross polynomial projections are used, no term related to $\varepsilon$ depends exponentially on $d$ . Since only global polynomial approximations are considered in this paper, the results obtained also hold for deep networks with non-rectified power units. The use of rectified units gives the neural network the ability to approximate piecewise smooth functions efficiently, which will be analyzed in a separate paper.

Our constructions of RePU networks also reveal the close relation between the depth of the RePU network and the “order” of polynomial approximation. The advantage of using deep over shallow ReQU networks is clearly shown by our constructive proofs: by using one hidden layer, a ReQU network can only represent piecewise quadratic polynomials; by using $n$ hidden layers, a ReQU network can represent piecewise polynomials of degree up to $\mathcal{O}(2^{n})$ . The ReQU networks we built for approximating smooth functions all have a tree-like structure, and are sparsely connected. This may give some hints on how to design appropriate structures of neural networks for some practical applications.
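The degree doubling per layer can be seen from a short worked identity, written here with $\sigma_{2}(x)=\max(0,x)^{2}$ . One hidden layer with two ReQU nodes squares its input exactly,

$$x^{2}=\sigma_{2}(x)+\sigma_{2}(-x),$$

so stacking $n$ such layers, each fed with the output $y_{j-1}$ of the previous one, gives

$$y_{0}=x,\qquad y_{j}=\sigma_{2}(y_{j-1})+\sigma_{2}(-y_{j-1})=y_{j-1}^{2},\qquad y_{n}=x^{2^{n}}.$$

This is only the simplest instance: a width-2 chain of depth $n$ already reaches degree $2^{n}$ , while a single hidden layer cannot exceed degree 2 on any piece.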

We have shown theoretically that, for approximating sufficiently smooth functions, ReQU networks are superior to ReLU networks in terms of approximation error. We also presented efficient and stable algorithms to construct ReQU networks based on polynomial approximation. Our preliminary results demonstrate that our constructions are numerically stable and efficient. The constructed neural network can be regarded as a good initial guess for a RePU network and further trained to get better results. For low dimensional problems, this approach is much more accurate than directly training a randomly initialized ReLU neural network.

In practical applications, the functions to be approximated may have different kinds of non-smoothness, which are problem dependent. The training method is another key factor that affects the application of neural networks. We will continue our study in these directions. In particular, we will study the approximation error of piecewise smooth functions with deep ReQU networks, and investigate whether the popular training methods proposed for ReLU networks are efficient for training RePU networks. Meanwhile, we will try deep RePU networks on some practical problems where the underlying functions are smooth, e.g. minimum action methods for large PDE systems [61], PDEs with random coefficients [62], moment closure problems in complex fluids [63], and turbulence modeling [64].

Acknowledgments

We are indebted to Prof. Jie Shen and Prof. Li-Lian Wang for their stimulating conversations on spectral methods. We would also like to thank Prof. Christoph Schwab and Prof. Hrushikesh N. Mhaskar for providing us with some related references. This work was partially supported by China National Program on Key Basic Research Project 2015CB856003, NNSFC Grant 11771439, 91852116, and China Science Challenge Project, no. TZ2018001. The computations were performed on the PC clusters of State Key Laboratory of Scientific and Engineering Computing of Chinese Academy of Sciences.

Bibliography

[1] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
[2] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[3] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, pages 153–160, 2007.
[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[5] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, and Tara Sainath. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process. Mag., 29, 2012.
[6] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[7] Jiequn Han, Linfeng Zhang, Roberto Car, and Weinan E. Deep potential: A general representation of a many-body potential energy surface. Communications in Computational Physics, 23(3), 2018.
[8] Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning. PNAS, 115(34):8505–8510, 2018.
