Towards a regularity theory for ReLU networks – chain rule and global error estimates

Julius Berner; Dennis Elbrächter; Philipp Grohs; Arnulf Jentzen

arXiv:1905.04992·cs.LG·November 12, 2020


TL;DR

This paper develops a rigorous derivative concept for ReLU neural networks that satisfies the chain rule and provides a method to extend local approximation results to global estimates, enhancing understanding of neural network regularity.

Contribution

It introduces a derivative framework compatible with the chain rule for ReLU networks and offers a technique to convert local approximation results into global estimates.

Findings

1. A new derivative concept satisfying the chain rule for ReLU networks

2. Method to extend local approximation results to global estimates

3. Application to high-dimensional PDEs in deep learning

Abstract

Although for neural networks with locally Lipschitz continuous activation functions the classical derivative exists almost everywhere, the standard chain rule is in general not applicable. We will consider a way of introducing a derivative for neural networks that admits a chain rule, which is both rigorous and easy to work with. In addition we will present a method of converting approximation results on bounded domains to global (pointwise) estimates. This can be used to extend known neural network approximation theory to include the study of regularity properties. Of particular interest is the application to neural networks with ReLU activation function, where it contributes to the understanding of the success of deep learning methods for high-dimensional partial differential equations.

Equations

$\Phi = ((A_{k}, b_{k}))_{k=1}^{L},$

$\mathcal{R}\Phi = W_{L} \circ \operatorname{ReLU} \circ W_{L-1} \circ \dots \circ \operatorname{ReLU} \circ W_{1},$

$\operatorname{ReLU}(x) := (\max\{0, x_{1}\}, \dots, \max\{0, x_{N}\})$

$\Phi = ((A_{k}, b_{k}))_{k=1}^{L}, \quad \Psi = ((\tilde{A}_{k}, \tilde{b}_{k}))_{k=1}^{\tilde{L}}$

$\Psi\odot\Phi := \big(((A_{k},b_{k}))_{k=1}^{L-1},\ (\tilde{A}_{1}A_{L},\ \tilde{A}_{1}b_{L}+\tilde{b}_{1}),\ ((\tilde{A}_{k},\tilde{b}_{k}))_{k=2}^{\tilde{L}}\big).$

$\mathcal{R}(\Psi \odot \Phi) = \mathcal{R}\Psi \circ \mathcal{R}\Phi.$

$(D(u \circ v))(x) = (Du)(v(x)) \cdot (Dv)(x).$

$H(x) := \operatorname{diag}\big(\mathbbm{1}_{(0,\infty)}(x_{1}), \dots, \mathbbm{1}_{(0,\infty)}(x_{N})\big)$

$\mathcal{D}\Phi := A_{L} \cdot H(\mathcal{R}_{L-1}\Phi) \cdot A_{L-1} \cdot \dots \cdot H(\mathcal{R}_{1}\Phi) \cdot A_{1}.$

$(\mathcal{D}\Phi)(x) = (D(\mathcal{R}\Phi))(x).$

$L_{i} := \{x \in \mathbb{R}^{d} : w_{i}(x) = 0\} = \{x \in \mathbb{R}^{d} : v_{i}(x) \leq 0\}.$

$(D w_{i})(x) = 0$ for almost every $x \in L_{i}.$

$D w_{i} = \mathbbm{1}_{\mathbb{R}^{d} \setminus L_{i}} \cdot D v_{i} = \mathbbm{1}_{(0,\infty)}(v_{i}) \cdot D v_{i}$

$D(\operatorname{ReLU} \circ v) = H(v) \cdot D v.$

$(\mathcal{D}(\Psi \odot \Phi))(x) = (\mathcal{D}\Psi)(\mathcal{R}\Phi(x)) \cdot (\mathcal{D}\Phi)(x).$

$\mathcal{D}(\Psi \odot \Phi) = D(\mathcal{R}(\Psi \odot \Phi)) = D(\mathcal{R}\Psi \circ \mathcal{R}\Phi).$

$\lim_{y\to\mathcal{R}\Phi(x)}\big[(\mathcal{D}\Psi)(y)-(\mathcal{D}\Psi)(\mathcal{R}\Phi(x))\big]\cdot(\mathcal{D}\Phi)(x)=0.$

$\lim_{y \to \mathcal{R}\Phi(x)} \big[H(u(y)) - H(u(\mathcal{R}\Phi(x)))\big] \cdot (D(u \circ \mathcal{R}\Phi))(x) = 0.$

$\lim_{y \to \mathcal{R}\Phi(x)} \mathbbm{1}_{(0,\infty)}(u_{i}(y)) = \mathbbm{1}_{(0,\infty)}(u_{i}(\mathcal{R}\Phi(x)))$

$(D(u_{i} \circ \mathcal{R}\Phi))(x) = 0$

$(\bar{D}\varrho)(x_{i}) := \begin{cases} 0, & x_{i} \in S, \\ (D\varrho)(x_{i}), & \text{else}. \end{cases}$

$\{x \in \mathbb{R}^{d} : w_{i}(x) = s\}, \quad s \in S.$

$\big\| [\mathcal{D}\Psi \circ \mathcal{R}\Phi - Du \circ \mathcal{R}\Phi]\, \mathcal{D}\Phi \big\|_{L^{\infty}}$

$\| \mathcal{D}\Psi - Du \|_{L^{\infty}} \, \| \mathcal{D}\Phi \|_{L^{\infty}}.$

$\| f - \mathcal{R}\Phi_{\varepsilon,B} \|_{L^{\infty}(I_{B})} \leq \varepsilon,$

$\| Df - \mathcal{D}\Phi_{\varepsilon,B} \|_{L^{\infty}(I_{B})} \leq c\,\varepsilon^{r}.$

$\| (Df)(x) \|_{2} \leq c\,(1 + \|x\|_{2}^{\kappa}).$

$\mathcal{R}\Phi_{B}^{\operatorname{char}}(x) = 1, \ x \in I_{B}, \qquad \mathcal{R}\Phi_{B}^{\operatorname{char}}(x) = 0, \ x \notin I_{B+1}.$

$B_{\varepsilon} \in \mathcal{O}(\varepsilon^{-1})$ and $b_{\varepsilon} \in \mathcal{O}(\varepsilon^{-\kappa-1}).$

$| f(x) - \mathcal{R}\Phi_{\varepsilon}(x) | \leq \varepsilon\,(1 + \|x\|_{2}^{\kappa+2})$


Full text

Towards a regularity theory for ReLU networks – chain rule and global error estimates

Julius Berner1, Dennis Elbrächter1, Philipp Grohs3, Arnulf Jentzen4

1Faculty of Mathematics, University of Vienna

Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria

3Faculty of Mathematics and Research Platform DataScience@UniVienna, University of Vienna

Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria

4Department of Mathematics, ETH Zürich

Rämistrasse 101, 8092 Zürich, Switzerland


I Introduction

It has been observed that deep neural networks exhibit the remarkable capability of overcoming the curse of dimensionality in a number of different scenarios. In particular, for certain types of high-dimensional partial differential equations (PDEs) there are promising empirical observations [1, 2, 3, 4, 5, 6, 7] backed by theoretical results for both the approximation error [8, 9, 10, 11] and the generalization error [12]. In this context it becomes relevant not only to show how well a given function of interest can be approximated by neural networks but also to extend the study to the derivative of this function. A number of recent publications [13, 14, 15] have investigated the size of a network that suffices to approximate certain interesting (classes of) functions within a given accuracy. This is achieved, first, by considering the approximation of basic functions by very simple networks and, subsequently, by combining those networks in order to approximate more difficult structures.

To extend this approach to include the regularity of the approximation, one requires some kind of chain rule for the composition of neural networks. For neural networks with differentiable activation function the standard chain rule is sufficient. It fails, however, for neural networks with an activation function that is not everywhere differentiable. Although locally Lipschitz continuous functions are differentiable almost everywhere (a.e.) with respect to the Lebesgue measure, the standard chain rule is not applicable, as, in general, it does not hold even in an 'almost everywhere' sense. We will introduce derivatives of neural networks in a way that admits a chain rule which is both rigorous and easy to work with. Chain rules for functions which are not everywhere differentiable have been considered in a more general setting in e.g. [16, 17]. We employ the specific structure of neural networks to get stronger results using simpler arguments. In particular, it allows for a stability result, i.e. Lemma III.3, the application of which will be discussed in Section V. We would also like to mention a very recent work [18] on approximation in Sobolev norms, which deals with the issue by using a general bound for the Sobolev norm of the composition of functions from the Sobolev space $W^{1,\infty}$ . Note, however, that this approach leads to a certain factor depending on the dimensions of the domains of the functions, which can be avoided with our method.

For ease of exposition, we formulate our results for neural networks with the ReLU activation function. We do, however, consider in Section IV how such a chain rule can be obtained for any activation function which is locally Lipschitz continuous (with at most countably many points at which it is not differentiable). In Section V we briefly sketch how the results from Section III can be utilized to get approximation results for certain classes of functions. Subsequently, in Section VI, we present a general method of deriving global error estimates from such approximation results, which are naturally obtained for bounded domains. Finally, we discuss how our results can be used to extend known theory, enabling the further study of the approximation of PDE solutions by neural networks.

II Setting

As in [14], we consider a neural network $\Phi$ to be a finite sequence of matrix-vector pairs, i.e.

$$\Phi = ((A_{k}, b_{k}))_{k=1}^{L}, \tag{1}$$

where $A_{k}\in{\mathbb{R}}^{N_{k}\times N_{k-1}}$ and $b_{k}\in{\mathbb{R}}^{N_{k}}$ for some depth $L\in{\mathbb{N}}$ and layer dimensions $N_{0},N_{1},\dots,N_{L}\in{\mathbb{N}}$ . The realization of the neural network $\Phi$ is the function ${\mathcal{R}{\Phi}}\colon{\mathbb{R}}^{N_{0}}\to{\mathbb{R}}^{N_{L}}$ given by

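$$\mathcal{R}\Phi = W_{L} \circ \operatorname{ReLU} \circ W_{L-1} \circ \dots \circ \operatorname{ReLU} \circ W_{1}, \tag{2}$$

where $W_{k}(x)=A_{k}x+b_{k}$ for every $x\in{\mathbb{R}}^{N_{k-1}}$ and where

$$\operatorname{ReLU}(x) := (\max\{0, x_{1}\}, \dots, \max\{0, x_{N}\}) \tag{3}$$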

for every $x\in{\mathbb{R}}^{N}$ . We distinguish between a neural network and its realization, since $\Phi$ uniquely induces ${\mathcal{R}{\Phi}}$ , while in general there can be multiple non-trivially different neural networks with the same realization. The representation of a neural network as a structured set of weights as in (1) allows the introduction of notions of network sizes. While there are slight differences between various publications, commonly considered quantities are the depth (i.e. number of affine transformations), the connectivity (i.e. number of non-zero entries of the $A_{k}$ and $b_{k}$ ), and the weight bound (i.e. maximum of the absolute values of the entries of the $A_{k}$ and $b_{k}$ ). In [15] it has been shown that these three quantities determine the length of a bit string which is sufficient to encode the network with a prescribed quantization error. In the following let

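$$\Phi = ((A_{k}, b_{k}))_{k=1}^{L}, \quad \Psi = ((\tilde{A}_{k}, \tilde{b}_{k}))_{k=1}^{\tilde{L}} \tag{4}$$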

be neural networks with matching dimensions in the sense that ${{\mathcal{R}{\Phi}}\colon{\mathbb{R}}^{d}\to{\mathbb{R}}^{m}}$ and ${{\mathcal{R}{\Psi}}\colon{\mathbb{R}}^{m}\to{\mathbb{R}}^{n}}$ . We then define their composition as

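$$\Psi\odot\Phi := \big(((A_{k},b_{k}))_{k=1}^{L-1},\ (\tilde{A}_{1}A_{L},\ \tilde{A}_{1}b_{L}+\tilde{b}_{1}),\ ((\tilde{A}_{k},\tilde{b}_{k}))_{k=2}^{\tilde{L}}\big). \tag{5}$$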

Direct computation shows

$$\mathcal{R}(\Psi \odot \Phi) = \mathcal{R}\Psi \circ \mathcal{R}\Phi. \tag{6}$$
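The definitions (1) to (6) can be tried out numerically. The following is a minimal sketch in plain Python, not code from the paper: the helper names (`realize`, `compose`) and the example weights are assumptions chosen for illustration. It stores a network as a list of `(A, b)` pairs, evaluates the realization (2), builds the composition (5), and compares both sides of (6) at one point.

```python
def matvec(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def affine(A, b, x):
    # W_k(x) = A_k x + b_k
    return [s + t for s, t in zip(matvec(A, x), b)]

def relu(x):
    return [max(0.0, t) for t in x]

def realize(phi, x):
    # Realization (2): ReLU after every affine map except the last one.
    for A, b in phi[:-1]:
        x = relu(affine(A, b, x))
    A, b = phi[-1]
    return affine(A, b, x)

def compose(psi, phi):
    # Composition (5): merge the last layer of phi with the first layer of psi.
    A_L, b_L = phi[-1]
    A_1, b_1 = psi[0]
    merged = (matmul(A_1, A_L),
              [s + t for s, t in zip(matvec(A_1, b_L), b_1)])
    return phi[:-1] + [merged] + psi[1:]

# Hypothetical example weights (not taken from the paper).
phi = [([[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0]), ([[1.0, 1.0]], [0.5])]  # R^2 -> R^1
psi = [([[2.0], [-1.0]], [0.0, 1.0]), ([[1.0, 3.0]], [-2.0])]           # R^1 -> R^1

x = [0.25, -0.75]
lhs = realize(compose(psi, phi), x)   # R(Psi ⊙ Phi)(x)
rhs = realize(psi, realize(phi, x))   # (R Psi ∘ R Phi)(x)
print(lhs, rhs)
```

Note that the composed network has depth $L+\tilde{L}-1$, since the two adjacent affine maps are merged into one layer.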

Note that the realization $\mathcal{R}\Phi$ of a neural network $\Phi$ is continuous piecewise linear (CPL) as a composition of CPL functions. Consequently, it is Lipschitz continuous and therefore almost everywhere differentiable by Rademacher's theorem. In particular, all three functions in (6) are a.e. differentiable. This, however, is not sufficient to get the derivative of $\mathcal{R}(\Psi\odot\Phi)$ from the derivatives of $\mathcal{R}\Psi$ and $\mathcal{R}\Phi$ by use of the classical chain rule. Consider the very simple counterexample of $u(x):=\operatorname{ReLU}(x)$ and $v(x):=0$ and formally apply the chain rule, i.e.

$$(D(u \circ v))(x) = (Du)(v(x)) \cdot (Dv)(x). \tag{7}$$

Even though $({D}u)(y)$ is well-defined for every $y\in{\mathbb{R}}\backslash\{0\}$ , the expression $({D}u)(v(x))$ is defined for no $x\in{\mathbb{R}}$ . In general this problem occurs when the inner function maps a set of positive measure into a set where the derivative of the outer function does not exist. Now in this case, one can directly see that setting $({D}u)(0)$ to any arbitrary value would cause (7) to provide the correct result since $({D}v)(x)=0$ .

III ReLU network derivative

We proceed by defining the derivative of an arbitrary neural network in a way such that it not only coincides a.e. with the derivative of the realization, but also admits a chain rule. To this end let $H\colon{\mathbb{R}}^{N}\to{\mathbb{R}}^{N\times N}$ be the function given by

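$$H(x) := \operatorname{diag}\big(\mathbbm{1}_{(0,\infty)}(x_{1}), \dots, \mathbbm{1}_{(0,\infty)}(x_{N})\big) \tag{8}$$

for every $x=(x_{1},\dots,x_{N})\in{\mathbb{R}}^{N}$ and let $\mathcal{R}_{K}\Phi:=\mathcal{R}\big(((A_{k},b_{k}))_{k=1}^{K}\big)$ . We then define the neural network derivative of $\Phi$ as the function ${\mathcal{D}}\Phi\colon{\mathbb{R}}^{N_{0}}\to{\mathbb{R}}^{N_{L}\times N_{0}}$ given by

$$\mathcal{D}\Phi := A_{L} \cdot H(\mathcal{R}_{L-1}\Phi) \cdot A_{L-1} \cdot \dots \cdot H(\mathcal{R}_{1}\Phi) \cdot A_{1}. \tag{9}$$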

Note that this definition is motivated by formally applying the chain rule with the convention that the derivative of $\max\{0,\,\cdot\,\}$ is zero at the origin. Now we need to verify that this is justified.

Theorem III.1.

It holds for almost every $x\in{\mathbb{R}}^{d}$ that

$$(\mathcal{D}\Phi)(x) = (D(\mathcal{R}\Phi))(x). \tag{10}$$

Proof.

Let $v\colon{\mathbb{R}}^{d}\to{\mathbb{R}}^{N}$ be a locally Lipschitz continuous function, define $w:=\operatorname{ReLU}\circ\,v$ , and

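$$L_{i} := \{x \in \mathbb{R}^{d} : w_{i}(x) = 0\} = \{x \in \mathbb{R}^{d} : v_{i}(x) \leq 0\}. \tag{11}$$

We now use an observation about differentiability on level sets (see e.g. [19, Thm 3.3(i)]), which states that

$$(D w_{i})(x) = 0 \quad \text{for almost every } x \in L_{i}. \tag{12}$$

As $w_{i}(x)=v_{i}(x)$ for every $x\in{\mathbb{R}}^{d}\backslash L_{i}$ , we get a.e.

$$D w_{i} = \mathbbm{1}_{\mathbb{R}^{d} \setminus L_{i}} \cdot D v_{i} = \mathbbm{1}_{(0,\infty)}(v_{i}) \cdot D v_{i} \tag{13}$$

and consequently

$$D(\operatorname{ReLU} \circ v) = H(v) \cdot D v. \tag{14}$$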

The claim follows by induction over the layers $K=1,\dots,L$ of $\Phi$ , using (14) with $v=\mathcal{R}_{K}\Phi$ for the induction step. ∎
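As an informal sanity check of Theorem III.1, the sketch below evaluates the product (9) for a small network and compares it with a central finite-difference Jacobian of the realization at a point away from the kinks. It also evaluates (9) at the kink of a one-neuron ReLU network, where the convention in (8) yields zero. The function names and weights are illustrative assumptions, and a finite-difference check at one point is of course no substitute for the proof.

```python
def matvec(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def affine(A, b, x):
    return [s + t for s, t in zip(matvec(A, x), b)]

def relu(x):
    return [max(0.0, t) for t in x]

def realize(phi, x):
    for A, b in phi[:-1]:
        x = relu(affine(A, b, x))
    A, b = phi[-1]
    return affine(A, b, x)

def network_derivative(phi, x):
    # (9): A_L H(R_{L-1} Phi(x)) A_{L-1} ... H(R_1 Phi(x)) A_1
    A, b = phi[0]
    J, z = A, affine(A, b, x)          # z = R_1 Phi(x)
    for A, b in phi[1:]:
        # Left-multiplying by H(z) zeroes the rows with z_i <= 0.
        J = [[v if zi > 0 else 0.0 for v in row] for row, zi in zip(J, z)]
        J = matmul(A, J)
        z = affine(A, b, relu(z))      # next pre-activation R_K Phi(x)
    return J

def finite_difference(phi, x, h=1e-6):
    cols = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = realize(phi, xp), realize(phi, xm)
        cols.append([(p - m) / (2 * h) for p, m in zip(fp, fm)])
    return [list(row) for row in zip(*cols)]  # transpose to N_L x N_0

# Hypothetical example weights (not taken from the paper).
phi = [([[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0]), ([[1.0, 1.0]], [0.5])]
x = [0.25, -0.75]
jac = network_derivative(phi, x)
jac_fd = finite_difference(phi, x)

# One-neuron network realizing ReLU; x = 0 is its kink.
relu_net = [([[1.0]], [0.0]), ([[1.0]], [0.0])]
at_kink = network_derivative(relu_net, [0.0])
print(jac, jac_fd, at_kink)
```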

Note that even for convex ${\mathcal{R}{\Phi}}$ the values of ${\mathcal{D}}\Phi$ on the nullset do not necessarily lie in the respective subdifferentials of ${\mathcal{R}{\Phi}}$ , as can be seen in Figure 1. Although Theorem III.1 holds regardless of which value is chosen for the derivative of $\max\{0,\,\cdot\,\}$ at the origin, no choice will guarantee that all values of ${\mathcal{D}}\Phi$ lie in the respective subdifferentials of ${\mathcal{R}{\Phi}}$ . Here we have set the derivative at the origin to zero, following the convention of software implementations for deep learning applications, e.g. TensorFlow and PyTorch. Using (5) and (9) one can verify by direct computation that ${\mathcal{D}}$ obeys the chain rule.

Corollary III.2.

It holds for every $x\in{\mathbb{R}}^{d}$ that

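$$(\mathcal{D}(\Psi \odot \Phi))(x) = (\mathcal{D}\Psi)(\mathcal{R}\Phi(x)) \cdot (\mathcal{D}\Phi)(x). \tag{15}$$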

Note that (15) is well-defined as ${\mathcal{D}}\Psi$ exists everywhere, although it only coincides with ${D}({\mathcal{R}{\Psi}})$ almost everywhere. Theorem III.1 however guarantees that we still have a.e.

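$$\mathcal{D}(\Psi \odot \Phi) = D(\mathcal{R}(\Psi \odot \Phi)) = D(\mathcal{R}\Psi \circ \mathcal{R}\Phi). \tag{16}$$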
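The chain rule of Corollary III.2 holds at every point, including points that the inner network maps onto a kink of the outer one. The following sketch (helper names and weights are illustrative assumptions, not from the paper) checks the identity once at a generic point and once for the counterexample from Section II, where the inner network realizes $v(x)=0$ and the outer one realizes $\operatorname{ReLU}$.

```python
def matvec(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def affine(A, b, x):
    return [s + t for s, t in zip(matvec(A, x), b)]

def relu(x):
    return [max(0.0, t) for t in x]

def realize(phi, x):
    for A, b in phi[:-1]:
        x = relu(affine(A, b, x))
    A, b = phi[-1]
    return affine(A, b, x)

def compose(psi, phi):
    # Network composition: merge last layer of phi with first layer of psi.
    A_L, b_L = phi[-1]
    A_1, b_1 = psi[0]
    merged = (matmul(A_1, A_L),
              [s + t for s, t in zip(matvec(A_1, b_L), b_1)])
    return phi[:-1] + [merged] + psi[1:]

def network_derivative(phi, x):
    # A_L H(R_{L-1} Phi(x)) A_{L-1} ... H(R_1 Phi(x)) A_1
    A, b = phi[0]
    J, z = A, affine(A, b, x)
    for A, b in phi[1:]:
        J = [[v if zi > 0 else 0.0 for v in row] for row, zi in zip(J, z)]
        J = matmul(A, J)
        z = affine(A, b, relu(z))
    return J

def chain_rule_sides(psi, phi, x):
    lhs = network_derivative(compose(psi, phi), x)
    rhs = matmul(network_derivative(psi, realize(phi, x)),
                 network_derivative(phi, x))
    return lhs, rhs

# Generic point, hypothetical weights.
phi = [([[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0]), ([[1.0, 1.0]], [0.5])]
psi = [([[2.0], [-1.0]], [0.0, 1.0]), ([[1.0, 3.0]], [-2.0])]
lhs, rhs = chain_rule_sides(psi, phi, [0.25, -0.75])

# Counterexample from Section II: inner network is constantly zero,
# so every x is mapped onto the kink of the outer ReLU network.
zero_net = [([[0.0]], [0.0])]
relu_net = [([[1.0]], [0.0]), ([[1.0]], [0.0])]
lhs_kink, rhs_kink = chain_rule_sides(relu_net, zero_net, [3.0])
print(lhs, rhs, lhs_kink, rhs_kink)
```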

Next we provide a technical result dealing with the stability of our chain rule, which will prove to be useful in Section V.

Lemma III.3.

It holds for almost every $x\in{\mathbb{R}}^{d}$ that

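$$\lim_{y\to\mathcal{R}\Phi(x)}\big[(\mathcal{D}\Psi)(y)-(\mathcal{D}\Psi)(\mathcal{R}\Phi(x))\big]\cdot(\mathcal{D}\Phi)(x)=0. \tag{17}$$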

Proof.

We first show for every locally Lipschitz continuous function $u\colon{\mathbb{R}}^{m}\to{\mathbb{R}}^{N}$ and for almost every $x\in{\mathbb{R}}^{d}$ that

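$$\lim_{y \to \mathcal{R}\Phi(x)} \big[H(u(y)) - H(u(\mathcal{R}\Phi(x)))\big] \cdot (D(u \circ \mathcal{R}\Phi))(x) = 0. \tag{18}$$

If $u_{i}({\mathcal{R}{\Phi}}(x))\neq 0$ we have

$$\lim_{y \to \mathcal{R}\Phi(x)} \mathbbm{1}_{(0,\infty)}(u_{i}(y)) = \mathbbm{1}_{(0,\infty)}(u_{i}(\mathcal{R}\Phi(x))) \tag{19}$$

as $u_{i}$ is continuous and ${\mathbbm{1}}_{(0,\infty)}$ is continuous on ${\mathbb{R}}\backslash\{0\}$ . Furthermore, [19, Thm 3.3(i)] implies that

$$(D(u_{i} \circ \mathcal{R}\Phi))(x) = 0 \tag{20}$$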

for almost every $x\in{\mathbb{R}}^{d}$ with $u_{i}({\mathcal{R}{\Phi}}(x))=0$ . Since a finite union of nullsets is again a nullset, this proves the claim (18). The lemma follows by induction over the layers $K=1,\dots,\tilde{L}$ of $\Psi$ and applying (18) with $u=\mathcal{R}_{K}\Psi$ . ∎

IV General Activation Functions

As mentioned in the introduction, it is possible to replace the ReLU activation function in (2) by some locally Lipschitz continuous, component-wise applied function $\varrho\colon{\mathbb{R}}\to{\mathbb{R}}$ with an at most countably large set $S$ of points where $\varrho$ is not differentiable. Specifically, one can define the neural network derivative (with activation function $\varrho$ ) as in (9) with $\mathbbm{1}_{(0,\infty)}(x_{i})$ in (8) replaced by

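$$(\bar{D}\varrho)(x_{i}) := \begin{cases} 0, & x_{i} \in S, \\ (D\varrho)(x_{i}), & \text{else}. \end{cases} \tag{21}$$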

The chain rule can, again, be checked by direct computation and it is straightforward to adapt Theorem III.1 to this more general setting by considering the level sets

$$\{x \in \mathbb{R}^{d} : w_{i}(x) = s\}, \quad s \in S. \tag{22}$$

If additionally $\bar{{D}}\varrho$ is continuous on ${\mathbb{R}}\setminus S$ , the proof of Lemma III.3 translates without any modifications.

V Utilization in Approximation Theory

These results can now be employed to bound the $L^{\infty}$ -norm of ${\mathcal{D}}(\Psi\odot\Phi)-{D}(u\circ\,v)$ , given corresponding estimates for the approximation of $u$ and $v$ by $\Psi$ and $\Phi$ , respectively. Here, one has to take some care when bounding the term

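$$\big\| [\mathcal{D}\Psi \circ \mathcal{R}\Phi - Du \circ \mathcal{R}\Phi]\, \mathcal{D}\Phi \big\|_{L^{\infty}} \tag{23}$$

by

$$\| \mathcal{D}\Psi - Du \|_{L^{\infty}} \, \| \mathcal{D}\Phi \|_{L^{\infty}}. \tag{24}$$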

Again it can happen that ${\mathcal{R}{\Phi}}$ maps a set of positive measure into a nullset where the estimate for the approximation of ${D}u$ by ${\mathcal{D}}\Psi$ in the essential supremum norm is not valid. However, using the stability result in Lemma III.3 one can, for almost every $x\in{\mathbb{R}}^{d}$ , shift to a sufficiently close point $y\approx{\mathcal{R}{\Phi}}(x)$ where the estimate holds. In [13] Yarotsky explicitly constructs networks whose realization is a linear interpolation of the squaring function (see Fig. 1 for illustration), which directly gives an estimate on the approximation rate for the derivatives. (The interpolation points are uniformly distributed over the domain of approximation and their number grows exponentially with the size of the networks.) These simple networks can then be combined to get networks approximating multiplication, polynomials and eventually, by means of e.g. local Taylor approximation, functions $f$ whose first $n\geq 1$ (weak) derivatives are bounded. This leads to estimates of the form

$$\| f - \mathcal{R}\Phi_{\varepsilon,B} \|_{L^{\infty}(I_{B})} \leq \varepsilon, \tag{25}$$

with $I_{B}=[-B,B]^{d}$ , including estimates for the scaling of the size of the network $\Phi_{\varepsilon,B}$ w.r.t. $B$ and $\varepsilon$ . As these constructions are based on composing simpler functions with known estimates one can now employ Theorem III.1 and Corollary III.2 to show that the derivatives of those networks also approximate the derivative of the function, i.e.

$$\| Df - \mathcal{D}\Phi_{\varepsilon,B} \|_{L^{\infty}(I_{B})} \leq c\,\varepsilon^{r}. \tag{26}$$

Such constructive approaches can further be found in [8], in [14] for $\beta$ -cartoon-like functions, in [20] for $(\bm{b},\varepsilon)$ -holomorphic maps, and in [15] for high-frequency sinusoidal functions.
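To make the remark about the squaring function concrete, here is a sketch of the sawtooth construction as it is commonly stated for [13]: with the hat function $g(x)=2\operatorname{ReLU}(x)-4\operatorname{ReLU}(x-\tfrac12)+2\operatorname{ReLU}(x-1)$ and its $s$-fold composition $g_{s}$, the function $f_{m}(x)=x-\sum_{s=1}^{m}g_{s}(x)/4^{s}$ interpolates $x^{2}$ linearly at the $2^{m}+1$ uniform grid points of $[0,1]$. The code evaluates $f_{m}$ directly as a function rather than assembling network weights, and the names `hat` and `square_approx` are assumptions for illustration. On this example the value error is $2^{-2m-2}$ and the slope error is $2^{-m}$, an instance of (25) and (26) with $r=1/2$.

```python
def relu(t):
    return max(0.0, t)

def hat(t):
    # Hat function on [0, 1] with peak 1 at t = 1/2, written with three ReLUs.
    return 2 * relu(t) - 4 * relu(t - 0.5) + 2 * relu(t - 1.0)

def square_approx(x, m):
    # f_m(x) = x - sum_{s=1}^m g_s(x) / 4^s, g_s = hat composed s times.
    g, out = x, x
    for s in range(1, m + 1):
        g = hat(g)
        out -= g / 4 ** s
    return out

m = 4
pieces = 2 ** m

# Value error on a fine dyadic grid of [0, 1].
grid = [j / 1024 for j in range(1025)]
max_value_error = max(abs(square_approx(x, m) - x * x) for x in grid)

# Slope of each linear piece versus the derivative 2x at the piece endpoints.
max_slope_error = 0.0
for k in range(pieces):
    left, right = k / pieces, (k + 1) / pieces
    slope = (square_approx(right, m) - square_approx(left, m)) * pieces
    max_slope_error = max(max_slope_error,
                          abs(slope - 2 * left), abs(slope - 2 * right))

print(max_value_error, 2.0 ** (-2 * m - 2))
print(max_slope_error, 2.0 ** (-m))
```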

VI Global Error Estimates

The error estimates above are usually only sensible for bounded domains, as the realization of a neural network is always CPL with a finite number of pieces. We briefly discuss a general way of transforming them into global pointwise error estimates, which can be useful in the context of PDEs (see e.g. [9, 10]). In the following assume that we have a function $f$ with an at most polynomially growing derivative, i.e.

$$\| (Df)(x) \|_{2} \leq c\,(1 + \|x\|_{2}^{\kappa}). \tag{27}$$

Denote by $\Phi_{B}^{\operatorname{char}}$ a neural network which represents the $d$ -dimensional approximate characteristic function of $I_{B}$ , i.e. ${\mathcal{R}{\Phi_{B}^{\operatorname{char}}}}(x)\in[0,1]$ and

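$$\mathcal{R}\Phi_{B}^{\operatorname{char}}(x) = 1, \ x \in I_{B}, \qquad \mathcal{R}\Phi_{B}^{\operatorname{char}}(x) = 0, \ x \notin I_{B+1}. \tag{28}$$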

See [15, Proof of Thm. VIII.3] for such a construction. Further let $\Phi_{\varepsilon,b}^{\operatorname{mult}}$ be the neural network approximating the multiplication function on $[-b,b]^{2}$ with error $\varepsilon$ (see e.g. [20, Prop. 3.1]).
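For intuition only, the one-dimensional case of such an approximate characteristic function can be written with four ReLUs. The sketch below is an illustrative assumption and not the $d$-dimensional construction from [15]; the name `approx_char_1d` is hypothetical. It produces a trapezoid that equals 1 on $[-B,B]$, vanishes outside $[-B-1,B+1]$, and takes values in $[0,1]$ in between.

```python
def relu(t):
    return max(0.0, t)

def approx_char_1d(x, B):
    # Trapezoid: ramps up on [-B-1, -B], equals 1 on [-B, B],
    # ramps down on [B, B+1], and is 0 elsewhere.
    return (relu(x + B + 1) - relu(x + B)
            - relu(x - B) + relu(x - B - 1))

B = 2.0
samples = [-10.0, -3.0, -2.5, -2.0, 0.0, 2.0, 2.5, 3.0, 10.0]
values = [approx_char_1d(x, B) for x in samples]
print(values)
```

Multiplying a bounded-domain approximation by such a cutoff is what confines it to $I_{B+1}$ in the global construction described next.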

Now we define the global approximation networks $\Phi_{\varepsilon}$ as the composition of $\Phi_{\varepsilon/2,b_{\varepsilon}}^{\operatorname{mult}}$ with the parallelization of $\Phi_{B_{\varepsilon}}^{\operatorname{char}}$ and $\Phi_{\varepsilon/2,B_{\varepsilon}+1}$ for suitable

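$$B_{\varepsilon} \in \mathcal{O}(\varepsilon^{-1}) \quad \text{and} \quad b_{\varepsilon} \in \mathcal{O}(\varepsilon^{-\kappa-1}). \tag{29}$$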

See Figure 2 for an illustration and e.g. [14, Def. 2.7] for a formal definition of parallelization. Considering the errors on $I_{B}$ , $I_{B+1}\backslash I_{B}$ and ${\mathbb{R}}^{d}\backslash I_{B+1}$ leads to global estimates, i.e. for every $x\in{\mathbb{R}}^{d}$

$$| f(x) - \mathcal{R}\Phi_{\varepsilon}(x) | \leq \varepsilon\,(1 + \|x\|_{2}^{\kappa+2}) \tag{30}$$

and, by use of the chain rule in Corollary III.2, for almost every $x\in{\mathbb{R}}^{d}$

[TABLE]

Due to the logarithmic size scaling of the multiplication network, the size of $\Phi_{\varepsilon}$ can be bounded by the size of $\Phi_{\varepsilon/2,B_{\varepsilon}+1}$ plus an additional term in $\mathcal{O}(d+\kappa\log\varepsilon^{-1})$ .

VII Application to PDEs

Analyzing the regularity properties of neural networks was motivated by the recent successful application of deep learning methods to PDEs [2, 3, 4, 5, 6, 7, 11]. Initiated by empirical experiments [1] it has been proven that neural networks are capable of overcoming the curse of dimensionality for solving so-called Kolmogorov PDEs [12]. More precisely, the solution to the empirical risk minimization problem over a class of neural networks approximates the solution of the PDE up to error $\varepsilon$ with high probability and with size of the networks and number of samples scaling only polynomially in the dimension $d$ and $\varepsilon^{-1}$ . The above requires a suitable learning problem and a sufficiently good approximation of the solution function by neural networks. For Kolmogorov PDEs, this boils down to calculating global Lipschitz coefficients and error estimates for neural networks approximating the initial condition and coefficient functions (see e.g. [9, 10]). Employing estimates of the form (26) one can bound the derivative on $I_{B}$ , i.e.

[TABLE]

Using mollification and the mean value theorem we can establish local Lipschitz estimates, i.e. for all $x,y\in(-B,B)^{d}$ that

[TABLE]

and corresponding linear growth bounds

[TABLE]

Similarly, one can use (31) to obtain estimates of the form

[TABLE]

for all $x,y\in{\mathbb{R}}^{d}$ (which are demanded in [10, Theorem 1.1]). Moreover, note that the capability to produce approximation results which include error estimates for the derivative is of significant independent interest. Various numerical methods (for instance Galerkin methods) rely on bounding the error in some Sobolev norm $\|\cdot\|_{W^{1,p}}$ , which requires estimates of the derivative differences. We believe that the possibility to obtain regularity estimates significantly contributes to the mathematical theory of neural networks and allows for further advances in the numerical approximation of high dimensional partial differential equations.

VIII Relation to backpropagation in training

The approach discussed here could further be applied to the training of neural networks by (stochastic) gradient descent. Note, however, that this is a slightly different setting. From the approximation theory perspective we were interested in the derivative of $x\mapsto{\mathcal{R}{\Phi}}(x)$ , while in training one requires the derivative of $\Phi\mapsto{\mathcal{R}{\Phi}}(x^{*})$ for some fixed sample $x^{*}$ . In particular this function is no longer CPL but rather continuous piecewise polynomial. While this would necessitate some technical modifications, we believe that it should be possible to employ the method used here in order to show that the gradient of $\Phi\mapsto{\mathcal{R}{\Phi}}(x^{*})$ coincides a.e. with what is computed by backpropagation using the convention of setting the derivative of $\max\{0,\cdot\}$ to zero at the origin (as well as similar conventions for e.g. max-pooling).

Acknowledgment

The research of JB and DE was supported by the Austrian Science Fund (FWF) under grants I3403-N32 and P 30148.

Bibliography

[1] C. Beck, S. Becker, P. Grohs, N. Jaafari, and A. Jentzen, “Solving stochastic differential equations and Kolmogorov equations by means of deep learning,” arXiv:1806.00421, 2018.
[2] W. E, J. Han, and A. Jentzen, “Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations,” Communications in Mathematics and Statistics, vol. 5, no. 4, pp. 349–380, 2017.
[3] J. Han, A. Jentzen, and W. E, “Solving high-dimensional partial differential equations using deep learning,” arXiv:1707.02568, 2017.
[4] J. Sirignano and K. Spiliopoulos, “DGM: A deep learning algorithm for solving partial differential equations,” arXiv:1708.07469, 2017.
[5] M. Fujii, A. Takahashi, and M. Takahashi, “Asymptotic Expansion as Prior Knowledge in Deep Learning Method for high dimensional BSDEs,” arXiv:1710.07030, 2017.
[6] Y. Khoo, J. Lu, and L. Ying, “Solving parametric PDE problems with artificial neural networks,” arXiv:1707.03351, 2017.
[7] W. E and B. Yu, “The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems,” arXiv:1710.00211, 2017.
[8] D. Elbrächter, P. Grohs, A. Jentzen, and C. Schwab, “DNN Expression Rate Analysis of high-dimensional PDEs: Application to Option Pricing,” arXiv:1809.07669, 2018.
