Towards a regularity theory for ReLU networks – chain rule and global error estimates
Julius Berner¹, Dennis Elbrächter¹, Philipp Grohs³, Arnulf Jentzen⁴
¹ Faculty of Mathematics, University of Vienna
Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria
³ Faculty of Mathematics and Research Platform DataScience@UniVienna, University of Vienna
Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria
⁴ Department of Mathematics, ETH Zürich
Rämistrasse 101, 8092 Zürich, Switzerland
Abstract
Although for neural networks with locally Lipschitz continuous activation functions the classical derivative exists almost everywhere, the standard chain rule is in general not applicable. We will consider a way of introducing a derivative for neural networks that admits a chain rule, which is both rigorous and easy to work with. In addition we will present a method of converting approximation results on bounded domains to global (pointwise) estimates. This can be used to extend known neural network approximation theory to include the study of regularity properties. Of particular interest is the application to neural networks with ReLU activation function, where it contributes to the understanding of the success of deep learning methods for high-dimensional partial differential equations.
I Introduction
It has been observed that deep neural networks exhibit the remarkable capability of overcoming the curse of dimensionality in a number of different scenarios. In particular, for certain types of high-dimensional partial differential equations (PDEs) there are promising empirical observations [1, 2, 3, 4, 5, 6, 7] backed by theoretical results for both the approximation error [8, 9, 10, 11] and the generalization error [12]. In this context it becomes relevant not only to show how well a given function of interest can be approximated by neural networks but also to extend the study to the derivative of this function.
A number of recent publications [13, 14, 15] have investigated the size of a network that is sufficient to approximate certain interesting (classes of) functions within a given accuracy. This is achieved, first, by considering the approximation of basic functions by very simple networks and, subsequently, by combining those networks in order to approximate more difficult structures. To extend this approach to include the regularity of the approximation, one requires some kind of chain rule for the composition of neural networks. For neural networks with a differentiable activation function the standard chain rule is sufficient. It fails, however, for neural networks with an activation function that is not everywhere differentiable. Although locally Lipschitz continuous functions are differentiable almost everywhere (a.e.) w.r.t. the Lebesgue measure, the standard chain rule is not applicable, as, in general, it does not hold even in an 'almost everywhere' sense.
We will introduce derivatives of neural networks in a way that admits a chain rule which is both rigorous and easy to work with. Chain rules for functions which are not everywhere differentiable have been considered in a more general setting in e.g. [16, 17]. We employ the specific structure of neural networks to get stronger results using simpler arguments. In particular, it allows for a stability result, i.e. Lemma III.3, the application of which will be discussed in Section V. We would also like to mention a very recent work [18] on approximation in Sobolev norms, which deals with the issue by using a general bound for the Sobolev norm of the composition of functions from the Sobolev space $W^{1,\infty}$. Note, however, that this approach leads to a certain factor depending on the dimensions of the domains of the functions, which can be avoided with our method.
For ease of exposition, we formulate our results for neural networks with the ReLU activation function. We, however, consider in Section IV how such a chain rule can be obtained for any activation function which is locally Lipschitz continuous (with at most countably many points at which it is not differentiable). In Section V we briefly sketch how the results from Section III can be utilized to get approximation results for certain classes of functions. Subsequently, in Section VI, we present a general method of deriving global error estimates from such approximation results, which are naturally obtained for bounded domains. Ultimately, we discuss how our results can be used to extend known theory, enabling the further study of the approximation of PDE solutions by neural networks.
II Setting
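As in [14], we consider a neural network $\Phi$ to be a finite sequence of matrix-vector pairs, i.e.
$$\Phi = ((A_1, b_1), (A_2, b_2), \dots, (A_L, b_L)), \tag{1}$$
where $A_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ and $b_\ell \in \mathbb{R}^{N_\ell}$ for some depth $L \in \mathbb{N}$ and layer dimensions $N_0, N_1, \dots, N_L \in \mathbb{N}$. The realization of the neural network $\Phi$ is the function $\mathcal{R}\Phi \in C(\mathbb{R}^{N_0}, \mathbb{R}^{N_L})$ given by
$$\mathcal{R}\Phi = W_L \circ \mathrm{ReLU} \circ W_{L-1} \circ \dots \circ \mathrm{ReLU} \circ W_1, \tag{2}$$
where $W_\ell(x) = A_\ell x + b_\ell$ for every $x \in \mathbb{R}^{N_{\ell-1}}$ and where
$$\mathrm{ReLU}(x) = (\max\{0, x_1\}, \dots, \max\{0, x_N\}) \tag{3}$$
for every $x \in \mathbb{R}^N$. We distinguish between a neural network and its realization, since $\Phi$ uniquely induces $\mathcal{R}\Phi$, while in general there can be multiple non-trivially different neural networks with the same realization. The representation of a neural network as a structured set of weights as in (1) allows the introduction of notions of network sizes. While there are slight differences between various publications, commonly considered quantities are the depth (i.e. the number of affine transformations), the connectivity (i.e. the number of non-zero entries of the $A_\ell$ and $b_\ell$), and the weight bound (i.e. the maximum of the absolute values of the entries of the $A_\ell$ and $b_\ell$). In [15] it has been shown that these three quantities determine the length of a bit string which is sufficient to encode the network with a prescribed quantization error. In the following let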
[TABLE]
be neural networks with matching dimensions in the sense that and . We then define their composition as
[TABLE]
Direct computation shows
[TABLE]
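The definitions of this section can be sketched in a few lines of NumPy. The following is a minimal illustration, not the authors' code. It assumes that a network is stored as a Python list of `(A, b)` pairs and that the composition merges the last affine map of the inner network with the first affine map of the outer one, which is one construction consistent with the statement that the realization of the composition equals the composition of the realizations. All function names are hypothetical.

```python
import numpy as np

def realization(network, x):
    """Apply the affine maps with ReLU in between; no ReLU after the last layer."""
    x = np.asarray(x, dtype=float)
    for A, b in network[:-1]:
        x = np.maximum(A @ x + b, 0.0)
    A, b = network[-1]
    return A @ x + b

def connectivity(network):
    """Number of non-zero entries of all matrices and vectors."""
    return sum(int(np.count_nonzero(A)) + int(np.count_nonzero(b)) for A, b in network)

def weight_bound(network):
    """Largest absolute value among all matrix and vector entries."""
    return max(float(max(np.abs(A).max(), np.abs(b).max())) for A, b in network)

def compose(outer, inner):
    """Assumed composition: merge the last layer of `inner` with the first layer of `outer`."""
    A_in, b_in = inner[-1]
    A_out, b_out = outer[0]
    merged = (A_out @ A_in, A_out @ b_in + b_out)
    return list(inner[:-1]) + [merged] + list(outer[1:])

# The absolute value |x| = ReLU(x) + ReLU(-x) as a network of depth 2.
abs_net = [(np.array([[1.0], [-1.0]]), np.zeros(2)),
           (np.array([[1.0, 1.0]]), np.zeros(1))]

rng = np.random.default_rng(0)

def random_network(dims):
    """Random weights for the layer dimensions dims = [N_0, ..., N_L]."""
    return [(rng.standard_normal((m, n)), rng.standard_normal(m))
            for n, m in zip(dims[:-1], dims[1:])]

inner = random_network([3, 5, 4])
outer = random_network([4, 6, 2])
composed = compose(outer, inner)

# Compare the realization of the composition with the composition of realizations.
x = rng.standard_normal(3)
composition_gap = np.max(np.abs(
    realization(composed, x) - realization(outer, realization(inner, x))))
```

Under this assumed construction the depth of the composed network is the sum of the two depths minus one, and `composition_gap` vanishes up to rounding.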
Note that the realization of a neural network is continuous piecewise linear (CPL) as a composition of CPL functions. Consequently, it is Lipschitz continuous and differentiable almost everywhere by Rademacher's theorem. In particular, all three functions in (6) are a.e. differentiable. This, however, is not sufficient to obtain the derivative of the composition from the derivatives of the two composed realizations by use of the classical chain rule. Consider the very simple counterexample in which the outer function is the ReLU and the inner function is identically zero, and formally apply the chain rule, i.e.
[TABLE]
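Even though the derivative of the composition is well-defined at every point, the derivative of the outer function evaluated at the inner function is defined at no point. In general, this problem occurs when the inner function maps a set of positive measure into a set where the derivative of the outer function does not exist. In this case, however, one can directly see that setting the undefined value of the outer derivative to an arbitrary value would cause (7) to provide the correct result, since the derivative of the inner function vanishes everywhere.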
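One concrete instance consistent with this description can be written out as follows. It is a sketch under the assumption that the outer function is the ReLU and the inner function is identically zero. The labels $u$, $v$ and $c$ are chosen here for illustration and are not taken from the original text.

```latex
\begin{align*}
u(x) &= \max\{0, x\}, \qquad v(x) = 0 \quad \text{for every } x \in \mathbb{R}, \\
(u \circ v)(x) &= 0 \quad \Longrightarrow \quad (u \circ v)'(x) = 0 \quad \text{for every } x \in \mathbb{R}, \\
u'(v(x)) \, v'(x) &= u'(0) \cdot 0, \quad \text{where } u'(0) \text{ does not exist}, \\
c \cdot v'(x) &= c \cdot 0 = 0 = (u \circ v)'(x) \quad \text{for any assigned value } u'(0) := c \in \mathbb{R}.
\end{align*}
```

The last line shows why the choice of $c$ is irrelevant in this instance: the undefined factor is always multiplied by a vanishing inner derivative.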
III ReLU network derivative
We proceed by defining the derivative of an arbitrary neural network in a way such that it not only coincides a.e. with the derivative of the realization, but also admits a chain rule. To this end let be the function given by
[TABLE]
for every and let . We then define the neural network derivative of as the function given by
[TABLE]
Note that this definition is motivated by formally applying the chain rule with the convention that the derivative of the ReLU is zero at the origin. We now need to verify that this is justified.
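The following NumPy sketch illustrates such a formally defined derivative. It assumes that the definition amounts to multiplying the weight matrices with diagonal 0/1 masks that indicate strictly positive pre-activations, so that the ReLU derivative is taken to be zero at the origin. The function names and the finite-difference comparison are illustrative additions, not part of the paper.

```python
import numpy as np

def realization(network, x):
    """Apply the affine maps with ReLU in between; no ReLU after the last layer."""
    x = np.asarray(x, dtype=float)
    for A, b in network[:-1]:
        x = np.maximum(A @ x + b, 0.0)
    A, b = network[-1]
    return A @ x + b

def network_derivative(network, x):
    """Formal chain rule through the layers, with ReLU'(0) taken to be 0."""
    x = np.asarray(x, dtype=float)
    jac = np.eye(x.size)
    for A, b in network[:-1]:
        pre = A @ x + b
        active = (pre > 0).astype(float)   # 1 for positive pre-activations, 0 otherwise
        jac = active[:, None] * (A @ jac)  # row-wise mask applied to the running Jacobian
        x = np.maximum(pre, 0.0)
    A, _ = network[-1]
    return A @ jac

def finite_difference_jacobian(network, x, h=1e-6):
    """Central differences; meaningful only where the realization is differentiable."""
    x = np.asarray(x, dtype=float)
    columns = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        columns.append((realization(network, x + step)
                        - realization(network, x - step)) / (2 * h))
    return np.stack(columns, axis=1)

rng = np.random.default_rng(1)
network = [(rng.standard_normal((m, n)), rng.standard_normal(m))
           for n, m in [(3, 8), (8, 8), (8, 2)]]
x = rng.standard_normal(3)  # a randomly drawn point is, with probability one, not on a kink
max_gap = np.max(np.abs(network_derivative(network, x)
                        - finite_difference_jacobian(network, x)))

# At a kink the formal derivative is still defined: the ReLU network at the origin.
relu_net = [(np.array([[1.0]]), np.zeros(1)), (np.array([[1.0]]), np.zeros(1))]
derivative_at_kink = network_derivative(relu_net, np.zeros(1))
```

At the random point the two Jacobians agree up to rounding, while at the origin the formal derivative of the ReLU network is zero by convention even though the classical derivative does not exist there.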
Theorem III.1.
It holds for almost every that
[TABLE]
Proof.
Let be a locally Lipschitz continuous function, define , and
[TABLE]
We now use an observation about differentiability on level sets (see e.g. [19, Thm 3.3(i)]), which states that
[TABLE]
As for every , we get a.e.
[TABLE]
and consequently
[TABLE]
The claim follows by induction over the layers of , using (14) with for the induction step. ∎
Note that even for a convex realization the values of the neural network derivative on the nullset do not necessarily lie in the respective subdifferentials of the realization, as can be seen in Figure 1. Although Theorem III.1 holds regardless of which value is chosen for the derivative of the ReLU at the origin, no choice will guarantee that all values of the neural network derivative lie in the respective subdifferentials of the realization. Here we have set the derivative at the origin to zero, following the convention of software implementations for deep learning applications, e.g. TensorFlow and PyTorch. Using (5) and (9) one can verify by direct computation that the neural network derivative obeys the chain rule.
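The remark on subdifferentials can be made concrete with a small example. The two networks below are hypothetical illustrations (they are not the example of Figure 1). Both realize the identity on the real line, which is convex with subdifferential {1} at the origin. The sketch assumes the formal derivative is computed layer by layer, with a free parameter `relu_slope_at_zero` standing in for the value assigned to the ReLU derivative at the origin.

```python
import numpy as np

def network_derivative(network, x, relu_slope_at_zero=0.0):
    """Formal chain rule with a configurable value for ReLU'(0)."""
    x = np.asarray(x, dtype=float)
    jac = np.eye(x.size)
    for A, b in network[:-1]:
        pre = A @ x + b
        slope = np.where(pre > 0, 1.0, np.where(pre < 0, 0.0, relu_slope_at_zero))
        jac = slope[:, None] * (A @ jac)
        x = np.maximum(pre, 0.0)
    A, _ = network[-1]
    return A @ jac

split = np.array([[1.0], [-1.0]])
merge = np.array([[1.0, -1.0]])

# x = ReLU(x) - ReLU(-x)
shallow_identity = [(split, np.zeros(2)), (merge, np.zeros(1))]
# x = ReLU(ReLU(x)) - ReLU(ReLU(-x))
deep_identity = [(split, np.zeros(2)), (np.eye(2), np.zeros(2)), (merge, np.zeros(1))]

origin = np.zeros(1)
results = {}
for c in (0.0, 0.5, 1.0):
    results[c] = (network_derivative(shallow_identity, origin, c)[0, 0],
                  network_derivative(deep_identity, origin, c)[0, 0])
```

With the value `c` assigned at the origin, the shallow network yields the formal derivative `2 * c` and the deep one `2 * c**2`. The first equals 1 only for `c = 0.5` and the second only for `c = 2**-0.5`, so in this example no single choice places both values in the subdifferential.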
Corollary III.2.
It holds for every that
[TABLE]
Note that (15) is well-defined as the neural network derivative exists everywhere, although it only coincides with the derivative of the realization almost everywhere. Theorem III.1, however, guarantees that we still have a.e.
[TABLE]
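The chain rule for the formal derivative can be checked numerically. The sketch below assumes the composition merges the last affine map of the inner network with the first affine map of the outer one, and that the formal derivative uses the convention that the ReLU derivative is zero at the origin. All names are hypothetical. It tests one random pair of networks and one degenerate pair where the classical chain rule is not applicable.

```python
import numpy as np

def realization(network, x):
    x = np.asarray(x, dtype=float)
    for A, b in network[:-1]:
        x = np.maximum(A @ x + b, 0.0)
    A, b = network[-1]
    return A @ x + b

def network_derivative(network, x):
    """Formal chain rule through the layers, with ReLU'(0) taken to be 0."""
    x = np.asarray(x, dtype=float)
    jac = np.eye(x.size)
    for A, b in network[:-1]:
        pre = A @ x + b
        jac = (pre > 0).astype(float)[:, None] * (A @ jac)
        x = np.maximum(pre, 0.0)
    A, _ = network[-1]
    return A @ jac

def compose(outer, inner):
    """Assumed composition: merge the last layer of `inner` with the first layer of `outer`."""
    A_in, b_in = inner[-1]
    A_out, b_out = outer[0]
    return list(inner[:-1]) + [(A_out @ A_in, A_out @ b_in + b_out)] + list(outer[1:])

rng = np.random.default_rng(2)

def random_network(dims):
    return [(rng.standard_normal((m, n)), rng.standard_normal(m))
            for n, m in zip(dims[:-1], dims[1:])]

# Generic case: random weights, random point.
inner, outer = random_network([3, 6, 4]), random_network([4, 5, 2])
x = rng.standard_normal(3)
lhs = network_derivative(compose(outer, inner), x)
rhs = network_derivative(outer, realization(inner, x)) @ network_derivative(inner, x)
generic_gap = np.max(np.abs(lhs - rhs))

# Degenerate case: the inner network is constant zero, the outer network is the ReLU.
zero_net = [(np.array([[0.0]]), np.zeros(1))]
relu_net = [(np.array([[1.0]]), np.zeros(1)), (np.array([[1.0]]), np.zeros(1))]
point = np.array([0.7])
lhs_degenerate = network_derivative(compose(relu_net, zero_net), point)
rhs_degenerate = (network_derivative(relu_net, realization(zero_net, point))
                  @ network_derivative(zero_net, point))
```

In the degenerate case both sides are zero, even though the classical derivative of the outer function does not exist at the point where it is evaluated.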
Next we provide a technical result dealing with the stability of our chain rule, which will prove to be useful in Section V.
Lemma III.3.
It holds for almost every that
[TABLE]
Proof.
We first show for every locally Lipschitz continuous function and for almost every that
[TABLE]
If we have
[TABLE]
as is continuous and is continuous on . Furthermore, [19, Thm 3.3(i)] implies that
[TABLE]
for almost every with . Since a finite union of nullsets is again a nullset, this proves the claim (18). The lemma follows by induction over the layers of and applying (18) with . ∎
IV General Activation Functions
As mentioned in the introduction, it is possible to replace the ReLU activation function in (2) by some locally Lipschitz continuous, component-wise applied function that fails to be differentiable on an at most countable set of points. Specifically, one can define the neural network derivative (with this activation function) as in (9) with the function defined in (8) replaced by
[TABLE]
The chain rule can, again, be checked by direct computation and it is straightforward to adapt Theorem III.1 to this more general setting by considering the level sets
[TABLE]
If additionally is continuous on , the proof of Lemma III.3 translates without any modifications.
V Utilization in Approximation Theory
These results can now be employed to bound the -norm of , given corresponding estimates for the approximation of and by and , respectively. Here, one has to take some care when bounding the term
[TABLE]
by
[TABLE]
Again it can happen that the inner realization maps a set of positive measure into a nullset where the essential supremum estimate for the approximation of the outer derivative is not valid. However, using the stability result in Lemma III.3, one can, for almost every point, shift to a sufficiently close point where the estimate holds. In [13] Yarotsky explicitly constructs networks whose realization is a linear interpolation of the squaring function (see Fig. 1 for illustration), which directly gives an estimate on the approximation rate for the derivatives. The interpolation points are uniformly distributed over the domain of approximation and their number grows exponentially with the size of the networks. These simple networks can then be combined to get networks approximating multiplication, polynomials and eventually, by means of e.g. local Taylor approximation, functions whose first (weak) derivatives are bounded. This leads to estimates of the form
[TABLE]
with , including estimates for the scaling of the size of the network w.r.t. and . As these constructions are based on composing simpler functions with known estimates one can now employ Theorem III.1 and Corollary III.2 to show that the derivatives of those networks also approximate the derivative of the function, i.e.
[TABLE]
Such constructive approaches can further be found in [8], in [14] for -cartoon-like functions, in [20] for -holomorphic maps, and in [15] for high-frequent sinusoidal functions.
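The interpolation of the squaring function mentioned above can be reproduced numerically. The sketch below uses the iterated hat-function ("sawtooth") construction as it is commonly presented for the result in [13]. The function names and the choice m = 5 are illustrative. It measures both the error of the values and the error of the piecewise constant derivative on the unit interval.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x):
    """Hat function on [0, 1] written with three ReLU units."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)

def square_interpolant(x, m):
    """x minus the sum over s = 1..m of the s-fold iterated hat function divided by 4^s."""
    out = np.array(x, dtype=float)
    saw = np.array(x, dtype=float)
    for s in range(1, m + 1):
        saw = hat(saw)               # s-fold composition of the hat function
        out = out - saw / 4.0 ** s
    return out

m = 5
h = 2.0 ** -m                        # distance between interpolation points

# Error of the values on a fine grid that contains all interval midpoints.
xs = np.linspace(0.0, 1.0, 4097)
value_error = np.max(np.abs(square_interpolant(xs, m) - xs ** 2))

# Error of the derivative: constant slope on each piece versus 2x at both end points.
nodes = np.arange(2 ** m + 1) * h
slopes = np.diff(square_interpolant(nodes, m)) / h
derivative_error = np.max(np.maximum(np.abs(slopes - 2 * nodes[:-1]),
                                     np.abs(slopes - 2 * nodes[1:])))
```

For this construction the value error is `h**2 / 4` and the derivative error is `h`, so both decay exponentially in the number `m` of composed hat functions, with the derivative converging at half the exponential rate.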
VI Global Error Estimates
The error estimates above are usually only sensible for bounded domains, as the realization of a neural network is always CPL with a finite number of pieces. We briefly discuss a general way of transforming them into global pointwise error estimates, which can be useful in the context of PDEs (see e.g. [9, 10]). In the following assume that we have a function with an at most polynomially growing derivative, i.e.
[TABLE]
Denote by a neural network which represents the -dimensional approximate characteristic function of , i.e. and
[TABLE]
See [15, Proof of Thm. VIII.3] for such a construction. Further let be the neural network approximating the multiplication function on with error (see e.g. [20, Prop. 3.1]).
Now we define the global approximation networks as the composition of with the parallelization of and for suitable
[TABLE]
See Figure 2 for an illustration and e.g. [14, Def. 2.7] for a formal definition of parallelization. Considering the errors on , and leads to global estimates, i.e. for every
[TABLE]
and, by use of the chain rule in Corollary III.2, for almost every
[TABLE]
Due to the logarithmic size scaling of the multiplication network, the size of can be bounded by the size of plus an additional term in .
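The cutoff idea behind this construction can be illustrated in one dimension. The sketch below makes several simplifying assumptions that are not taken from the paper: the approximate characteristic function is a trapezoid built from four ReLU units, the multiplication network is replaced by an exact product, and `np.interp` stands in for the realization of a local approximation network. The target function, the radius `R` and all names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def approx_indicator(x, R):
    """Trapezoid: 1 on [-R, R], 0 outside [-R-1, R+1], linear in between."""
    return relu(x + R + 1) - relu(x + R) - relu(x - R) + relu(x - R - 1)

def target(x):
    """Example function whose derivative grows polynomially."""
    return x ** 2

R = 2.0
grid = np.linspace(-R - 1, R + 1, 601)   # interpolation points of the local approximation

def local_approximation(x):
    """CPL stand-in for a network that is accurate only on [-R-1, R+1]."""
    return np.interp(x, grid, target(grid))

def global_approximation(x):
    """Local approximation times cutoff (exact product instead of a multiplication network)."""
    return local_approximation(x) * approx_indicator(x, R)

inside = np.linspace(-R, R, 1001)
error_inside = np.max(np.abs(target(inside) - global_approximation(inside)))
error_far_out = abs(target(10.0) - global_approximation(10.0))
```

Inside the cutoff region the error is that of the local approximation, while far outside the approximation vanishes and the error equals the size of the target itself. This is where a growth condition on the target enters a global pointwise estimate.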
VII Application to PDEs
Analyzing the regularity properties of neural networks was motivated by the recent successful application of deep learning methods to PDEs [2, 3, 4, 5, 6, 7, 11]. Initiated by empirical experiments [1], it has been proven that neural networks are capable of overcoming the curse of dimensionality for solving so-called Kolmogorov PDEs [12]. More precisely, the solution to the empirical risk minimization problem over a class of neural networks approximates the solution of the PDE up to error $\varepsilon$ with high probability, with the size of the networks and the number of samples scaling only polynomially in the dimension $d$ and $\varepsilon^{-1}$. The above requires a suitable learning problem and a sufficiently good approximation of the solution function by neural networks. For Kolmogorov PDEs, this boils down to calculating global Lipschitz coefficients and error estimates for neural networks approximating the initial condition and coefficient functions (see e.g. [9, 10]). Employing estimates of the form (26) one can bound the derivative on , i.e.
[TABLE]
Using mollification and the mean value theorem we can establish local Lipschitz estimates, i.e. for all that
[TABLE]
and corresponding linear growth bounds
[TABLE]
Similarly, one can use (31) to obtain estimates of the form
[TABLE]
for all (which are required in [10, Theorem 1.1]). Moreover, note that the capability to produce approximation results which include error estimates for the derivative is of significant independent interest. Various numerical methods (for instance Galerkin methods) rely on bounding the error in some Sobolev norm, which requires estimates of the derivative differences. We believe that the possibility to obtain regularity estimates significantly contributes to the mathematical theory of neural networks and allows for further advances in the numerical approximation of high-dimensional partial differential equations.
VIII Relation to backpropagation in training
The approach discussed here could further be applied to the training of neural networks by (stochastic) gradient descent. Note, however, that this is a slightly different setting. From the approximation theory perspective we were interested in the derivative of the realization with respect to its input, while in training one requires the derivative with respect to the network weights for some fixed sample. In particular, this function is no longer CPL but rather continuous piecewise polynomial. While this would necessitate some technical modifications, we believe that it should be possible to employ the method used here in order to show that the gradient of this function coincides a.e. with what is computed by backpropagation using the convention of setting the derivative of the ReLU to zero at the origin (as well as similar conventions for e.g. max-pooling).
Acknowledgment
The research of JB and DE was supported by the Austrian Science Fund (FWF) under grants I3403-N32 and P 30148.
[1] C. Beck, S. Becker, P. Grohs, N. Jaafari, and A. Jentzen, “Solving stochastic differential equations and Kolmogorov equations by means of deep learning,” arXiv:1806.00421, 2018.
[2] W. E, J. Han, and A. Jentzen, “Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations,” Communications in Mathematics and Statistics, vol. 5, no. 4, pp. 349–380, 2017.
[3] J. Han, A. Jentzen, and W. E, “Solving high-dimensional partial differential equations using deep learning,” arXiv:1707.02568, 2017.
[4] J. Sirignano and K. Spiliopoulos, “DGM: A deep learning algorithm for solving partial differential equations,” arXiv:1708.07469, 2017.
[5] M. Fujii, A. Takahashi, and M. Takahashi, “Asymptotic Expansion as Prior Knowledge in Deep Learning Method for high dimensional BSDEs,” arXiv:1710.07030, 2017.
[6] Y. Khoo, J. Lu, and L. Ying, “Solving parametric PDE problems with artificial neural networks,” arXiv:1707.03351, 2017.
[7] W. E and B. Yu, “The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems,” arXiv:1710.00211, 2017.
[8] D. Elbrächter, P. Grohs, A. Jentzen, and C. Schwab, “DNN Expression Rate Analysis of high-dimensional PDEs: Application to Option Pricing,” arXiv:1809.07669, 2018.
