Optimal Approximation with Sparsely Connected Deep Neural Networks

Helmut Bölcskei; Philipp Grohs; Gitta Kutyniok; Philipp Petersen

arXiv:1705.01714·cs.LG·May 17, 2018


TL;DR

This paper establishes fundamental lower bounds on the connectivity and memory of deep neural networks for optimal approximation of function classes, demonstrating their universality and near-optimality through theoretical analysis and numerical experiments.

Contribution

It derives lower bounds linking neural network complexity to function class complexity and proves these bounds are achievable for broad classes of functions, showing neural networks' universality.

Findings

01. Neural networks can achieve near-optimal approximation rates for various function classes.

02. Stochastic gradient descent can learn sparse, near-optimal approximations.

03. Theoretical bounds are supported by numerical experiments.


Equations

$$\Phi(x)=W_{L}\,\rho\,(W_{L-1}\,\rho\,(\dots\rho\,(W_{1}(x)))),\quad\text{for }x\in\mathbb{R}^{d},$$

$$\mathcal{N}\mathcal{N}_{\infty,M,d,\rho}:=\bigcup_{L\in\mathbb{N}}\mathcal{N}\mathcal{N}_{L,M,d,\rho},\quad\mathcal{N}\mathcal{N}_{L,\infty,d,\rho}:=\bigcup_{M\in\mathbb{N}}\mathcal{N}\mathcal{N}_{L,M,d,\rho},\quad\mathcal{N}\mathcal{N}_{\infty,\infty,d,\rho}:=\bigcup_{L\in\mathbb{N}}\mathcal{N}\mathcal{N}_{L,\infty,d,\rho}.$$

$$\Gamma_{M}^{\mathcal{D}}(f):=\inf_{\substack{I_{M}\subseteq I,\ \#I_{M}=M,\\ (c_{i})_{i\in I_{M}}}}\Big\|f-\sum_{i\in I_{M}}c_{i}\varphi_{i}\Big\|_{L^{2}(\Omega)}.$$

$$\sup_{f\in\mathcal{C}}\Gamma_{M}^{\mathcal{D}}(f)\in\mathcal{O}(M^{-\gamma}),\quad M\to\infty,$$

$$\Gamma_{M}^{\mathcal{N}\mathcal{N}}(f):=\inf_{\Phi\in\mathcal{N}\mathcal{N}_{\infty,M,d,\rho}}\|f-\Phi\|_{L^{2}(\Omega)}.$$

$$\sup_{f\in\mathcal{C}}\Gamma_{M}^{\mathcal{N}\mathcal{N}}(f)\in\mathcal{O}(M^{-\gamma}),\quad M\to\infty,$$

$$\sup_{f\in\mathcal{C}}\ \inf_{\substack{I_{M}\subset\{1,\dots,\pi(M)\},\ \#I_{M}=M,\\ (c_{i})_{i\in I_{M}},\ \max_{i\in I_{M}}|c_{i}|\leq D}}\Big\|f-\sum_{i\in I_{M}}c_{i}\varphi_{i}\Big\|_{L^{2}(\Omega)}\in\mathcal{O}(M^{-\gamma}),\quad M\to\infty,$$

$$\sup_{x\in[0,1]^{d}}|f(x)-\Phi(x)|\leq\varepsilon.$$

$$\sup_{f\in\mathcal{C}}\ \inf_{\Phi_{M}\in\mathcal{N}\mathcal{N}_{3,M,d,\rho}}\|f-\Phi_{M}\|_{L^{2}(\Omega)}\leq CM^{-\gamma},\quad\text{for all }M\in\mathbb{N}.$$

$$\sup_{f\in\mathcal{C}}\ \inf_{\Phi_{M}\in\mathcal{N}\mathcal{N}_{L,M,d,\rho}^{\pi}}\|f-\Phi_{M}\|_{L^{2}(\Omega)}\in\mathcal{O}(M^{-\gamma}),\quad M\to\infty,$$

$$\mathfrak{E}^{\ell}:=\{E:\mathcal{C}\to\{0,1\}^{\ell}\}$$

$$\mathfrak{D}^{\ell}:=\{D:\{0,1\}^{\ell}\to L^{2}(\Omega)\}$$

$$\sup_{f\in\mathcal{C}}\|D(E(f))-f\|_{L^{2}(\Omega)}\leq\varepsilon.$$

$$L(\varepsilon,\mathcal{C}):=\min\Big\{\ell\in\mathbb{N}:\exists\,(E,D)\in\mathfrak{E}^{\ell}\times\mathfrak{D}^{\ell}:\ \sup_{f\in\mathcal{C}}\|D(E(f))-f\|_{L^{2}(\Omega)}\leq\varepsilon\Big\}.$$

$$\gamma^{\ast}(\mathcal{C}):=\sup\big\{\gamma\in\mathbb{R}:L(\varepsilon,\mathcal{C})\in\mathcal{O}(\varepsilon^{-1/\gamma}),\ \varepsilon\to 0\big\}.$$

$$\gamma^{\ast,\text{eff}}(\mathcal{C},\mathcal{D})\leq\gamma^{\ast}(\mathcal{C}).$$

$$\gamma^{\ast,\text{eff}}(\mathcal{C},\mathcal{D})=\gamma^{\ast}(\mathcal{C}),$$

$$\gamma_{\mathcal{N}\mathcal{N}}^{\ast,\text{eff}}(\mathcal{C},\rho)\leq\gamma^{\ast}(\mathcal{C}).$$

$$\gamma_{\mathcal{N}\mathcal{N}}^{\ast,\text{eff}}(\mathcal{C},\rho)=\gamma^{\ast}(\mathcal{C}).$$

$$\mathrm{Learn}:\big(0,\tfrac{1}{2}\big)\times\mathcal{C}\to\mathcal{N}\mathcal{N}_{\infty,\infty,d,\rho}$$

$$\sup_{f\in\mathcal{C}}\|f-\mathrm{Learn}(\varepsilon,f)\|_{L^{2}(\Omega)}\leq\varepsilon.$$

$$\sup_{f\in\mathcal{C}}\mathcal{M}(\mathrm{Learn}(\varepsilon,f))\notin\mathcal{O}(\varepsilon^{-1/\gamma}),\quad\varepsilon\to 0,\quad\text{for all }\gamma>\gamma^{\ast}(\mathcal{C}).$$

$$\ell(\varepsilon)\leq C_{0}\cdot\sup_{f\in\mathcal{C}}\big[\mathcal{M}(\mathrm{Learn}(\varepsilon,f))\log_{2}(\mathcal{M}(\mathrm{Learn}(\varepsilon,f)))+1\big]\log_{2}(\varepsilon^{-1}),$$

$$d+\sum_{\ell=1}^{L}N_{\ell}\leq 2M,$$

$$L\leq M$$

$$A_{\ell}=0.$$

$$x\mapsto W_{L}\,\rho\,(W_{L-1}\dots W_{\ell+1}\,\rho\,(0\cdot x+b_{\ell})),$$

$$((L+1)\cdot(\log_{2}(M)+1))\leq(M+1)\log_{2}(M)+M+1.$$

$$M\log_{2}(M)+2\log_{2}(M)+2M+1.$$

$$\sum_{i=1}^{N}(n(i)+1)\cdot(\log_{2}(M)+1)\leq 3M\cdot(\log_{2}(M)+1),$$


Full text

Optimal Approximation with Sparsely Connected Deep Neural Networks

Helmut Bölcskei (Department of Information Technology and Electrical Engineering, ETH Zürich, 8092 Zürich, Switzerland. Email address: [email protected])

Philipp Grohs (Faculty of Mathematics, University of Vienna, 1090 Vienna, Austria, and Research Platform DataScience@UniVienna, University of Vienna, 1090 Vienna, Austria. Email address: [email protected])

Gitta Kutyniok (Institut für Mathematik, Technische Universität Berlin, 10623 Berlin, Germany. Email addresses: {kutyniok,petersen}@math.tu-berlin.de)

Philipp Petersen (Institut für Mathematik, Technische Universität Berlin, 10623 Berlin, Germany. Email addresses: {kutyniok,petersen}@math.tu-berlin.de)

Abstract

We derive fundamental lower bounds on the connectivity and the memory requirements of deep neural networks guaranteeing uniform approximation rates for arbitrary function classes in $L^{2}(\mathbb{R}^{d})$ . In other words, we establish a connection between the complexity of a function class and the complexity of deep neural networks approximating functions from this class to within a prescribed accuracy. Additionally, we prove that our lower bounds are achievable for a broad family of function classes. Specifically, all function classes that are optimally approximated by a general class of representation systems—so-called affine systems—can be approximated by deep neural networks with minimal connectivity and memory requirements. Affine systems encompass a wealth of representation systems from applied harmonic analysis such as wavelets, ridgelets, curvelets, shearlets, $\alpha$ -shearlets, and more generally $\alpha$ -molecules. Our central result elucidates a remarkable universality property of neural networks and shows that they achieve the optimum approximation properties of all affine systems combined. As a specific example, we consider the class of $\alpha^{-1}$ -cartoon-like functions, which is approximated optimally by $\alpha$ -shearlets. We also explain how our results can be extended to the case of functions on low-dimensional immersed manifolds. Finally, we present numerical experiments demonstrating that the standard stochastic gradient descent algorithm generates deep neural networks providing close-to-optimal approximation rates. Moreover, these results indicate that stochastic gradient descent can actually learn approximations that are sparse in the representation systems optimally sparsifying the function class the network is trained on.

Keywords. Neural networks, function approximation, optimal sparse approximation, sparse connectivity, wavelets, shearlets

AMS subject classification. 41A25, 82C32, 42C40, 42C15, 41A46, 68T05, 94A34, 94A12

1 Introduction

Neural networks arose from the seminal work by McCulloch and Pitts [38] in 1943 which, inspired by the functionality of the human brain, introduced an algorithmic approach to learning with the aim of building a theory of artificial intelligence. Roughly speaking, a neural network consists of neurons arranged in layers and connected by weighted edges; in mathematical terms this boils down to a concatenation of affine linear functions and relatively simple non-linearities.

Despite significant theoretical progress in the 1990s [10, 31], the area has seen practical progress only during the past decade, triggered by the drastic improvements in computing power and the availability of vast amounts of training data. Deep neural networks, i.e., networks with large numbers of layers, are now state-of-the-art technology for a wide variety of applications, such as image classification [33], speech recognition [30], or game intelligence [11]. For an in-depth overview, we refer to the survey paper by LeCun, Bengio, and Hinton [36] and the recent book [22].

A neural network effectively implements a non-linear mapping and can be used to either perform classification directly or to extract features that are then fed into a classifier, such as a support vector machine [50]. In the former case, the primary goal is to approximate an unknown classification function based on a given set of input-output value pairs. This is typically accomplished by learning the network’s weights through, e.g., the stochastic gradient descent (via backpropagation) algorithm [48]. In a classification task with, say, two classes, the function to be learned would take only two values, whereas in the case of, e.g., the prediction of the temperature in a certain environment, it would be real-valued. It is therefore clear that characterizing to what extent (deep) neural networks are capable of approximating general functions is a question of significant practical relevance.

Neural networks employed in practice often consist of hundreds of layers and may depend on billions of parameters, see for example the work [29] on image classification. Training and operation of networks of this scale entail formidable computational challenges. As a case in point, we mention speech recognition on a smartphone such as, e.g., Apple’s SIRI-system, which operates in the cloud. Android’s speech recognition system has meanwhile released an offline version based on a neural network with sparse connectivity. The desire to reduce the complexity of network training and operation naturally leads to the question of the fundamental limits on function approximation through neural networks with sparse connectivity. In addition, the network’s memory requirements in terms of the number of bits needed to store its topology and weights are of concern in practice.

The purpose of this paper is to understand the connectivity and memory requirements of (deep) neural networks induced by demands on their approximation-theoretic properties. Specifically, defining the complexity of a function class $\mathcal{C}$ as the rate of growth of the minimum number of bits needed to describe any element in $\mathcal{C}$ to within a maximum allowed error approaching zero, we shall be interested in the following question: Depending on the complexity of $\mathcal{C}$ , what are the connectivity and memory requirements of a deep neural network approximating every element in $\mathcal{C}$ to within an error of $\varepsilon$ ? We address this question by interpreting the network as an encoder in Donoho’s min-max rate distortion theory [17] and establishing rate-distortion optimality for a broad family of function classes $\mathcal{C}$ , namely those classes for which so-called affine systems—a general class of representation systems—yield optimal approximation rates in the sense of non-linear approximation theory [14]. Affine systems encompass a wealth of representation systems from applied harmonic analysis such as wavelets [12], ridgelets [3], curvelets [5], shearlets [28], $\alpha$ -shearlets and more generally $\alpha$ -molecules [25]. Our result therefore uncovers an interesting universality property of deep neural networks; they exhibit the optimal approximation properties of all affine systems combined. The technique we develop to prove our main statements is interesting in its own right as it constitutes a more general framework for transferring results on function approximation through representation systems to results on approximation by deep neural networks.

1.1 Deep Neural Networks

While various network architectures exist in the literature, we focus on the following setup.

Definition 1.1.

Let $L,d,N_{1},\ldots,N_{L}\in\mathbb{N}$ with $L\,\geq\,2$ . A map $\Phi:\mathbb{R}^{d}\to\mathbb{R}^{N_{L}}$ given by

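$$\Phi(x)=W_{L}\,\rho\,(W_{L-1}\,\rho\,(\dots\rho\,(W_{1}(x)))),\quad\text{for }x\in\mathbb{R}^{d},\tag{1}$$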

with affine linear maps $W_{\ell}:\mathbb{R}^{N_{\ell-1}}\to\mathbb{R}^{N_{\ell}}$ , $1\leq\ell\leq L$ , and the non-linear activation function $\rho$ acting component-wise, is called a neural network. Here, $N_{0}:=d$ is the dimension of the $0$ -th layer referred to as the input layer, $L$ denotes the number of layers (not counting the input layer), $N_{1},\ldots,N_{L-1}$ stand for the dimensions of the $L-1$ hidden layers, and $N_{L}$ is the dimension of the output layer. The affine linear map $W_{\ell}$ is defined via $W_{\ell}(x)=A_{\ell}x+b_{\ell}$ with $A_{\ell}\in\mathbb{R}^{N_{\ell}\times N_{\ell-1}}$ and the affine part $b_{\ell}\in\mathbb{R}^{N_{\ell}}$ . The entry $(A_{\ell})_{i,j}$ represents the weight associated with the edge between the $j$ -th node in the $(\ell-1)$ -th layer and the $i$ -th node in the $\ell$ -th layer, while $(b_{\ell})_{i}$ is the weight associated with the $i$ -th node in the $\ell$ -th layer. These assignments are schematized in the paper's figure with label fig:Weights. The total number of nodes is given by $\mathcal{N}(\Phi):=d+\sum_{\ell=1}^{L}N_{\ell}$ . The real numbers $(A_{\ell})_{i,j}$ and $(b_{\ell})_{i}$ are said to be the network’s edge weights and node weights, respectively, and the total number of nonzero edge weights, denoted by $\mathcal{M}(\Phi)$ , is the network’s connectivity.

The term “network” stems from the interpretation of the mapping $\Phi$ as a weighted acyclic directed graph with nodes arranged in $L+1$ hierarchical layers and edges only between adjacent layers. If the network’s connectivity $\mathcal{M}(\Phi)$ is small relative to the number of connections possible (i.e., the number of edges in the graph that is fully connected between adjacent layers), we say that the network is sparsely connected.
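As a concrete illustration of Definition 1.1, the following minimal Python sketch evaluates a network $\Phi$ and counts its nodes $\mathcal{N}(\Phi)$ and its connectivity $\mathcal{M}(\Phi)$. The list-of-pairs representation, the helper names (`forward`, `connectivity`, `num_nodes`), the ReLU activation, and the toy network `abs_net` are assumptions made for illustration only; they are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]
Layer = Tuple[Matrix, List[float]]  # (A_l, b_l) with W_l(x) = A_l x + b_l


def relu(t: float) -> float:
    """Example activation (an assumption); the definition allows any rho."""
    return max(0.0, t)


def affine(layer: Layer, x: Sequence[float]) -> List[float]:
    """Apply the affine map W_l(x) = A_l x + b_l."""
    A, b = layer
    return [sum(a * xj for a, xj in zip(row, x)) + bi for row, bi in zip(A, b)]


def forward(layers: List[Layer], x: Sequence[float],
            rho: Callable[[float], float] = relu) -> List[float]:
    """Evaluate Phi(x) = W_L rho(W_{L-1} rho(... rho(W_1 x))).

    rho acts component-wise after every layer except the last one.
    """
    h = list(x)
    for layer in layers[:-1]:
        h = [rho(t) for t in affine(layer, h)]
    return affine(layers[-1], h)


def connectivity(layers: List[Layer]) -> int:
    """M(Phi): the number of nonzero edge weights (entries of the A_l)."""
    return sum(1 for A, _ in layers for row in A for a in row if a != 0.0)


def num_nodes(layers: List[Layer]) -> int:
    """N(Phi) = d + sum_l N_l, reading d off the first weight matrix."""
    d = len(layers[0][0][0])
    return d + sum(len(b) for _, b in layers)


# Hypothetical two-layer network R^2 -> R computing |x_1| as
# relu(x_1) + relu(-x_1); the second input coordinate is not connected.
abs_net: List[Layer] = [
    ([[1.0, 0.0], [-1.0, 0.0]], [0.0, 0.0]),  # W_1: R^2 -> R^2
    ([[1.0, 1.0]], [0.0]),                    # W_2: R^2 -> R
]
```

In this toy example, four of the six possible edges between adjacent layers carry nonzero weights, so `connectivity(abs_net)` is 4 while `num_nodes(abs_net)` is 5.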

Throughout the paper, we consider the case $\Phi:\mathbb{R}^{d}\to\mathbb{R}$ , i.e., $N_{L}=1$ , which includes situations such as the classification and temperature prediction problem described above. We emphasize, however, that the general results of Sections 3, 4, and 5 are readily generalized to $N_{L}>1$ .

We denote the class of networks $\Phi:\mathbb{R}^{d}\to\mathbb{R}$ with exactly $L$ layers, connectivity no more than $M$ , and activation function $\rho$ by $\mathcal{N}\mathcal{N}_{L,M,d,\rho}$ with the understanding that for $L=1$ , the set $\mathcal{N}\mathcal{N}_{L,M,d,\rho}$ is empty. Moreover, we let

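$$\mathcal{N}\mathcal{N}_{\infty,M,d,\rho}:=\bigcup_{L\in\mathbb{N}}\mathcal{N}\mathcal{N}_{L,M,d,\rho},\quad\mathcal{N}\mathcal{N}_{L,\infty,d,\rho}:=\bigcup_{M\in\mathbb{N}}\mathcal{N}\mathcal{N}_{L,M,d,\rho},\quad\mathcal{N}\mathcal{N}_{\infty,\infty,d,\rho}:=\bigcup_{L\in\mathbb{N}}\mathcal{N}\mathcal{N}_{L,\infty,d,\rho}.$$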

Now, given a function $f:\mathbb{R}^{d}\to\mathbb{R}$ , we are interested in the theoretically best possible approximation of $f$ by a network $\Phi\in\mathcal{N}\mathcal{N}_{\infty,M,d,\rho}$ . Specifically, we will want to know how the approximation quality depends on the connectivity $M$ and what the associated number of bits needed to store the network topology and the corresponding quantized weights is. Clearly, smaller $M$ entails lower computational complexity in terms of evaluating (1) and a smaller number of bits translates to reduced memory requirements for storing the network. Such a result benchmarks all conceivable algorithms for learning the network topology and weights.

1.2 Quantifying Approximation Quality

We proceed to formalizing our problem statement and start with a brief review of a widely used framework in approximation theory [15, 14].

Fix $\Omega\subset\mathbb{R}^{d}$ . Let $\mathcal{C}$ be a compact set of functions in $L^{2}(\Omega)$ , henceforth referred to as function class, and consider a corresponding system $\mathcal{D}:=(\varphi_{i})_{i\in I}\subset L^{2}(\Omega)$ with $I$ countable, termed representation system. We study the error of best $M$ -term approximation of $f\in\mathcal{C}$ in $\mathcal{D}$ :

Definition 1.2.

[15] Given $d\in\mathbb{N}$ , $\Omega\subset\mathbb{R}^{d}$ , a function class $\mathcal{C}\subset L^{2}(\Omega)$ , and a representation system $\mathcal{D}=(\varphi_{i})_{i\in I}\subset L^{2}(\Omega)$ , we define, for $f\in\mathcal{C}$ and $M\in\mathbb{N}$ ,

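$$\Gamma_{M}^{\mathcal{D}}(f):=\inf_{\substack{I_{M}\subseteq I,\ \#I_{M}=M,\\ (c_{i})_{i\in I_{M}}}}\Big\|f-\sum_{i\in I_{M}}c_{i}\varphi_{i}\Big\|_{L^{2}(\Omega)}.\tag{2}$$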

We call $\Gamma_{M}^{\mathcal{D}}(f)$ the best $M$ -term approximation error of $f$ in $\mathcal{D}$ . Every $f_{M}=\sum_{i\in I_{M}}c_{i}\varphi_{i}$ attaining the infimum in (2) is referred to as a best $M$ -term approximation of $f$ in $\mathcal{D}$ . The supremal $\gamma>0$ such that

$$\sup_{f\in\mathcal{C}}\Gamma_{M}^{\mathcal{D}}(f)\in\mathcal{O}(M^{-\gamma}),\quad M\to\infty,$$

will be denoted by $\gamma^{\ast}(\mathcal{C},\mathcal{D})$ . We say that the best $M$ -term approximation rate of $\mathcal{C}$ in the representation system $\mathcal{D}$ is $\gamma^{\ast}(\mathcal{C},\mathcal{D})$ .
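To make Definition 1.2 tangible, here is a minimal Python sketch for the special case where $\mathcal{D}$ is an orthonormal system and $f$ is a finite linear combination of its elements. This restriction is an assumption made for illustration; the definition itself covers general representation systems. In this case Parseval's identity implies that the infimum is attained by keeping the $M$ coefficients of largest magnitude, and the error is the $\ell^{2}$ norm of the discarded ones. The function name and the sample coefficient decay are hypothetical.

```python
import math
from typing import List, Sequence


def best_m_term_error(coeffs: Sequence[float], M: int) -> float:
    """Best M-term error for f = sum_i coeffs[i] * phi_i, (phi_i) orthonormal.

    By Parseval, the error equals the l2 norm of the discarded coefficients,
    so an optimal index set I_M holds the M largest magnitudes.
    """
    tail = sorted((abs(c) for c in coeffs), reverse=True)[M:]
    return math.sqrt(sum(c * c for c in tail))


# Hypothetical coefficient sequence with decay |c_i| = i^{-1.5}.
coeffs: List[float] = [i ** -1.5 for i in range(1, 2001)]
errors = [best_m_term_error(coeffs, M) for M in (10, 20, 40, 80)]
```

The list `errors` decreases as $M$ grows; how fast it decreases is what the rate $\gamma$ in the definition captures.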

Function classes $\mathcal{C}$ widely studied in the approximation theory literature include unit balls in Lebesgue, Sobolev, or Besov spaces [14], as well as $\alpha$ -cartoon-like functions [25]. A wealth of structured representation systems $\mathcal{D}$ is provided by the area of applied harmonic analysis, starting with wavelets [12], followed by ridgelets [3], curvelets [5], shearlets [28], parabolic molecules [27], and most generally $\alpha$ -molecules [25], which include all previously named systems as special cases. Further examples are Gabor frames [23] and wave atoms [13].

1.3 Approximation by Deep Neural Networks

The main conceptual contribution of this paper is the development of an approximation-theoretic framework for deep neural networks in the spirit of [15]. Specifically, we shall substitute the concept of best $M$ -term approximation with representation systems by best $M$ -edge approximation through neural networks. In other words, parsimony in terms of the number of participating elements of a representation system is replaced by parsimony in terms of connectivity. More formally, we consider the following setup.

Definition 1.3.

Given $d\in\mathbb{N}$ , $\Omega\subset\mathbb{R}^{d}$ , a function class $\mathcal{C}\subset L^{2}(\Omega)$ , and an activation function $\rho:\mathbb{R}\to\mathbb{R}$ , we define, for $f\in\mathcal{C}$ and $M\in\mathbb{N}$ ,

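$$\Gamma_{M}^{\mathcal{N}\mathcal{N}}(f):=\inf_{\Phi\in\mathcal{N}\mathcal{N}_{\infty,M,d,\rho}}\|f-\Phi\|_{L^{2}(\Omega)}.\tag{3}$$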

We call $\Gamma_{M}^{\mathcal{N}\mathcal{N}}(f)$ the best $M$ -edge approximation error of $f$ . The supremal $\gamma>0$ such that

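$$\sup_{f\in\mathcal{C}}\Gamma_{M}^{\mathcal{N}\mathcal{N}}(f)\in\mathcal{O}(M^{-\gamma}),\quad M\to\infty,$$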

will be denoted by $\gamma_{\mathcal{N}\mathcal{N}}^{\ast}(\mathcal{C},\rho)$ . We say that the best $M$ -edge approximation rate of $\mathcal{C}$ by neural networks with activation function $\rho$ is $\gamma_{\mathcal{N}\mathcal{N}}^{\ast}(\mathcal{C},\rho)$ .
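The rate in Definition 1.3 is an asymptotic quantity defined through a supremum over the whole class, so it cannot be read off from finitely many measurements. As a rough heuristic only, one can fit the slope of $\log(\text{error})$ against $\log M$ for a set of observed (connectivity, error) pairs. The sketch below does this with an ordinary least-squares fit; the function name and the synthetic data are assumptions for illustration and are not taken from the paper's experiments.

```python
import math
from typing import Sequence


def estimate_rate(Ms: Sequence[int], errors: Sequence[float]) -> float:
    """Least-squares slope of log(error) against log(M), negated.

    If error behaves like C * M^{-gamma}, the result approximates gamma.
    """
    xs = [math.log(m) for m in Ms]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return -sxy / sxx


# Synthetic data following error = 2 * M^{-0.75} exactly.
Ms = [16, 32, 64, 128, 256]
errs = [2.0 * m ** -0.75 for m in Ms]
gamma_hat = estimate_rate(Ms, errs)
```

On the synthetic data the fit recovers the exponent 0.75 up to rounding; on real data the estimate depends on the chosen range of $M$ and should be read as indicative at best.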

We emphasize that the infimum in (3) is taken over all networks with fixed activation function $\rho$ , fixed input dimension $d$ , no more than $M$ edges of nonzero weight, and arbitrary number of layers $L$ . In particular, this means that the infimum is taken over all possible network topologies. The resulting best $M$ -edge approximation rate is fundamental as it benchmarks all learning algorithms, i.e., all algorithms that map an input function $f$ and an $\varepsilon>0$ to a neural network that approximates $f$ with error no more than $\varepsilon$ . Our framework hence provides a means for assessing the performance of a given learning algorithm in the sense of allowing to measure how close the $M$ -edge approximation rate induced by the algorithm is to the best $M$ -edge approximation rate $\gamma_{\mathcal{N}\mathcal{N}}^{\ast}(\mathcal{C},\rho)$ .

1.4 Previous Work

The best-known results on approximation by neural networks are the universal approximation theorems of Hornik [31] and Cybenko [10], stating that every measurable function $f$ can be approximated arbitrarily well by a single-hidden-layer ( $L=2$ in our terminology) neural network. The literature on approximation-theoretic properties of networks with a single hidden layer continuing this line of work is abundant. Without any claim to completeness, we mention work on approximation error bounds in terms of the number of neurons for functions with bounded first moments [1], [2], the non-existence of localized approximations [6], a fundamental lower bound on approximation rates [16, 3], and the approximation of smooth or analytic functions [41, 39].

Approximation-theoretic results for networks with multiple hidden layers were obtained in [32, 40] for general functions, in [21] for continuous functions, and for functions together with their derivatives in [44]. In [6] it was shown that for certain approximation tasks deep networks can perform fundamentally better than single-hidden-layer networks. We also highlight two recent papers, which investigate the benefit—from an approximation-theoretic perspective—of multiple hidden layers. Specifically, in [19] it was shown that there exists a function which, although expressible through a small three-layer network, can only be represented through a very large two-layer network; here size is measured in terms of the total number of neurons in the network. In the setting of deep convolutional neural networks, first results of a nature similar to those in [19] were reported in [43]. Additionally, by linking the expressivity properties of neural networks to tensor decompositions, [8, 9] established the existence of functions that can be realized by relatively small deep convolutional networks but require exponentially larger shallow networks. For survey articles on approximation-theoretic aspects of neural networks, we refer the interested reader to [20, 47].

Most closely related to our work is that by Shaham, Cloninger, and Coifman [49], which shows that for functions that are sparse in specific wavelet frames, the best $M$ -edge approximation rate of three-layer neural networks is at least as high as the best $M$ -term approximation rate in piecewise linear wavelet frames.

1.5 Contributions

Our contributions can be grouped into four threads.

• Fundamental lower bound on connectivity. We quantify the minimum network connectivity needed to allow approximation of all elements of a given function class $\mathcal{C}$ to within a maximum allowed error. On a conceptual level, this result establishes a universal link between the complexity of a given function class and the connectivity required by corresponding approximating neural networks.

• Transfer from $M$ -term to $M$ -edge approximation. We develop a general framework for transferring best $M$ -term approximation results in representation systems to best $M$ -edge approximation results for neural networks.

• Memory requirements. We characterize the memory requirements needed to store the topology and the quantized weights of optimally-approximating neural networks.

• Realizability of optimal approximation rates. An important practical question is how neural networks trained by stochastic gradient descent (via backpropagation) [48] perform relative to the fundamental bounds established in the paper. Interestingly, our numerical experiments indicate that stochastic gradient descent can achieve $M$ -edge approximation rates quite close to the fundamental limit.

1.6 Outline of the Paper

Section 2 introduces the novel concept of effective best $M$ -edge approximation. The fundamental lower bound on connectivity is developed in Section 3. Section 4 describes a general framework for transferring best $M$ -term approximation results in representation systems to best $M$ -edge approximation results for neural networks. In Section 5, we apply this transfer framework to the broad class of affine representation systems, and Section 6 shows that this leads to optimal $M$ -edge approximation rates for cartoon functions. In Section 7, we briefly outline the extension of our main findings to the approximation of functions defined on manifolds. Finally, numerical results assessing the performance of stochastic gradient descent (via backpropagation) relative to our lower bound on connectivity are reported in Section 8.

2 Effective Best $M$ -term and Best $M$ -edge Approximation

We proceed by introducing $M$ -term approximation via dictionaries and $M$ -edge approximation via neural networks. These concepts do, however, not allow for a meaningful notion of optimality in practice. A remedy is provided by effective best $M$ -term approximation according to [17, 24] and the new concept of effective best $M$ -edge approximation introduced below.

2.1 Effective Best $M$ -term Approximation

The best $M$ -term approximation rate $\gamma^{\ast}(\mathcal{C},\mathcal{D})$ according to Definition 1.2 measures the hardness of approximation of a given function class $\mathcal{C}$ by a fixed representation system $\mathcal{D}$ . It is sensible to ask whether for a given function class $\mathcal{C}$ , there is a fundamental limit on $\gamma^{\ast}(\mathcal{C},\mathcal{D})$ when one is allowed to vary over $\mathcal{D}$ . As shown in [17, 24], every dense (and countable) $\mathcal{D}\subset L^{2}(\Omega)$ , $\Omega\subset\mathbb{R}^{d}$ , results in $\gamma^{\ast}(\mathcal{C},\mathcal{D})=\infty$ for all function classes $\mathcal{C}\subset L^{2}(\Omega)$ . However, identifying the elements in $\mathcal{D}$ participating in the best $M$ -term approximation is infeasible as it entails searching through the infinite set $\mathcal{D}$ and requires, in general, an infinite number of bits to describe the indices of the participating elements. This insight leads to the concept of “best $M$ -term approximation subject to polynomial-depth search” as introduced by Donoho in [17].

Definition 2.1.

Given $d\in\mathbb{N}$ , $\Omega\subset\mathbb{R}^{d}$ , a function class $\mathcal{C}\subset L^{2}(\Omega)$ , and a representation system $\mathcal{D}=(\varphi_{i})_{i\in I}\subset L^{2}(\Omega)$ , the supremal $\gamma>0$ so that there exist a polynomial $\pi$ and a constant $D>0$ such that

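$$\sup_{f\in\mathcal{C}}\ \inf_{\substack{I_{M}\subset\{1,\dots,\pi(M)\},\ \#I_{M}=M,\\ (c_{i})_{i\in I_{M}},\ \max_{i\in I_{M}}|c_{i}|\leq D}}\Big\|f-\sum_{i\in I_{M}}c_{i}\varphi_{i}\Big\|_{L^{2}(\Omega)}\in\mathcal{O}(M^{-\gamma}),\quad M\to\infty,$$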

will be denoted by $\gamma^{\ast,\text{eff}}(\mathcal{C},\mathcal{D})$ and referred to as the effective best $M$ -term approximation rate of $\mathcal{C}$ in the representation system $\mathcal{D}$ .

We will demonstrate in Section 3.2 that $\sup_{\mathcal{D}\subset L^{2}(\Omega)}\gamma^{\ast,\text{eff}}(\mathcal{C},\mathcal{D})$ is, indeed, finite under quite general conditions on $\mathcal{C}$ and, in particular, depends on the “description complexity” of $\mathcal{C}$ . This will allow us to assess the approximation capacity of a given representation system $\mathcal{D}$ for $\mathcal{C}$ by comparing $\gamma^{\ast,\text{eff}}(\mathcal{C},\mathcal{D})$ to the ultimate limit $\sup_{\mathcal{D}\subset L^{2}(\Omega)}\gamma^{\ast,\text{eff}}(\mathcal{C},\mathcal{D})$ .

2.2 Effective Best $M$ -edge Approximation

We next aim at establishing a relationship in the spirit of effective best $M$ -term approximation for approximation through deep neural networks. To this end, we first note that Definition 1.3 encounters problems similar to those identified for approximation by representation systems, namely the quantity $\sup_{\rho:\mathbb{R}\to\mathbb{R}}\gamma^{\ast}_{\mathcal{N}\mathcal{N}}(\mathcal{C},\rho)$ does not reveal anything tangible about the approximation complexity of $\mathcal{C}$ in deep neural networks, unless further constraints are imposed on the approximating network. To make this point, we first review the following remarkable result:

Theorem 2.2.

[37, Theorem 4] There exists a function $\rho:\mathbb{R}\to\mathbb{R}$ that is $C^{\infty}$ , strictly increasing, and satisfies $\lim_{x\to\infty}\rho(x)=1$ and $\lim_{x\to-\infty}\rho(x)=0$ , such that for any $d\in\mathbb{N}$ , any $f\in C([0,1]^{d})$ , and any $\varepsilon>0$ , there is a neural network $\Phi$ with activation function $\rho$ and three layers, of dimensions $N_{1}=3d,N_{2}=6d+3$ , and $N_{3}=1$ , satisfying

$$\sup_{x\in[0,1]^{d}}|f(x)-\Phi(x)|\leq\varepsilon.$$

We observe that the number of nodes and the number of layers of the approximating network in Theorem 2.2 do not depend on the approximation error $\varepsilon$ . In particular, $\varepsilon$ can be chosen arbitrarily small while having $\mathcal{M}(\Phi)$ bounded. By density of $C([0,1]^{d})$ in $L^{2}([0,1]^{d})$ and hence in all compact subsets of $L^{2}([0,1]^{d})$ , this implies the existence of an activation function $\rho:\mathbb{R}\to\mathbb{R}$ such that $\gamma^{\ast}_{\mathcal{N}\mathcal{N}}(\mathcal{C},\rho)=\infty$ for all compact $\mathcal{C}\subset L^{2}([0,1]^{d}),d\in\mathbb{N}$ . However, the networks underlying Theorem 2.2 necessarily lead to weights that are (in absolute value) not bounded by $|\pi(\varepsilon^{-1})|$ for a polynomial $\pi$ , a requirement we will have to impose to get rate-distortion-optimal approximation through neural networks (see Section 3). To see that the weights, indeed, do not obey a polynomial growth bound in $\varepsilon^{-1}$ , we note that thanks to Theorem 2.2, there exist $C>0$ and $\gamma>0$ such that

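$$\sup_{f\in\mathcal{C}}\ \inf_{\Phi_{M}\in\mathcal{N}\mathcal{N}_{3,M,d,\rho}}\|f-\Phi_{M}\|_{L^{2}(\Omega)}\leq CM^{-\gamma},\quad\text{for all }M\in\mathbb{N}.\tag{6}$$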

Now, as $\varepsilon$ in Theorem 2.2 can be made arbitrarily small while the connectivity of the corresponding networks remains upper-bounded by $21d^{2}+15d+3$ , (6) would have to hold for arbitrarily large $\gamma$ , in particular also for $\gamma>\gamma_{\mathcal{N}\mathcal{N}}^{\ast,\text{eff}}(\mathcal{C},\rho)$ , where $\gamma_{\mathcal{N}\mathcal{N}}^{\ast,\text{eff}}(\mathcal{C},\rho)$ is the effective best $M$ -edge approximation rate according to Definition 2.3. By Theorem 3.4 below, however, $\gamma_{\mathcal{N}\mathcal{N}}^{\ast,\text{eff}}(\mathcal{C},\rho)\leq{\gamma^{\ast}(\mathcal{C})}$ , where ${\gamma^{\ast}(\mathcal{C})}$ is the optimal exponent according to Definition 3.1. Owing to Definition 2.3, we can therefore conclude that the weights of the network achieving the infimum in (6) can not be bounded by a polynomial in $M\sim\varepsilon^{-1}$ whenever $\gamma^{\ast}(\mathcal{C})<\infty$ . Here and in the sequel, we write $a\sim b$ if the variables $a$ and $b$ are proportional, i.e., there exist uniform constants $c_{1},c_{2}>0$ such that $c_{1}a\leq b\leq c_{2}a$ .
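The connectivity bound $21d^{2}+15d+3$ quoted above can be checked by counting the edges of the network in Theorem 2.2, under the assumption that every edge between adjacent layers may carry a nonzero weight. With $N_{0}=d$, $N_{1}=3d$, $N_{2}=6d+3$, and $N_{3}=1$:

```latex
\begin{align*}
\mathcal{M}(\Phi)
  &\leq N_{0}N_{1} + N_{1}N_{2} + N_{2}N_{3} \\
  &= d\cdot 3d + 3d\,(6d+3) + (6d+3)\cdot 1 \\
  &= 3d^{2} + 18d^{2} + 9d + 6d + 3 \\
  &= 21d^{2} + 15d + 3 .
\end{align*}
```

The bound depends on the input dimension $d$ only and not on $\varepsilon$, which is the property the argument relies on.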

The observation just made resembles the problem in best $M$ -term approximation which eventually led to the concept of effective best $M$ -term approximation, where we restricted the search depth in the representation system $\mathcal{D}$ to be bounded by a given polynomial in $M$ and the coefficients $c_{i}$ to be bounded according to $\max_{i\in I_{M}}\!|c_{i}|\,\leq\,D$ . Interpreting the weights in the network as the counterpart of the coefficients $c_{i}$ in best $M$ -term approximation, we see that the restriction on the search depth corresponds to restricting the size of the indices enumerating the participating weights. The need for such a restriction is obviated by the tree structure of deep neural networks as exposed in detail in the proof of Proposition 3.6. The second restriction will lead us to a growth condition on the weights, which is more generous than the corresponding requirement of the $c_{i}$ in effective best $M$ -term approximation being bounded.

In summary, this leads to the novel concept of “best $M$ -edge approximation subject to polynomial weight growth” as formalized next.

Definition 2.3.

*Given $d\in\mathbb{N}$ , $\Omega\subset\mathbb{R}^{d}$ , a function class $\mathcal{C}\subset L^{2}(\Omega)$ , and an activation function $\rho:\mathbb{R}\to\mathbb{R}$ , the supremal $\gamma>0$ so that there exist an $L\in\mathbb{N}$ and a polynomial $\pi$ such that *

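$$\sup_{f\in\mathcal{C}}\inf_{\Phi_{M}\in{\mathcal{N}\mathcal{N}}_{L,M,d,\rho}^{\pi}}\|f-\Phi_{M}\|_{L^{2}(\Omega)}\in\mathcal{O}(M^{-\gamma}),\quad M\to\infty,$$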

*where ${\mathcal{N}\mathcal{N}}_{L,M,d,\rho}^{\pi}$ denotes the class of networks in ${\mathcal{N}\mathcal{N}}_{L,M,d,\rho}$ that have all their weights bounded in absolute value by $\pi(M)$ , will be referred to as effective best $M$ -edge approximation rate of $\mathcal{C}$ by neural networks and denoted by $\gamma^{\ast,\text{eff}}_{\mathcal{N}\mathcal{N}}(\mathcal{C},\rho)$ . *

We will show in Theorem 3.4 that $\sup_{\rho:\mathbb{R}\to\mathbb{R}}\gamma^{\ast,\text{eff}}_{\mathcal{N}\mathcal{N}}(\mathcal{C},\rho)$ is bounded and depends on the “description complexity” of the function class $\mathcal{C}$ .

3 Fundamental Bounds on Effective $M$ -Term and $M$ -Edge Approximation Rate

The purpose of this section is to establish fundamental bounds on effective best $M$ -term and effective best $M$ -edge approximation rates by evaluating the corresponding approximation strategies in the min-max rate distortion theory framework as developed in [17, 24].

3.1 Min-Max Rate Distortion Theory

Min-max rate distortion theory provides a theoretical foundation for deterministic lossy data compression. We recall the following notions and concepts from [17, 24].

Let $d\in\mathbb{N}$ , $\Omega\subset\mathbb{R}^{d}$ , and consider the function class $\mathcal{C}\subset L^{2}(\Omega)$ . Then, for each $\ell\in\mathbb{N}$ , we denote by

$$\mathfrak{E}^{\ell}:=\left\{E:\mathcal{C}\to\{0,1\}^{\ell}\right\}$$

the set of binary encoders of $\mathcal{C}$ of length $\ell$ , and we let

$$\mathfrak{D}^{\ell}:=\left\{D:\{0,1\}^{\ell}\to L^{2}(\Omega)\right\}$$

be the set of binary decoders of length $\ell$ . An encoder-decoder pair $(E,D)\in\mathfrak{E}^{\ell}\times\mathfrak{D}^{\ell}$ is said to achieve uniform error $\varepsilon$ over the function class $\mathcal{C}$ , if

$$\sup_{f\in\mathcal{C}}\|D(E(f))-f\|_{L^{2}(\Omega)}\leq\varepsilon.$$

A quantity of central interest is the minimal length $\ell\in\mathbb{N}$ for which there exists an encoder-decoder pair $(E,D)\in\mathfrak{E}^{\ell}\times\mathfrak{D}^{\ell}$ that achieves uniform error $\varepsilon$ over the function class $\mathcal{C}$ , along with its asymptotic behavior as made precise in the following definition.
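To make the notion of an encoder-decoder pair concrete, the following minimal sketch treats a toy function class that is not considered in the paper: the constant functions $f\equiv a$ on $\Omega=[0,1]$ with $a\in[0,1]$ , for which the $L^{2}(\Omega)$ -distance of two class members equals the distance of their constants. The function names `encode`, `decode`, and `sufficient_length` are hypothetical, and the uniform quantizer is merely one possible choice of pair, not claimed to be optimal.

```python
from math import ceil, floor, log2

def encode(a, ell):
    """Map the constant function f = a, a in [0, 1], to a bitstring of length ell."""
    cells = 2 ** ell
    idx = min(floor(a * cells), cells - 1)   # cell of a in a uniform partition of [0, 1]
    return format(idx, "0{}b".format(ell))

def decode(bits):
    """Return the midpoint of the cell indexed by the bitstring."""
    ell = len(bits)
    return (int(bits, 2) + 0.5) / 2 ** ell

def sufficient_length(eps):
    """A length ell for which this particular pair has uniform error at most eps."""
    return max(1, ceil(log2(1.0 / (2.0 * eps))))

ell = sufficient_length(0.01)
worst = max(abs(decode(encode(i / 1000, ell)) - i / 1000) for i in range(1001))
print(ell, worst <= 0.01)
```

Every constant lies within $2^{-(\ell+1)}$ of the midpoint of its cell, so for this toy class a length growing like $\log_{2}(\varepsilon^{-1})$ suffices. Richer function classes require code lengths that grow polynomially in $\varepsilon^{-1}$ , which is what the optimal exponent introduced next quantifies.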

Definition 3.1.

Let $d\in\mathbb{N}$ , $\Omega\subset\mathbb{R}^{d}$ , and $\mathcal{C}\subset L^{2}(\Omega)$ . Then, for $\varepsilon>0$ , the minimax code length $L(\varepsilon,\mathcal{C})$ is

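$$L(\varepsilon,\mathcal{C}):=\min\left\{\ell\in\mathbb{N}:\ \exists\,(E,D)\in\mathfrak{E}^{\ell}\times\mathfrak{D}^{\ell}\ \text{with}\ \sup_{f\in\mathcal{C}}\|D(E(f))-f\|_{L^{2}(\Omega)}\leq\varepsilon\right\}.$$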

Moreover, the optimal exponent $\gamma^{*}(\mathcal{C})$ is defined as

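$$\gamma^{*}(\mathcal{C}):=\sup\left\{\gamma\in\mathbb{R}:\ L(\varepsilon,\mathcal{C})\in\mathcal{O}\!\left(\varepsilon^{-1/\gamma}\right),\ \varepsilon\to 0\right\}.$$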

The optimal exponent $\gamma^{*}(\mathcal{C})$ quantifies the minimum growth rate of $L(\varepsilon,\mathcal{C})$ as the error $\varepsilon$ tends to zero and can hence be seen as quantifying the “description complexity” of the function class $\mathcal{C}$ . Larger $\gamma^{*}(\mathcal{C})$ results in smaller growth rate and hence smaller memory requirements for storing signals $f\in\mathcal{C}$ such that reconstruction with uniformly bounded error is possible. The quantity $\gamma^{*}(\mathcal{C})$ is closely related to the concept of Kolmogorov entropy [45]. Remark 5.10 in [24] makes this connection explicit.

The optimal exponent is known for several function classes, such as subsets of Besov spaces $B_{p,q}^{s}(\mathbb{R}^{d})$ with $1\leq p,q<\infty,s>0$ , and $q>(s+1/2)^{-1}$ , namely all functions in $B_{p,q}^{s}(\mathbb{R}^{d})$ of bounded norm, see e.g. [7]. If $\mathcal{C}$ is a bounded subset of $B_{p,q}^{s}(\mathbb{R}^{d})$ , then we have $\gamma^{*}(\mathcal{C})={s}/{d}$ . In the present paper, we shall be particularly interested in so-called $\beta$ -cartoon-like functions, for which the optimal exponent is given by ${\beta}/{2}$ (see [18, 26] and Theorem 6.3).
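As a worked illustration of how the optimal exponent translates into code length growth, consider the two examples just mentioned with specific parameter values. The values $s=1$ , $d=2$ , and $\beta=2$ are chosen here for illustration only, and the symbol $\approx$ is used heuristically for growth at the level of exponents, ignoring factors that the exponent does not capture.

```latex
% Bounded subset of a Besov space with s = 1, d = 2:
\gamma^{*}(\mathcal{C}) = \frac{s}{d} = \frac{1}{2}
\quad\Longrightarrow\quad
L(\varepsilon,\mathcal{C}) \in \mathcal{O}\!\left(\varepsilon^{-1/\gamma}\right)
\ \text{for all } \gamma < \tfrac{1}{2},
\qquad L(\varepsilon,\mathcal{C}) \approx \varepsilon^{-2}.

% beta-cartoon-like functions with beta = 2:
\gamma^{*}(\mathcal{C}) = \frac{\beta}{2} = 1
\quad\Longrightarrow\quad
L(\varepsilon,\mathcal{C}) \in \mathcal{O}\!\left(\varepsilon^{-1/\gamma}\right)
\ \text{for all } \gamma < 1,
\qquad L(\varepsilon,\mathcal{C}) \approx \varepsilon^{-1}.
```

Halving the target error $\varepsilon$ thus roughly quadruples the required code length in the first case and roughly doubles it in the second, in line with the statement that a larger optimal exponent means smaller memory requirements.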

3.2 Fundamental Bound on Effective Best $M$ -Term Approximation Rate

We next recall a result from [17, 24], which says that, for a given function class $\mathcal{C}$ , the optimal exponent $\gamma^{*}(\mathcal{C})$ constitutes a fundamental bound on the effective best $M$ -term approximation rate of $\mathcal{C}$ in any representation system. This gives operational meaning to $\gamma^{*}(\mathcal{C})$ .

Theorem 3.2 ([17, 24]).

Let $d\in\mathbb{N}$ , $\Omega\subset\mathbb{R}^{d}$ , $\mathcal{C}\subset L^{2}(\Omega)$ , and assume that the effective best $M$ -term approximation rate of $\mathcal{C}$ in $\mathcal{D}\subset L^{2}(\Omega)$ is $\gamma^{\ast,\text{eff}}(\mathcal{C},\mathcal{D})$ . Then, we have

$$\gamma^{\ast,\text{eff}}(\mathcal{C},\mathcal{D})\leq\gamma^{\ast}(\mathcal{C}).$$

In light of this result the following definition is natural (see also [24]).

Definition 3.3.

Let $d\in\mathbb{N}$ , $\Omega\subset\mathbb{R}^{d}$ , and assume that the effective best $M$ -term approximation rate of $\mathcal{C}\subset L^{2}(\Omega)$ in $\mathcal{D}\subset L^{2}(\Omega)$ is $\gamma^{\ast,\text{eff}}(\mathcal{C},\mathcal{D})$ . If

$$\gamma^{\ast,\text{eff}}(\mathcal{C},\mathcal{D})=\gamma^{\ast}(\mathcal{C}),$$

*then the function class $\mathcal{C}$ is said to be optimally representable by $\mathcal{D}$ . *

3.3 Fundamental Bound on Effective Best $M$ -Edge Approximation Rate

We now state the first main result of the paper, namely the equivalent of Theorem 3.2 for approximation by deep neural networks. Specifically, we establish that the optimal exponent $\gamma^{*}(\mathcal{C})$ also constitutes a fundamental bound on the effective best $M$ -edge approximation rate of $\mathcal{C}$ . We say below that a function $f:\mathbb{R}\to\mathbb{R}$ is dominated by a function $g:\mathbb{R}\to\mathbb{R}$ if $|f(x)|\leq|g(x)|$ , for all $x\in\mathbb{R}$ .

Theorem 3.4.

Let $d\in\mathbb{N}$ , $\Omega\subset\mathbb{R}^{d}$ be bounded, and $\mathcal{C}\subset L^{2}(\Omega)$ . Then, for all $\rho:\mathbb{R}\to\mathbb{R}$ that are Lipschitz-continuous or differentiable with $\rho^{\prime}$ dominated by an arbitrary polynomial, we have

$$\gamma^{\ast,\text{eff}}_{\mathcal{N}\mathcal{N}}(\mathcal{C},\rho)\leq\gamma^{\ast}(\mathcal{C}).$$

The key ingredients of the proof of Theorem 3.4 are developed throughout this section and the formal proof will be stated at the end of the section. Before embarking on this, we note that, in analogy to Definition 3.3, what we just found suggests the following.

Definition 3.5.

For $d\in\mathbb{N}$ , $\Omega\subset\mathbb{R}^{d}$ bounded, we say that the function class $\mathcal{C}\subset L^{2}(\Omega)$ is optimally representable by neural networks with activation function $\rho:\mathbb{R}\to\mathbb{R}$ , if

$$\gamma^{\ast,\text{eff}}_{\mathcal{N}\mathcal{N}}(\mathcal{C},\rho)=\gamma^{\ast}(\mathcal{C}).$$

It is remarkable that the fundamental limits of approximation through representation systems and approximation through deep neural networks are determined by the same quantity, although the approximants in the two cases are vastly different, namely linear combinations of elements of a representation system with the participating functions identified subject to a polynomial-depth search constraint in the former, and concatenations of affine functions followed by non-linearities under growth constraints on the weights in the network in the latter case.

A key ingredient of the proof of Theorem 3.4 is the following result, which establishes a fundamental lower bound on the connectivity of networks with quantized weights achieving uniform error $\varepsilon$ over a given function class.

Proposition 3.6.

Let $d\in\mathbb{N}$ , $\Omega\subset\mathbb{R}^{d}$ , $\rho:\mathbb{R}\to\mathbb{R}$ , $c>0$ , and $\mathcal{C}\subset L^{2}(\Omega)$ . Further, let

[TABLE]

be a map such that, for each pair $(\varepsilon,f)\in(0,1/2)\times\mathcal{C}$ , every weight of the neural network $\mathrm{\mathbf{Learn}}(\varepsilon,f)$ is represented by no more than $\lceil c\log_{2}(\varepsilon^{-1})\rceil$ bits while guaranteeing that

$$\sup_{f\in\mathcal{C}}\|f-\mathrm{\mathbf{Learn}}(\varepsilon,f)\|_{L^{2}(\Omega)}\leq\varepsilon.$$

Then,

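$$\sup_{f\in\mathcal{C}}\mathcal{M}(\mathrm{\mathbf{Learn}}(\varepsilon,f))\notin\mathcal{O}\!\left(\varepsilon^{-1/\gamma}\right),\ \varepsilon\to 0,\quad\text{for all }\gamma>\gamma^{*}(\mathcal{C}).\tag{9}$$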

Proof.

The proof will be effected by identifying $\mathrm{\mathbf{Learn}}(\varepsilon,f)=D(E(f))$ , where $(E,D)\in\mathfrak{E}^{\ell(\varepsilon)}\times\mathfrak{D}^{\ell(\varepsilon)}$ are encoder-decoder pairs achieving uniform error $\varepsilon$ over $\mathcal{C}$ with

[TABLE]

where $C_{0}>0$ is a constant, and such that the weights in $\mathrm{\mathbf{Learn}}(\varepsilon,f)$ are represented by no more than $\lceil c\log_{2}(\varepsilon^{-1})\rceil$ bits each. Before presenting the construction of these encoder-decoder pairs, we establish that this, indeed, implies the statement of the proposition. To this end, let $\gamma>\gamma^{*}(\mathcal{C})$ and, towards a contradiction to (9), assume that $\sup_{f\in\mathcal{C}}\mathcal{M}(\mathrm{\mathbf{Learn}}(\varepsilon,f))\in\mathcal{O}(\varepsilon^{-1/\gamma}),\varepsilon\rightarrow 0$ . Then, there would exist a $\nu$ with $\gamma>\nu>\gamma^{*}(\mathcal{C})$ such that there are encoder-decoder pairs $(E,D)\in\mathfrak{E}^{\ell(\varepsilon)}\times\mathfrak{D}^{\ell(\varepsilon)}$ achieving uniform error $\varepsilon$ over $\mathcal{C}$ with codelength $\ell(\varepsilon)\in\mathcal{O}(\varepsilon^{-1/\nu}),\varepsilon\rightarrow 0$ , which stands in contradiction to the optimality of $\gamma^{\ast}(\mathcal{C})$ according to Definition 3.1.

We proceed to the construction of the encoder-decoder pairs, which will be accomplished by encoding the network topology and quantized weights in bitstrings of length $\ell(\varepsilon)$ satisfying (10) while guaranteeing unique reconstruction. Fix $f\in\mathcal{C}$ . We enumerate the nodes in $\mathrm{\mathbf{Learn}}(\varepsilon,f)$ by assigning natural numbers, henceforth called indices, increasing from left to right in every layer as schematized in Figure 2. For the sake of notational simplicity, we also set $\Phi:=\mathrm{\mathbf{Learn}}(\varepsilon,f)$ and $M:=\mathcal{M}(\Phi)$ . Without loss of generality, we assume throughout that $M$ is a power of $2$ and greater than $1$ . For all $M$ that are not powers of $2$ , we make use of the fact that $\mathcal{N}\mathcal{N}_{L,M,d,\rho}\subset\mathcal{N}\mathcal{N}_{L,M^{\prime},d,\rho}$ , where $M^{\prime}$ is the smallest power of $2$ larger than $M$ , and we encode the network like an $M^{\prime}$ -edge network. Since $M<M^{\prime}<2M$ this affects $\ell(\varepsilon)$ by a multiplicative constant only. The case $M=0$ will be dealt with in Step 1 below.

We recall that the number of layers of $\Phi$ is denoted by $L$ , the number of nodes in these layers is $N_{1},\dots,N_{L}$ (see Definition 1.1), and $d$ stands for the dimension of the input layer.

Denoting the number of nodes in layer $\ell=1,...,L-1$ associated with edges of nonzero weight in the following layer by $\widetilde{N}_{\ell}$ and setting $\widetilde{N}_{L}=N_{L}$ , it follows that

$$d+\sum_{\ell=1}^{L}\widetilde{N}_{\ell}\leq 2\widetilde{M},\tag{11}$$

where we let $\widetilde{M}:=M+d$ . All other nodes do not contribute to the mapping $\Phi(x)$ and can hence be ignored.

Moreover, we can assume that

$$L\leq\widetilde{M},\tag{12}$$

as otherwise there would be at least one layer $\ell>1$ such that

[TABLE]

As a consequence, the reduced network

[TABLE]

realizes the same function as the original network $\Phi$ but has less than $L$ layers. This reduction can be repeated inductively until the resulting reduced network satisfies (12).

The bitstring representing $\Phi$ is constructed according to the following steps.

Step 1: If $M=0$ , we encode the network by a leading $0$ followed by the bitstring representing the node weight in the last layer. Upon defining $0\log_{2}(0)=0$ , we then note that (10) holds trivially and we terminate the encoding procedure. Else, we encode the number of nonzero edge weights, $M$ , by starting the overall bitstring with $M$ $1$ ’s followed by a single $0$ . The length of this bitstring is therefore bounded by $\widetilde{M}$ .

Step 2: We continue by encoding the number of layers in the network. Thanks to (12) this requires no more than $\log_{2}(\widetilde{M})$ bits. We thus reserve the next $\log_{2}(\widetilde{M})$ bits for the binary representation of $L$ .

Step 3: Next, we store the dimension $d$ of the input layer and the numbers of nodes $\widetilde{N}_{\ell},\ell=1,\dots,L$ , associated with edges of nonzero weight. As by (11) $d\leq\widetilde{M}$ and $\widetilde{N}_{\ell}\leq 2\widetilde{M}$ , for all $\ell$ , we can encode (generously) $d$ and each $\widetilde{N}_{\ell}$ using $\log_{2}(\widetilde{M})+1$ bits. For the sake of concreteness, we first encode $d$ followed by $\widetilde{N}_{1},\dots,\widetilde{N}_{L}$ in that order. In total, Step 3 requires a bitstring of length

$$(L+1)\cdot(\log_{2}(\widetilde{M})+1).$$

In combination with Steps 1 and 2 this yields an overall bitstring of length at most

[TABLE]

Step 4: We encode the topology of the graph associated with $\Phi$ and consider only nodes that contribute to the mapping $\Phi(x)$ . Recall that we assigned a unique index $i$ , ranging from $1$ to $\widetilde{N}:=d+\sum_{\ell=1}^{L}\widetilde{N}_{\ell}$ , to each of these nodes. By (11) each of these indices can be encoded by a bitstring of length $\log_{2}(\widetilde{M})+1$ . We denote the bitstring corresponding to index $i$ by $b(i)\in\{0,1\}^{\log_{2}(\widetilde{M})+1}$ and let $n(i)$ be the number of children of the node with index $i$ , i.e., the number of nodes in the next layer connected to the node with index $i$ via an edge. For each node $i=1,\dots,\widetilde{N}$ , we form a bitstring of length $n(i)\cdot(\log_{2}(\widetilde{M})+1)$ by concatenating the bitstrings $b(j)$ for all $j$ such that there is an edge between $i$ and $j$ . We follow this string with an all-zeros bitstring of length $\log_{2}(\widetilde{M})+1$ to signal the transition to the node with index $i+1$ . The enumeration is concluded with an all-zeros bitstring of length $\log_{2}(\widetilde{M})+1$ signaling that the last node has been reached. Overall, this yields a bitstring of length

[TABLE]

where we used $\sum_{i=1}^{\widetilde{N}}n(i)=M<\widetilde{M}$ and (11). Combining (13) and (14) it follows that we have encoded the overall topology of the network $\Phi$ using at most

[TABLE]

bits.
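Steps 1 through 4 can be sketched in a few lines of code. The sketch below is illustrative only and deviates from the text in three labeled respects: every field, including the one for $L$ , uses the same width $\lceil\log_{2}(\widetilde{M})\rceil+1$ ; the decoder is assumed to know the input dimension $d$ in advance; and no additional terminating block is appended after the last node. The function names `to_bits`, `encode_topology`, and `decode_topology` are hypothetical.

```python
from math import ceil, log2

def to_bits(value, width):
    """Fixed-width binary representation; fails loudly if the value does not fit."""
    assert 0 <= value < 2 ** width
    return format(value, "0{}b".format(width))

def encode_topology(d, layer_sizes, edges):
    """Encode M, L, d, the layer sizes, and the child lists of all nodes.

    Nodes carry indices 1..d (inputs) followed by the nodes of layers 1..L.
    `edges` is a set of (parent_index, child_index) pairs with nonzero weight.
    """
    M = len(edges)
    assert M >= 1                                 # M = 0 is treated separately in the text
    width = ceil(log2(M + d)) + 1                 # stands in for log2(M~) + 1
    bits = "1" * M + "0"                          # Step 1: M in unary, terminated by a 0
    bits += to_bits(len(layer_sizes), width)      # Step 2: number of layers L
    bits += to_bits(d, width)                     # Step 3: d, then the layer sizes
    for size in layer_sizes:
        bits += to_bits(size, width)
    for i in range(1, d + sum(layer_sizes) + 1):  # Step 4: children of node i, then zeros
        for child in sorted(c for (p, c) in edges if p == i):
            bits += to_bits(child, width)
        bits += "0" * width
    return bits

def decode_topology(bits, d):
    """Invert encode_topology; the input dimension d is assumed known to the decoder."""
    M = bits.index("0")                           # length of the unary prefix
    width = ceil(log2(M + d)) + 1
    pos = M + 1

    def read():
        nonlocal pos
        value = int(bits[pos:pos + width], 2)
        pos += width
        return value

    L = read()
    assert read() == d
    layer_sizes = [read() for _ in range(L)]
    edges = set()
    for i in range(1, d + sum(layer_sizes) + 1):
        child = read()
        while child != 0:                         # a zero block ends the child list of node i
            edges.add((i, child))
            child = read()
    return layer_sizes, edges

# Toy network: inputs 1 and 2, hidden nodes 3 and 4, output node 5.
edges = {(1, 3), (2, 3), (2, 4), (3, 5), (4, 5)}
code = encode_topology(2, [2, 1], edges)
assert decode_topology(code, 2) == ([2, 1], edges)
```

Because node indices start at $1$ , the all-zeros block can never be confused with a child index, which is what makes the child lists uniquely decodable. The code length is dominated by the topology part, which uses one block per edge and one block per node.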

Step 5: We encode the weights of $\Phi$ . By assumption, each weight can be represented by a bitstring of length $\lceil c\log_{2}(\varepsilon^{-1})\rceil$ . For each node $i=1,\dots,\widetilde{N}$ , we reserve the first $\lceil c\log_{2}(\varepsilon^{-1})\rceil$ bits to encode its associated node weight and, for each of its children a bitstring of length $\lceil c\log_{2}(\varepsilon^{-1})\rceil$ to encode the weight corresponding to the edge between that child and its parent node. Concatenating the results in ascending order of child node indices, we get a bitstring of length $(n(i)+1)\cdot(\lceil c\log_{2}(\varepsilon^{-1})\rceil)$ for node $i$ , and an overall bitstring of length

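$$\sum_{i=1}^{\widetilde{N}}(n(i)+1)\cdot\lceil c\log_{2}(\varepsilon^{-1})\rceil=(M+\widetilde{N})\cdot\lceil c\log_{2}(\varepsilon^{-1})\rceil\leq 3\widetilde{M}\cdot\lceil c\log_{2}(\varepsilon^{-1})\rceil$$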

representing the weights of the graph associated with the network $\Phi$ .

With (15) this shows that the overall number of bits needed to encode the network topology and weights is no more than

[TABLE]

The network can be recovered by sequentially reading out $M,L,d$ , the $\widetilde{N}_{\ell}$ , the topology, and the quantized weights from the overall bitstring. It is not difficult to verify that the individual steps in the encoding procedure were crafted such that this yields unique recovery. As (17) can be upper-bounded by

[TABLE]

for a constant $C_{0}>0$ depending on $c$ and $d$ only, we have constructed an encoder-decoder pair $(E,D)\in\mathfrak{E}^{\ell(\varepsilon)}\times\mathfrak{D}^{\ell(\varepsilon)}$ with $\ell(\varepsilon)$ satisfying (10). This concludes the proof. ∎

Proposition 3.6 applies to networks that have each weight represented by a finite number of bits scaling according to $\log_{2}(\varepsilon^{-1})$ while guaranteeing that the underlying encoder-decoder pair achieves uniform error $\varepsilon$ over $\mathcal{C}$ . We next show that such a compatibility is possible for networks with sufficiently regular activation functions, namely those that are either Lipschitz-continuous or differentiable such that $\rho^{\prime}$ is dominated by an arbitrary polynomial. For such activation functions, faithful quantization of the weights of a network is possible.

Lemma 3.7.

Let $d,L,k,M\in\mathbb{N},\eta\in(0,1/2),\Omega\subset\mathbb{R}^{d}$ be bounded, and $\rho:\mathbb{R}\to\mathbb{R}$ be either Lipschitz-continuous or differentiable such that $\rho^{\prime}$ is dominated by an arbitrary polynomial. Let $\Phi\in\mathcal{N}\mathcal{N}_{L,M,d,\rho}$ with $M\leq\eta^{-k}$ and all its weights bounded (in absolute value) by $\eta^{-k}$ . Then, there exist $m\in\mathbb{N}$ , depending on $k,L$ , and $\rho$ only, and $\widetilde{\Phi}\in\mathcal{N}\mathcal{N}_{L,M,d,\rho}$ such that

[TABLE]

and all weights of $\widetilde{\Phi}$ are elements of $\eta^{m}\mathbb{Z}\cap[-\eta^{-k},\eta^{-k}]$ .

Proof.

We prove the statement for Lipschitz-continuous $\rho$ only. The argument for differentiable activation functions whose first derivative is dominated by a polynomial proceeds along similar lines.

Let $m\in\mathbb{N}$ , to be specified later, and denote by $\widetilde{\Phi}$ the network that results by replacing all weights of $\Phi$ by a closest element in $\eta^{m}\mathbb{Z}\cap[-\eta^{-k},\eta^{-k}]$ . Set $C_{\mathrm{max}}:=\eta^{-k}$ and denote the maximum of $1$ and the total number of edge weights plus node weights that contribute to the mapping $\Phi(x)$ by $C_{W}$ . Note that $C_{W}\leq 3M\leq 3\eta^{-k}$ , where the latter inequality is by assumption. For $\ell=1,\dots,L-1$ , define $\Phi^{\ell}:\Omega\to\mathbb{R}^{N_{\ell}}$ as

[TABLE]

and $\widetilde{\Phi}^{\ell}$ accordingly, and let, for $\ell=1,\dots,L-1$ ,

[TABLE]

Denote the maximum of 1 and the Lipschitz constant of $\rho$ by $C_{\rho}$ , set $C_{0}:=\max\{1,\sup\{|x|:x\in\Omega\}\}$ , and let

[TABLE]

Then, it is not difficult to see that

[TABLE]

Additionally, we observe that

[TABLE]

We now bound the quantity $C_{\ell}$ for $\ell=1,\dots,L-1$ . A simple computation, exploiting the Lipschitz-continuity of $\rho$ , yields

[TABLE]

Since $\rho$ is continuous on $\mathbb{R}$ we have $|\rho(0)|<\infty$ and thus, by $C_{\rho},C_{W},C_{\mathrm{max}}\geq 1$ , there exists $C^{\prime}>0$ such that

[TABLE]

As $C_{W}$ and $C_{\mathrm{max}}$ are both bounded by $\eta^{-k-2}$ , it follows that $C_{\ell}$ is bounded by $\eta^{-p}$ for a $p\in\mathbb{N}$ . We can therefore find $n\in\mathbb{N}$ such that

[TABLE]

Invoking (19), we conclude that

[TABLE]

where we set $e_{0}=0$ . We proceed by induction to prove that there exists $r\in\mathbb{N}$ such that for all $\ell=1,\dots,L-1$ ,

[TABLE]

Clearly there exists $r\in\mathbb{N}$ such that $e_{1}\leq\eta^{m-r}$ . Moreover, one easily verifies that the existence of an $r\in\mathbb{N}$ such that (23) is satisfied for an $\ell\in\{1,\dots,L-2\}$ , thanks to (22), implies the existence of an $r\in\mathbb{N}$ such that (23) is satisfied for $\ell$ replaced by $\ell+1$ . This concludes the induction argument.

Using (21) and (23) in (20), we finally obtain

[TABLE]

which yields $e_{L}\leq\eta$ for sufficiently large $m$ . ∎

Remark 3.8.

Note that the weights of the network being elements of $\eta^{m}\mathbb{Z}\cap[-\eta^{-k},\eta^{-k}]$ implies that each weight can be represented by no more than $\lceil c\log_{2}(\eta^{-1})\rceil$ bits, for some constant $c>0$ .
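The quantization grid of Lemma 3.7 and the bit count of Remark 3.8 can be checked numerically. The following sketch rounds a weight to a closest element of $\eta^{m}\mathbb{Z}\cap[-\eta^{-k},\eta^{-k}]$ and counts the bits needed to index all grid points. The function names `quantize` and `bits_per_weight` are hypothetical, and $\eta=1/4$ is chosen so that all powers of $\eta$ are exact in floating-point arithmetic.

```python
from math import ceil, floor, log2

def quantize(w, eta, m, k):
    """Replace w by a closest element of eta^m * Z within [-eta^(-k), eta^(-k)]."""
    step = eta ** m
    max_index = floor(eta ** (-k) / step)     # largest grid index inside the interval
    index = max(-max_index, min(max_index, round(w / step)))
    return index * step

def bits_per_weight(eta, m, k):
    """Bits needed to index all grid points of eta^m * Z in [-eta^(-k), eta^(-k)]."""
    max_index = floor(eta ** (-k) / eta ** m)
    return ceil(log2(2 * max_index + 1))

eta, m, k = 0.25, 3, 1
q = quantize(0.3, eta, m, k)
print(q, abs(q - 0.3) <= eta ** m / 2, bits_per_weight(eta, m, k))
```

The grid contains at most $2\eta^{-(k+m)}+1\leq 4\eta^{-(k+m)}$ points, so that $(k+m)\log_{2}(\eta^{-1})+2$ bits suffice. Since $\log_{2}(\eta^{-1})>1$ for $\eta\in(0,1/2)$ , one admissible choice of the constant in Remark 3.8 within this sketch is $c=k+m+2$ .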

Proposition 3.6 not only says that the connectivity growth rate cannot exceed $\mathcal{O}\!\left(\varepsilon^{-1/\gamma^{\ast}(\mathcal{C})}\right),\,\varepsilon\rightarrow 0$ , but its proof, by virtue of constructing an encoder-decoder pair that achieves this growth rate, also provides an achievability result. We next establish a matching strong converse in the sense of showing that for $\gamma>{\gamma^{\ast}(\mathcal{C})}$ , the uniform approximation error remains bounded away from zero for infinitely many $M\in\mathbb{N}$ . To simplify terminology in the sequel, we introduce the notion of a polynomially bounded variable.

Definition 3.9.

A real variable $X$ depending on the variables $z_{i}\in D_{i}\subset\mathbb{R}$ , $i=1,\dots,N$ , is said to be polynomially bounded in $z_{1},\dots,z_{N}$ , if there exists an $N$ -dimensional polynomial $\pi$ such that $|X|\leq|\pi(z_{1},\dots,z_{N})|$ , for all $z_{i}\in D_{i},i=1,\dots,N$ . A set of real variables $(X_{j})_{j\in J}$ , each depending on $z_{i}\in D_{i}\subset\mathbb{R}$ , $i=1,\dots,N$ , is uniformly polynomially bounded in $z_{1},\dots,z_{N}$ , if there exists an $N$ -dimensional polynomial $\pi$ such that $|X_{j}|\leq|\pi(z_{1},\dots,z_{N})|$ , for all $j\in J$ and all $z_{i}\in D_{i}$ , $i=1,\dots,N$ .

We will refrain from explicitly specifying the $D_{i}$ in Definition 3.9 whenever they are clear from the context.

Remark 3.10.

If $D_{i}=\mathbb{R}\setminus[-B_{i},B_{i}]$ for some $B_{i}\geq 1,i=1,\dots,N$ , then a variable $X$ depending on $z_{i}\in D_{i},i=1,\dots,N,$ is polynomially bounded in $z_{1},\dots,z_{N}$ if and only if there exists a $k\in\mathbb{N}$ such that $|X|\leq|z_{1}^{k}\cdot z_{2}^{k}\cdot.\ .\ .\cdot z_{N}^{k}|$ , for all $z_{i}\in D_{i}$ .

Proposition 3.11.

Let $d,L\in\mathbb{N}$ , $\Omega\subset\mathbb{R}^{d}$ be bounded, $\pi$ be a polynomial, $\mathcal{C}\subset L^{2}(\Omega)$ , $\rho:\mathbb{R}\to\mathbb{R}$ either Lipschitz-continuous or differentiable such that $\rho^{\prime}$ is dominated by an arbitrary polynomial. Then, for all $C>0$ and $\gamma>\gamma^{*}(\mathcal{C})$ we have that

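$$\sup_{f\in\mathcal{C}}\inf_{\Phi_{M}\in{\mathcal{N}\mathcal{N}}_{L,M,d,\rho}^{\pi}}\|f-\Phi_{M}\|_{L^{2}(\Omega)}\geq CM^{-\gamma}\quad\text{for infinitely many }M\in\mathbb{N}.\tag{24}$$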

Proof.

Let $\gamma>\gamma^{*}(\mathcal{C})$ . Assume, towards a contradiction, that (24) holds only for finitely many $M\in\mathbb{N}$ . Then, there exists a constant $C$ such that (24) holds for no $M\in\mathbb{N}$ and hence there exists a constant $C$ so that

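$$\sup_{f\in\mathcal{C}}\inf_{\Phi_{M}\in{\mathcal{N}\mathcal{N}}_{L,M,d,\rho}^{\pi}}\|f-\Phi_{M}\|_{L^{2}(\Omega)}\leq CM^{-\gamma},\quad\text{for all }M\in\mathbb{N}.$$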

Setting $M_{\varepsilon}:=\lceil(\varepsilon/(3C))^{-1/\gamma}\rceil$ , it follows that, for every $f\in\mathcal{C}$ and every $\varepsilon\in(0,1/2)$ , there exists a neural network $\Phi_{\varepsilon,f}\in{\mathcal{N}\mathcal{N}}_{L,M_{\varepsilon},d,\rho}^{\pi}$ such that

[TABLE]

As the weights of $\Phi_{\varepsilon,f}$ are polynomially bounded in $M_{\varepsilon}$ , they are polynomially bounded in $\varepsilon^{-1}$ . By Lemma 3.7 and Remark 3.10, there hence exists a network $\widetilde{\Phi}_{\varepsilon,f}$ whose weights are represented by no more than $\lceil c\log_{2}(\varepsilon^{-1})\rceil$ bits, for some constant $c>0$ , satisfying

[TABLE]

Defining

[TABLE]

it follows that

[TABLE]

The proof is concluded by noting that $\mathrm{\mathbf{Learn}}$ violates Proposition 3.6. ∎
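The choice $M_{\varepsilon}:=\lceil(\varepsilon/(3C))^{-1/\gamma}\rceil$ in the proof above is made so that $CM_{\varepsilon}^{-\gamma}\leq\varepsilon/3$ while $M_{\varepsilon}$ grows no faster than $\varepsilon^{-1/\gamma}$ . The following minimal numerical check illustrates this for one set of values. The function name `edges_for_accuracy` and the values of $\varepsilon$ , $C$ , and $\gamma$ are assumptions made for the example.

```python
from math import ceil

def edges_for_accuracy(eps, C, gamma):
    """M_eps = ceil((eps / (3C))^(-1/gamma)), the edge budget used in the proof."""
    return ceil((eps / (3.0 * C)) ** (-1.0 / gamma))

eps, C, gamma = 0.1, 1.0, 1.5
M = edges_for_accuracy(eps, C, gamma)

# The error budget C * M^(-gamma) stays below eps / 3 ...
assert C * M ** (-gamma) <= eps / 3
# ... and M exceeds (3C / eps)^(1/gamma) by less than one.
assert M < (3.0 * C / eps) ** (1.0 / gamma) + 1
print(M)
```

Both properties follow directly from the ceiling: rounding up can only decrease $CM^{-\gamma}$ , and it adds less than one edge.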

We can now proceed to the proof of Theorem 3.4.

Proof of Theorem 3.4.

Suppose towards a contradiction that $\gamma_{\mathcal{N}\mathcal{N}}^{\ast,\text{eff}}(\mathcal{C},\rho)>{\gamma^{\ast}(\mathcal{C})}$ . Let $\gamma\in\left(\gamma^{\ast}(\mathcal{C}),\gamma_{\mathcal{N}\mathcal{N}}^{\ast,\text{eff}}(\mathcal{C},\rho)\right)$ . Then, Definition 2.3 implies that there exist a polynomial $\pi,L\in\mathbb{N}$ , and $C>0$ such that

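$$\sup_{f\in\mathcal{C}}\inf_{\Phi_{M}\in{\mathcal{N}\mathcal{N}}_{L,M,d,\rho}^{\pi}}\|f-\Phi_{M}\|_{L^{2}(\Omega)}\leq CM^{-\gamma},\quad\text{for all }M\in\mathbb{N}.$$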

This, however, constitutes a contradiction to Proposition 3.11. ∎

We conclude this section with a discussion of the conceptual implications of the results established above. Proposition 3.6 combined with Lemma 3.7 establishes that neural networks with weights polynomially bounded in $\varepsilon^{-1}$ and achieving uniform approximation error $\varepsilon$ over $\mathcal{C}$ cannot exhibit edge growth rate smaller than $\mathcal{O}(\varepsilon^{-1/\gamma^{*}(\mathcal{C})}),\varepsilon\rightarrow 0$ ; in other words, a decay of the uniform approximation error, as a function of $M$ , faster than $\mathcal{O}(M^{-\gamma^{\ast}(\mathcal{C})}),M\rightarrow\infty$ , is not possible.

Note that requiring uniform approximation error $\varepsilon$ only (without imposing the constraint of the network’s weights being polynomially bounded in $\varepsilon^{-1}$ ) can lead to arbitrarily large rate $\gamma$ as exemplified by Theorem 2.2, which proves the existence of networks realizing an arbitrarily small approximation error over $L^{2}([0,1]^{d})$ with a finite number of nodes; in particular, the number of nodes remains constant as $\varepsilon\rightarrow 0$ . However, as argued right after Theorem 2.2, these networks necessarily lead to weights that are not polynomially bounded in $\varepsilon^{-1}$ .

4 Transitioning from Representation Systems to Neural Networks

The remainder of this paper is devoted to identifying function classes that are optimally representable—according to Definition 3.5—by neural networks. The mathematical technique we develop in the process is interesting in its own right as it constitutes a general framework for transferring results on function approximation through representation systems to results on approximation by neural networks. In particular, we prove that for a given function class $\mathcal{C}$ and an associated representation system $\mathcal{D}$ which satisfies certain technical conditions, there exists a neural network with $\mathcal{O}(M)$ nonzero edge weights that achieves (up to a multiplicative constant) the same uniform error over $\mathcal{C}$ as a best $M$ -term approximation in $\mathcal{D}$ . This will finally lead to a characterization of function classes $\mathcal{C}$ that are optimally representable by neural networks in the sense of Definition 3.5.

We start by stating technical conditions on representation systems for the transference principle outlined above to apply.

Definition 4.1.

Let $d\in\mathbb{N}$ , $\Omega\subset\mathbb{R}^{d}$ , $\rho:\mathbb{R}\to\mathbb{R}$ , and $\mathcal{D}=(\varphi_{i})_{i\in I}\subset L^{2}(\Omega)$ be a representation system. Then, $\mathcal{D}$ is said to be representable by neural networks (with activation function $\rho$ ), if there exist $L,R\in\mathbb{N}$ such that for all $\eta>0$ and every $i\in I$ , there is a neural network $\Phi_{i,\eta}\in\mathcal{N}\mathcal{N}_{L,R,d,\rho}$ with

$$\|\varphi_{i}-\Phi_{i,\eta}\|_{L^{2}(\Omega)}\leq\eta.$$

If, in addition, the weights of $\Phi_{i,\eta}\in\mathcal{N}\mathcal{N}_{L,R,d,\rho}$ are polynomially bounded in $i,\eta^{-1}$ , and if $\rho$ is either Lipschitz-continuous or differentiable such that $\rho^{\prime}$ is dominated by an arbitrary polynomial, then we say that $\mathcal{D}$ is effectively representable by neural networks (with activation function $\rho$ ).

The next result formalizes our transference principle for networks with weights in $\mathbb{R}$ .

Theorem 4.2.

Let $d\in\mathbb{N}$ , $\Omega\subset\mathbb{R}^{d}$ , and $\rho:\mathbb{R}\to\mathbb{R}$ . Suppose that $\mathcal{D}=(\varphi_{i})_{i\in I}\subset L^{2}(\Omega)$ is representable by neural networks. Let $f\in L^{2}(\Omega)$ and, for $M\in\mathbb{N}$ , let $f_{M}=\sum_{i\in I_{M}}c_{i}\varphi_{i}$ , $I_{M}\subset I$ , $\#I_{M}=M$ , satisfy

$$\|f-f_{M}\|_{L^{2}(\Omega)}\leq\varepsilon,$$

where $\varepsilon\in(0,1/2)$ . Then, there exist $L\in\mathbb{N}$ (depending on $\mathcal{D}$ only) and a neural network $\Phi(f,M)\in\mathcal{N}\mathcal{N}_{L,M^{\prime},d,\rho}$ with $M^{\prime}\in\mathcal{O}(M)$ , satisfying

$$\|f-\Phi(f,M)\|_{L^{2}(\Omega)}\leq 2\varepsilon.\tag{25}$$

In particular, for all function classes $\mathcal{C}\subset L^{2}(\Omega)$ it holds that

[TABLE]

Proof.

By representability of $\mathcal{D}$ according to Definition 4.1, it follows that there exist $L,R\in\mathbb{N}$ , such that for each $i\in I_{M}$ and for $\eta:=\varepsilon/\max\{1,\sum_{i\in I_{M}}|c_{i}|\}$ , there exists a neural network $\Phi_{i,\eta}\in\mathcal{N}\mathcal{N}_{L,R,d,\rho}$ with

$$\|\varphi_{i}-\Phi_{i,\eta}\|_{L^{2}(\Omega)}\leq\eta.\tag{27}$$

Let then $\Phi(f,M)$ be the neural network consisting of the networks $(\Phi_{i,\eta})_{i\in I_{M}}$ operating in parallel, all with the same input, and summing their one-dimensional outputs (see Figure 3 below for an illustration) with weights $(c_{i})_{i\in I_{M}}$ according to

$$\Phi(f,M):=\sum_{i\in I_{M}}c_{i}\Phi_{i,\eta}.\tag{28}$$

This construction is legitimate as all networks $\Phi_{i,\eta}$ have the same number of layers and the last layer of a neural network according to Definition 1.1 implements an affine function only (without subsequent application of the activation function $\rho$ ). Then, $\Phi(f,M)\in\mathcal{N}\mathcal{N}_{L,RM,d,\rho}$ , and application of the triangle inequality together with (27) yields $\left\|f_{M}-\Phi(f,M)\right\|_{L^{2}(\Omega)}\leq\varepsilon$ . Another application of the triangle inequality according to

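$$\|f-\Phi(f,M)\|_{L^{2}(\Omega)}\leq\|f-f_{M}\|_{L^{2}(\Omega)}+\|f_{M}-\Phi(f,M)\|_{L^{2}(\Omega)}\leq 2\varepsilon$$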

finalizes the proof of (25) which by Definitions 1.2 and 1.3 implies (26). ∎
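The parallel construction used in this proof can be sketched concretely. In the code below a network is a list of `(W, b)` pairs with the activation applied after every layer except the last. The ReLU activation, the two toy subnetworks, and the function names `forward`, `num_edges`, and `parallelize` are assumptions made for illustration, whereas the proof itself works for general $\rho$ .

```python
def forward(layers, x):
    """Evaluate a network given as (W, b) pairs; ReLU after all but the last layer."""
    for idx, (W, b) in enumerate(layers):
        x = [sum(w * t for w, t in zip(row, x)) + bias for row, bias in zip(W, b)]
        if idx < len(layers) - 1:
            x = [max(0.0, t) for t in x]
    return x

def num_edges(layers):
    """Number of nonzero edge weights."""
    return sum(1 for W, _ in layers for row in W for w in row if w != 0.0)

def parallelize(nets, coeffs):
    """Network computing sum_i coeffs[i] * nets[i](x); equal depth >= 2, scalar outputs."""
    depth = len(nets[0])
    assert depth >= 2 and all(len(net) == depth for net in nets)
    combined = []
    for l in range(depth):
        blocks = [net[l] for net in nets]
        if l == 0:
            # All subnetworks read the same input: stack their first layers.
            W = [list(row) for Wi, _ in blocks for row in Wi]
            b = [bias for _, bi in blocks for bias in bi]
        elif l < depth - 1:
            # Hidden layers act on disjoint groups of nodes: block-diagonal weights.
            total = sum(len(Wi[0]) for Wi, _ in blocks)
            W, b, offset = [], [], 0
            for Wi, bi in blocks:
                cols = len(Wi[0])
                for row, bias in zip(Wi, bi):
                    W.append([0.0] * offset + list(row) + [0.0] * (total - offset - cols))
                    b.append(bias)
                offset += cols
        else:
            # The last layer is affine only: one output row, weights scaled by c_i.
            W = [[c * w for c, (Wi, _) in zip(coeffs, blocks) for w in Wi[0]]]
            b = [sum(c * bi[0] for c, (_, bi) in zip(coeffs, blocks))]
        combined.append((W, b))
    return combined

# Two toy subnetworks with one-dimensional input and two layers each.
net1 = [([[1.0], [1.0]], [0.0, -1.0]), ([[1.0, -2.0]], [0.0])]
net2 = [([[1.0]], [-0.5]), ([[3.0]], [1.0])]
phi = parallelize([net1, net2], [2.0, -1.0])
x = [0.75]
assert abs(forward(phi, x)[0] - (2.0 * forward(net1, x)[0] - forward(net2, x)[0])) < 1e-12
assert num_edges(phi) == num_edges(net1) + num_edges(net2)
```

The coefficients $c_{i}$ are absorbed into the last affine layer, so the combined network has the same depth as its parts and no more nonzero edge weights than the parts taken together, which is the source of the bound $RM$ on the connectivity.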

Theorem 4.2 shows that we can restrict ourselves to the approximation of the individual elements of a representation system by neural networks with the only constraint being that the number of nonzero edge weights in the individual networks must admit a uniform upper bound. Theorem 4.2 does, however, not guarantee that the weights of the network $\Phi(f,M)$ can be represented with no more than $\lceil c\log_{2}(\varepsilon^{-1})\rceil$ bits when the overall approximation error is proportional to $\varepsilon$ . This will again be accomplished through a transfer argument, applied to representation systems $\mathcal{D}$ satisfying slightly more stringent technical conditions.

Theorem 4.3.

Let $d\in\mathbb{N}$ , $\Omega\subset\mathbb{R}^{d}$ be bounded, and $\mathcal{C}\subset L^{2}(\Omega)$ . Suppose that the representation system $\mathcal{D}=(\varphi_{i})_{i\in\mathbb{N}}\subset L^{2}(\Omega)$ is effectively representable by neural networks. Then, for all $\gamma<\gamma^{\ast,\text{eff}}(\mathcal{C},\mathcal{D})$ , there exist a polynomial $\pi$ , constants $c>0,L\in\mathbb{N}$ , and a map

[TABLE]

such that for every $f\in\mathcal{C}$ the weights in $\mathrm{\mathbf{Learn}}(\varepsilon,f)$ can be represented by no more than $\lceil c\log_{2}(\varepsilon^{-1})\rceil$ bits while $\|f-\mathrm{\mathbf{Learn}}(\varepsilon,f)\|_{L^{2}(\Omega)}\leq\varepsilon$ and $\mathcal{M}(\mathrm{\mathbf{Learn}}(\varepsilon,f))\in\mathcal{O}(\varepsilon^{-1/\gamma}),\varepsilon\rightarrow 0$ .

Remark 4.4.

Theorem 4.3 implies that if $\mathcal{D}$ optimally represents the function class $\mathcal{C}$ in the sense of Definition 3.3 and at the same time is effectively representable by neural networks, then $\mathcal{C}$ is optimally representable by neural networks in the sense of Definition 3.5.

Proof of Theorem 4.3.

Let $M\in\mathbb{N}$ and $\gamma<\gamma^{\ast,\text{eff}}(\mathcal{C},\mathcal{D})$ . According to Definition 2.1, there exist constants $C,D>0$ and a polynomial $\pi$ such that for every $f\in\mathcal{C}$ , there is a subset $I_{M}\subset\{1,\dots,\pi(M)\}$ , and coefficients $(c_{i})_{i\in I_{M}}$ with $\max_{i\in I_{M}}\!|c_{i}|\leq D$ so that

[TABLE]

We only need to consider the case $\delta_{M}\leq 1/2$ as will become clear below. By effective representability according to Definition 4.1, there are $L,R\in\mathbb{N}$ such that for each $i\in I_{M}$ and with $\eta:=\delta_{M}/\max\{1,4\sum_{i\in I_{M}}|c_{i}|\}$ , there exists a neural network $\Phi_{i,\eta}\in\mathcal{N}\mathcal{N}_{L,R,d,\rho}$ (with $\rho$ either Lipschitz-continuous or differentiable such that $\rho^{\prime}$ is dominated by an arbitrary polynomial) satisfying

$$\|\varphi_{i}-\Phi_{i,\eta}\|_{L^{2}(\Omega)}\leq\eta.$$

In addition, the weights of $\Phi_{i,\eta}$ are polynomially bounded in $i,\eta^{-1}$ . Let then $\Phi(f,M)\in\mathcal{N}\mathcal{N}_{L,RM,d,\rho}$ be the neural network consisting of the networks $(\Phi_{i,\eta})_{i\in I_{M}}$ operating in parallel, according to (28). We conclude that

[TABLE]

As the weights of the networks $\Phi_{i,\eta}$ are polynomially bounded in $i,\eta^{-1}$ and $i\leq\pi(M),\delta_{M}\sim M^{-\gamma}$ , it follows that the weights of $\Phi(f,M)$ are polynomially bounded in $\delta_{M}^{-1}$ . By Lemma 3.7 and Remark 3.8, there hence exists a network $\widetilde{\Phi}(f,M)$ such that

[TABLE]

and all weights of $\widetilde{\Phi}(f,M)$ can be represented with no more than $\lceil c\log_{2}(\delta_{M}^{-1})\rceil$ bits, for some $c>0$ . Moreover, we have

[TABLE]

For $\varepsilon\in(0,1/2)$ , we now set

[TABLE]

where

[TABLE]

With this choice of $M_{\varepsilon}$ , we have $CM_{\varepsilon}^{-\gamma}\leq\varepsilon$ , which, when used in (30), yields

[TABLE]

Since, by construction, $\mathrm{\mathbf{Learn}}(\varepsilon,f)$ has $RM_{\varepsilon}$ edges and $M_{\varepsilon}\leq C^{1/\gamma}\varepsilon^{-1/\gamma}+1\leq 2C^{1/{\gamma}}\varepsilon^{-1/{\gamma}}$ , it follows that $\mathrm{\mathbf{Learn}}(\varepsilon,f)$ has at most $2RC^{1/{\gamma}}\varepsilon^{-1/{\gamma}}$ edges. Moreover, as all weights of $\mathrm{\mathbf{Learn}}(\varepsilon,f)$ can be represented by no more than $\lceil c\log_{2}(\delta^{-1}_{M_{\varepsilon}})\rceil$ bits, it follows from $\delta_{M_{\varepsilon}}\sim M_{\varepsilon}^{-\gamma}\sim\varepsilon$ that they can be represented by no more than $\lceil c^{\prime}\log_{2}(\varepsilon^{-1})\rceil$ bits, for some $c^{\prime}>0$ . This concludes the proof. ∎

5 All Affine Representation Systems are Effectively Representable by Neural Networks

This section shows that a large class of representation systems, namely affine systems, defined below, is effectively representable by neural networks. Affine systems include as special cases wavelets, ridgelets, curvelets, shearlets, $\alpha$ -shearlets, and more generally $\alpha$ -molecules. Combined with Theorem 4.3, the results in this section establish that any function class that is optimally represented by an arbitrary affine system is optimally represented by neural networks in the sense of Definition 3.5.

Clearly, such strong statements are possible only under restrictions on the choice of the activation function for the approximating neural networks.

5.1 Choice of Activation Function

We consider two classes of activation functions, namely sigmoidal functions and smooth approximations of rectified linear units. We start with the formal definition of sigmoidal activation functions as considered in [10, 40, 42, 6].

Definition 5.1.

A continuous function $\rho:\mathbb{R}\to\mathbb{R}$ is called a sigmoidal function of order $k\in\mathbb{N}$ , $k\geq 2$ , if there exists $C>0$ such that

[TABLE]

A differentiable function $\rho$ is called strongly sigmoidal of order $k$ , if there exist constants $a,b,C>0$ such that

[TABLE]

One of the most widely used activation functions is the so-called rectified linear unit (ReLU) given by $x\mapsto\max\{0,x\}$ . The second class of activation functions we consider here are smooth versions of the ReLU.

Definition 5.2.

Let $\rho:\mathbb{R}\to\mathbb{R}^{+}$ , $\rho\in C^{\infty}(\mathbb{R})$ satisfy

[TABLE]

for some constant $K>0$ . Then, we call $\rho$ an admissible smooth activation function.
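As a hedged illustration, the following sketch builds one smooth function that vanishes on $(-\infty,0]$ and coincides with the ReLU on $[K,\infty)$, which is the behavior the paper attributes to admissible smooth activation functions in its numerical section. The displayed condition of Definition 5.2 is not reproduced here, so any further requirements it may impose are not checked; the names `smooth_step` and `rho` are illustrative.

```python
import math

def _h(t):
    # Classical C-infinity function: exp(-1/t) for t > 0, zero otherwise.
    return math.exp(-1.0 / t) if t > 0 else 0.0

def smooth_step(t):
    # C-infinity transition: 0 for t <= 0, 1 for t >= 1.
    # The denominator is positive because t and 1 - t are never both <= 0.
    return _h(t) / (_h(t) + _h(1.0 - t))

def rho(x, K=1.0):
    # Smooth, nonnegative, equal to 0 for x <= 0 and to x for x >= K.
    return x * smooth_step(x / K)

print(rho(-1.0), rho(0.5), rho(2.0))
```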

The reason for considering these two specific classes of activation functions resides in the fact that neural networks based thereon allow economical representations of multivariate bump functions, which, in turn, leads to effective representation of all affine systems (built from bump functions) by neural networks. Approximation of multivariate bump functions using sparsely connected neural networks is a classical topic in neural network theory [35]. What is new here is the aspect of quantized weights and rate-distortion optimality.

A class of bump functions of particular importance in wavelet theory are $B$ -splines. In [6] it was shown that $B$ -splines can be parsimoniously approximated by neural networks with sigmoidal activation functions. It is instructive to recall this result. To this end, for $m\in\mathbb{N}$ , we denote the univariate cardinal $B$ -spline of order $m$ by $N_{m}$ , i.e., $N_{1}=\chi_{[0,1]}$ , where $\chi_{[0,1]}$ denotes the characteristic function of the interval ${[0,1]}$ , and $N_{m+1}=N_{m}*\chi_{[0,1]}$ , for all $m\geq 1$ . Multivariate $B$ -splines are simply tensor products of univariate $B$ -splines. Specifically, we denote, for $d\in\mathbb{N}$ , the $d$ -dimensional cardinal $B$ -spline of order $m$ by $N_{m}^{d}$ .
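For orientation, the cardinal $B$ -splines defined above can be evaluated with the standard two-term recurrence $N_{m}(x)=\frac{x}{m-1}N_{m-1}(x)+\frac{m-x}{m-1}N_{m-1}(x-1)$ , which agrees with the convolution definition. The sketch below is a minimal illustration, not part of the paper; it uses the half-open interval $[0,1)$ for $N_{1}$ (a change on a set of measure zero) and plain recursion, so it is meant for small orders only.

```python
def cardinal_bspline(m, x):
    """Univariate cardinal B-spline N_m(x), supported on [0, m]."""
    if m == 1:
        # Indicator of [0, 1); differs from chi_[0,1] only at x = 1.
        return 1.0 if 0.0 <= x < 1.0 else 0.0
    left = x * cardinal_bspline(m - 1, x)
    right = (m - x) * cardinal_bspline(m - 1, x - 1.0)
    return (left + right) / (m - 1)

def tensor_bspline(m, point):
    """d-dimensional cardinal B-spline N_m^d as a tensor product."""
    value = 1.0
    for coordinate in point:
        value *= cardinal_bspline(m, coordinate)
    return value

print(cardinal_bspline(2, 0.5), cardinal_bspline(3, 1.5))
print(tensor_bspline(2, (0.5, 1.5)))
```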

Theorem 5.3 ([6], Thm. 4.2).

Let $d,m,k\in\mathbb{N}$ , and take $\rho$ to be a sigmoidal function of order $k\geq 2$ . Further, let $L:=\lceil\log_{2}(md-d)/\log_{2}(k)\rceil+1$ . Then, there is $M\in\mathbb{N}$ , possibly dependent on $d,m,k$ , such that for all $D,\varepsilon>0$ , there exists a neural network $\Phi_{D,\varepsilon}\in\mathcal{N}\mathcal{N}_{L,M,d,\rho}$ with

[TABLE]

Additionally, we will need to control the weights in the approximating networks $\Phi_{D,\varepsilon}$ . We next show that this is, indeed, possible for strongly sigmoidal activation functions.

Theorem 5.4.

Let $d,m,k\in\mathbb{N}$ , and $\rho$ strongly sigmoidal of order $k\geq 2$ . Further, let $L:=\lceil\log_{2}(md-d)/\log_{2}(k)\rceil+1$ . Then, there is $M\in\mathbb{N}$ , and a two-dimensional polynomial $\pi$ possibly dependent on $d,m,k$ , such that for all $D,\varepsilon>0$ , there exists a neural network $\Phi_{D,\varepsilon}\in\mathcal{N}\mathcal{N}_{L,M,d,\rho}$ with

[TABLE]

Moreover, the weights of $\Phi_{D,\varepsilon}$ are polynomially bounded in $D,\varepsilon^{-1}$ .

Proof.

The neural network $\Phi_{D,\varepsilon}$ in Theorem 5.3 is explicitly constructed in [6]. Carefully following the steps in that construction and making explicit use of the strong sigmoidality of $\rho$ , as opposed to plain sigmoidality as in [6], yields the desired result. ∎

Remark 5.5.

We observe that the number of edges of the approximating network in Theorem 5.4 does not depend on the approximation error $\varepsilon$ .

While Theorem 5.3 demonstrates that a $B$ -spline of order $m$ can be approximated to arbitrary accuracy by a neural network based on a sigmoidal activation function and of depth depending on $m,d$ , and the order of sigmoidality of the activation function, we next establish that for admissible smooth activation functions, exact representation of a general class of bump functions is possible with a network of $3$ layers only. Before proceeding, we define for $f\in L^{1}(\mathbb{R}^{d})$ , $d\in\mathbb{N}$ , the Fourier transform of $f$ by

[TABLE]

Theorem 5.6.

Let $\rho$ be an admissible smooth activation function. Then, for all $d\in\mathbb{N}$ , there exist $M\in\mathbb{N}$ and a neural network $\Phi_{\rho}\in\mathcal{N}\mathcal{N}_{3,M,d,\rho}$ such that

(i)

$\Phi_{\rho}$ is compactly supported,

(ii)

$\Phi_{\rho}\in C^{\infty}(\mathbb{R})$ , and

(iii)

$\widehat{\Phi}_{\rho}(\xi)\neq 0$ , for all $\xi\in[-3,3]^{d}$ .

Proof.

We start by constructing an auxiliary function as follows. For $0<p_{1}\leq p_{2}\leq p_{3}$ such that $p_{1}+p_{2}=p_{3}$ , define $t:\mathbb{R}\to\mathbb{R}$ as

[TABLE]

Then, $t\in C^{\infty}$ is compactly supported. Letting $q=\|t\|_{L^{\infty}(\mathbb{R})}$ , we define $g:\mathbb{R}^{d}\to\mathbb{R}$ according to

[TABLE]

By construction, $g\in C^{\infty}$ is compactly supported. Moreover, $g$ can be realized through a three-layer neural network thanks to its two-step design per (33) and (34). Since $g\geq 0$ and $g\neq 0$ , it follows that $|\hat{g}(0)|>0$ . By continuity of $\hat{g}$ there exists a $\delta>0$ such that $|\hat{g}(\xi)|>0$ for all $\xi\in[-\delta,\delta]^{d}$ . We now set

[TABLE]

and note that $\varphi$ can be realized through a three-layer neural network $\Phi_{\rho}\in\mathcal{N}\mathcal{N}_{3,M,d,\rho}$ , for some $M\in\mathbb{N}$ . As $|\hat{\varphi}(\xi)|>0$ , for all $\xi\in[-3,3]^{d}$ , $\Phi_{\rho}$ satisfies the desired assumptions. ∎

5.2 Invariance to Affine Transformations

We next leverage Theorems 5.4 and 5.6 to demonstrate that a wide class of representation systems built through affine transformations of $B$ -splines and bump functions as constructed in Theorem 5.6 is effectively representable by neural networks. As a first step towards this general result, we show that representability—in the sense of Definition 4.1—of a single function $f$ by neural networks is invariant to the operation of taking finite linear combinations of affine transformations of $f$ .

Proposition 5.7.

Let $d\in\mathbb{N}$ , $\rho:\mathbb{R}\to\mathbb{R}$ , and $f\in L^{2}(\mathbb{R}^{d})$ . Assume that there exist $M,L\in\mathbb{N}$ such that for all $D,\varepsilon>0$ , there is $\Phi_{D,\varepsilon}\in\mathcal{N}\mathcal{N}_{L,M,d,\rho}$ with

$$\|f-\Phi_{D,\varepsilon}\|_{L^{2}([-D,D]^{d})}\leq\varepsilon.\tag{35}$$

Let $A\in\mathbb{R}^{d\times d}$ be full-rank and $b\in\mathbb{R}^{d}$ . Then, there exists $M^{\prime}\in\mathbb{N}$ , depending on $M$ and $d$ only, such that for all $E,\eta>0$ , there is $\Psi_{E,\eta}\in\mathcal{N}\mathcal{N}_{L,M^{\prime},d,\rho}$ with

[TABLE]

Moreover, if the weights of $\Phi_{D,\varepsilon}$ are polynomially bounded in $D,\varepsilon^{-1}$ , then the weights of $\Psi_{E,\eta}$ are polynomially bounded in $\|A\|_{\infty},E,\|b\|_{\infty},\eta^{-1}$ , where $\|A\|_{\infty}$ and $\|b\|_{\infty}$ denote the max-norm of $A$ and $b$ , respectively.

Proof.

By a change of variables, we have for every $\Phi\in\mathcal{N}\mathcal{N}_{L,M,d,\rho}$ that

[TABLE]

and there exists $M^{\prime}$ depending on $M$ and $d$ only such that $|\!\det(A)|^{1/2}\Phi(A\cdot-\,b)\in\mathcal{N}\mathcal{N}_{L,M^{\prime},d,\rho}$ . We furthermore have that

[TABLE]

We now set $F=dE\|A\|_{\infty}+\|b\|_{\infty}$ and $\Psi_{E,\eta}:=|\!\det(A)|^{1/2}\Phi_{F,\eta}(A\cdot-\,b)$ and observe that

[TABLE]

where we applied the same reasoning as in (36) in the first equality, (37) in the first inequality, and (35) in the second inequality. Moreover, we see that if the weights of $\Phi_{D,\varepsilon}$ are polynomially bounded in $D,\varepsilon^{-1}$ , then the weights of $\Psi_{E,\eta}$ are polynomially bounded in $\|A\|_{\infty},|\!\det(A)|,E,\|b\|_{\infty},\eta^{-1}$ . Since $|\!\det(A)|$ is polynomially bounded in $\|A\|_{\infty}$ , it follows that the weights of $\Psi_{E,\eta}$ are polynomially bounded in $\|A\|_{\infty},E,\|b\|_{\infty},\eta^{-1}$ . This yields the claim. ∎
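The choice $F=dE\|A\|_{\infty}+\|b\|_{\infty}$ in the proof guarantees that the affine map $x\mapsto Ax-b$ sends the cube $[-E,E]^{d}$ into $[-F,F]^{d}$ , so that an approximation of $f$ on the larger cube controls the error on the transformed domain. The sketch below checks this containment on the corners of the cube, where a linear functional attains its extrema; the matrix, the vector, and the function names are made-up illustrations.

```python
from itertools import product

def enclosing_radius(A, b, E):
    # F = d * E * ||A||_inf + ||b||_inf, with max-norms as in Proposition 5.7.
    d = len(A)
    a_max = max(abs(entry) for row in A for entry in row)
    b_max = max(abs(entry) for entry in b)
    return d * E * a_max + b_max

def affine_image(A, b, x):
    # Computes A x - b for a point x given as a sequence of coordinates.
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - b_i
            for row, b_i in zip(A, b)]

A = [[2.0, -1.0], [0.5, 3.0]]   # hypothetical full-rank matrix
b = [1.0, -4.0]                 # hypothetical shift
E = 1.5
F = enclosing_radius(A, b, E)
worst = max(max(abs(v) for v in affine_image(A, b, corner))
            for corner in product((-E, E), repeat=2))
print(F, worst)  # worst never exceeds F
```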

Next, we show that representability by neural networks is preserved under finite linear combinations of translates.

Proposition 5.8.

Let $d\in\mathbb{N}$ , $\rho:\mathbb{R}\to\mathbb{R}$ , and $f\in L^{2}(\mathbb{R}^{d})$ . Assume that there exist $M,L\in\mathbb{N}$ such that for all $D,\varepsilon>0$ , there is $\Phi_{D,\varepsilon}\in\mathcal{N}\mathcal{N}_{L,M,d,\rho}$ with

$$\|f-\Phi_{D,\varepsilon}\|_{L^{2}([-D,D]^{d})}\leq\varepsilon.$$

Let $r\in\mathbb{N}$ , $(c_{i})_{i=1}^{r}\subset\mathbb{R}$ , and $(d_{i})_{i=1}^{r}\subset\mathbb{R}^{d}$ . Then, there exists $M^{\prime}\in\mathbb{N}$ , depending on $M,d$ , and $r$ only, such that for all $E,\eta>0$ , there is $\Psi_{E,\eta}\in\mathcal{N}\mathcal{N}_{L,M^{\prime},d,\rho}$ with

[TABLE]

Moreover, if the weights of $\Phi_{D,\varepsilon}$ are polynomially bounded in $D,\varepsilon^{-1}$ , then the weights of $\Psi_{E,\eta}$ are polynomially bounded in

[TABLE]

Proof.

Let $E,\eta>0$ . We start by noting that, for all $D,\varepsilon>0$ ,

[TABLE]

where $d^{*}=\max_{i=1,\dots,r}\|d_{i}\|_{\infty}$ . Setting $D=E+d^{*}$ and $\varepsilon=\eta/\max\{1,\sum_{i=1}^{r}|c_{i}|\}$ , and noting that for every $\Phi\in\mathcal{N}\mathcal{N}_{L,M,d,\rho}$ , the function

[TABLE]

is in $\mathcal{N}\mathcal{N}_{L,M^{\prime},d,\rho}$ with $M^{\prime}\in\mathbb{N}$ depending on $d,r$ , and $M$ only, it follows that the network

[TABLE]

satisfies (39). Finally, if the weights of $\Phi_{D,\varepsilon}$ are polynomially bounded in $D,\varepsilon^{-1}$ , then the weights of $\Psi_{E,\eta}$ are polynomially bounded in $\sum_{i=1}^{r}|c_{i}|,E,d^{*},\eta^{-1}$ .

∎

Based on the invariance results in Propositions 5.7 and 5.8, we now construct neural networks which approximate functions with a given number of vanishing moments with arbitrary accuracy. The resulting construction will be crucial in establishing representability of affine representation systems (see Definition 5.11) by neural networks.

Definition 5.9.

Let $R,d\in\mathbb{N}$ , and $k\in\{1,\dots,d\}$ . A function $g\in C(\mathbb{R}^{d})$ is said to possess $R$ directional vanishing moments in $x_{k}$ -direction, if

[TABLE]

The next result establishes that functions with an arbitrary number of vanishing moments in a given coordinate direction can be built from suitable linear combinations of translates of a given continuous function with compact support.

Lemma 5.10.

Let $R,d\in\mathbb{N}$ , $B>0$ , $k\in\{1,\dots,d\}$ , and $f\in C(\mathbb{R}^{d})$ with compact support. Then, the function

[TABLE]

has $R$ directional vanishing moments in $x_{k}$ -direction. Moreover, if $\hat{f}(\xi)\neq 0$ for all $\xi\in[-B,B]^{d}\setminus\{0\}$ , then

[TABLE]

Proof.

For simplicity of exposition, we consider the case $B=1$ only. Taking the Fourier transform of (40) yields

[TABLE]

which implies

[TABLE]

But by Definition 5.9, this says precisely that $g$ possesses the desired vanishing moments. Statement (41) follows by inspection of (42). ∎
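The mechanism behind Lemma 5.10 can be checked numerically in one dimension. The sketch below does not reproduce (40); it assumes a construction of the same type, namely an $n$ -th order finite difference of translates of a compactly supported continuous function, and uses the known fact that such a difference annihilates the moments of orders $0,\dots,n-1$ . The hat function, the step size, and all names are illustrative choices.

```python
import math

def hat(x):
    # Continuous, compactly supported test function on [0, 2].
    return max(0.0, 1.0 - abs(x - 1.0))

def finite_difference(f, n, h):
    # g(x) = sum_{l=0}^{n} (-1)^l * binom(n, l) * f(x - l*h)
    coeffs = [(-1) ** l * math.comb(n, l) for l in range(n + 1)]
    def g(x):
        return sum(c * f(x - l * h) for l, c in enumerate(coeffs))
    return g

def moment(g, order, a, b, steps):
    # Riemann-sum approximation of the integral of x**order * g(x) over [a, b].
    dx = (b - a) / steps
    return sum((a + j * dx) ** order * g(a + j * dx) for j in range(steps + 1)) * dx

g = finite_difference(hat, n=3, h=1.0)   # supported on [0, 5]
for order in range(4):
    print(order, moment(g, order, 0.0, 5.0, 5000))
```

The first three printed moments vanish up to rounding error, while the moment of order $3$ does not.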

5.3 Affine Representation Systems

We are now ready to introduce the general family of representation systems announced earlier in the paper as affine systems. This class includes all representation systems based on affine transformations of a given “mother function”. Special cases of affine systems are wavelets, ridgelets, curvelets, shearlets, $\alpha$ -shearlets, and more generally $\alpha$ -molecules, as well as tensor products thereof. The formal definition of affine systems is as follows.

Definition 5.11.

Let $d,r,S\in\mathbb{N}$ , $\Omega\subset\mathbb{R}^{d}$ be bounded, and $f\in L^{2}(\mathbb{R}^{d})$ compactly supported. Let $\delta>0$ , $(c_{i}^{s})_{i=1}^{r}\subset\mathbb{R}$ , for $s=1,\dots,S$ , and $(d_{i})_{i=1}^{r}\subset\mathbb{R}^{d}$ . Further, let $A_{j}\in\mathbb{R}^{d\times d},j\in\mathbb{N}$ , be full-rank, with the absolute values of the eigenvalues of $A_{j}$ bounded below by $1$ . Consider the compactly supported functions

[TABLE]

We define the affine system $\mathcal{D}\subset L^{2}(\Omega)$ corresponding to $(g_{s})_{s=1}^{S}$ according to

[TABLE]

and refer to $f$ as the generator function of $\mathcal{D}$ .

We define the sub-systems $\mathcal{D}_{s,j}:=\{g_{s}^{j,b}\in\mathcal{D}:\ b\in\mathbb{Z}^{d}\}$ . Since every $g_{s}$ , $s=1,\dots,S,$ has compact support, $|\mathcal{D}_{s,j}|$ is finite for all $s=1,\dots,S$ and $j\in\mathbb{N}$ . Indeed, we observe that there exists $c_{\textrm{b}}:=c_{\textrm{b}}((g_{s})_{s=1}^{S},\delta,d)>0$ such that for all $s\in\{1,\dots,S\}$ , $j\in\mathbb{Z}$ , and $b\in\mathbb{Z}^{d}$ ,

[TABLE]

As the $\mathcal{D}_{s,j}$ are finite, we can organize the representation system $\mathcal{D}$ according to

[TABLE]

where the elements within each sub-system $\mathcal{D}_{s,j}$ may be ordered arbitrarily. This ordering of $\mathcal{D}$ is assumed in the remainder of the paper and will be referred to as canonical ordering.

Moreover, we note that if there exists $s_{o}\in\{1,\dots,S\}$ such that $g_{s_{o}}$ is nonzero, then there is a constant $c_{\textrm{o}}:=c_{\textrm{o}}((g_{s})_{s=1}^{S},\delta,d)>0$ such that

[TABLE]

The next result establishes that all affine systems whose generator functions can be approximated to within arbitrary accuracy by neural networks are (effectively) representable by neural networks.

Theorem 5.12.

Let $d\in\mathbb{N}$ , $\rho:\mathbb{R}\to\mathbb{R}$ , $\Omega\subset\mathbb{R}^{d}$ be bounded, and $\mathcal{D}=(\varphi_{i})_{i\in\mathbb{N}}\subset L^{2}(\Omega)$ an affine system with generator function $f$ . Suppose that there exist constants $L,R\in\mathbb{N}$ such that for all $D,\varepsilon>0$ , there is $\Phi_{D,\varepsilon}\in\mathcal{N}\mathcal{N}_{L,R,d,\rho}$ with

$$\|f-\Phi_{D,\varepsilon}\|_{L^{2}([-D,D]^{d})}\leq\varepsilon.\tag{46}$$

Then, $\mathcal{D}$ is representable by neural networks with activation function $\rho$ . If, in addition, the weights of $\Phi_{D,\varepsilon}$ are polynomially bounded in $D,\varepsilon^{-1}$ , and if there exist $a>0$ and $c>0$ such that

[TABLE]

then $\mathcal{D}$ is effectively representable by neural networks with activation function $\rho$ .

Proof.

Let $(g_{s})_{s=1}^{S}$ be as in Definition 5.11. If $g_{s}=0$ for all $s\in\{1,\dots,S\}$ , then $\mathcal{D}={{\varnothing}}$ and the result is trivial. Hence, we can assume that there exists at least one $s\in\{1,\dots,S\}$ such that $g_{s}\neq 0$ , implying that (45) holds.

Pick $D$ such that $\Omega\subset[-D,D]^{d}$ . We first show that (46) implies representability of $\mathcal{D}$ by neural networks with activation function $\rho$ . To this end, we need to establish the existence of constants $L,R\in\mathbb{N}$ such that for all $i\in\mathbb{N}$ and all $\eta>0$ , there exist $\Phi_{i,\eta}\in\mathcal{N}\mathcal{N}_{L,R,d,\rho}$ with

[TABLE]

The elements of $\mathcal{D}$ consist of dilations and translations of $f$ according to

[TABLE]

for some $r\in\mathbb{N}$ independent of $i$ , and $s_{i}\in\{1,\dots,S\}$ , $j_{i}\in\mathbb{N}$ , and $b_{i}\in\mathbb{Z}^{d}$ . Thus (48) follows directly by Propositions 5.7 and 5.8.

It remains to show that if the weights of $\Phi_{D,\varepsilon}$ in (46) are polynomially bounded in $D,\varepsilon^{-1}$ , then $\mathcal{D}$ is effectively representable by neural networks with activation function $\rho$ , which, by Definition 4.1, means that the weights of $\Phi_{i,\eta}$ are polynomially bounded in $i,\eta^{-1}$ . Propositions 5.7 and 5.8 state that the weights of $\Phi_{i,\eta}$ are polynomially bounded in

[TABLE]

Thanks to (43) we have $\|b_{i}\|_{\infty}\in\mathcal{O}(\|A_{j_{i}}\|_{\infty})$ . Moreover, the quantities $D$ , $\sum_{k=1}^{r}|c_{k}|$ , and $\max_{k=1,\dots,r}\|d_{k}\|_{\infty}$ do not depend on $i$ . We can thus conclude that the weights of $\Phi_{i,\eta}$ are polynomially bounded in

[TABLE]

To complete the proof, we need to show that the quantities $\|A_{j_{i}}\|_{\infty}$ are polynomially bounded in $i$ . To this end, we first observe that $\varphi_{i}$ according to (49) satisfies $\varphi_{i}\in\mathcal{D}_{s_{i},j_{i}}$ for some $s_{i}\in\{1,\dots,S\}$ . Thanks to (45) and the canonical ordering (44), there exists a constant $c>0$ such that

[TABLE]

We finally appeal to (47) to conclude that $\|A_{j_{i}}\|_{\infty}$ is polynomially bounded in $i$ , which, together with (50), establishes the desired result. ∎

We remark that condition (47) is very weak; in fact, we are not aware of an affine system in the literature that would violate it.

We now proceed to what is probably the central result of this paper, namely that neural networks provide optimal approximations for all function classes that are optimally approximated by any affine system with generator function that can be approximated to within arbitrary accuracy by neural networks.

Theorem 5.13.

Let $d\in\mathbb{N}$ , $\Omega\subset\mathbb{R}^{d}$ be bounded, $\rho:\mathbb{R}\to\mathbb{R}$ , and $\mathcal{D}=(\varphi_{i})_{i\in\mathbb{N}}\subset L^{2}(\Omega)$ an affine system with generator function $f$ . Assume that there exist $L,R\in\mathbb{N}$ such that for all $D,\varepsilon>0$ , there is $\Phi_{D,\varepsilon}\in\mathcal{N}\mathcal{N}_{L,R,d,\rho}$ satisfying $\|f-\Phi_{D,\varepsilon}\|_{L^{2}([-D,D]^{d})}\leq\varepsilon$ . Then, for all function classes $\mathcal{C}\subset L^{2}(\Omega)$ , we have

[TABLE]

If, in addition, there is a two-dimensional polynomial $\widetilde{\pi}$ such that the weights of $\Phi_{D,\varepsilon}$ are bounded by $|\widetilde{\pi}(D,\varepsilon^{-1})|$ , there exist $a>0$ and $c>0$ such that (47) holds, and $\mathcal{C}$ is optimally represented by $\mathcal{D}$ (according to Definition 3.3), then for all $\gamma<\gamma^{\ast}(\mathcal{C})$ , there exist a constant $c>0$ , a polynomial $\pi$ , and a map

[TABLE]

such that for every $f\in\mathcal{C}$ the weights in $\mathrm{\mathbf{Learn}}(\varepsilon,f)$ can be represented by no more than $\lceil c\log_{2}(\varepsilon^{-1})\rceil$ bits while $\|f-\mathrm{\mathbf{Learn}}(\varepsilon,f)\|_{L^{2}(\Omega)}\leq\varepsilon$ and $\mathcal{M}(\mathrm{\mathbf{Learn}}(\varepsilon,f))\in\mathcal{O}(\varepsilon^{-1/\gamma}),\varepsilon\to 0$ .

Proof.

The proof follows directly by combining Theorem 5.12 with Theorems 4.2 and 4.3. ∎

Theorem 5.13 reveals a remarkable universality and optimality property of neural networks: All function classes that can be optimally represented by an affine system with generator $f$ satisfying (46) are also optimally representable by neural networks.

6 $\alpha$ -Shearlets and Cartoon-Like Functions

We next present an explicit pair $(\mathcal{C},\mathcal{D})$ of function class and representation system satisfying $\gamma_{\mathcal{N}\mathcal{N}}^{\ast}(\mathcal{C},\rho)=\gamma^{\ast}(\mathcal{C},\mathcal{D})$ . Specifically, we take $\alpha$ -shearlets as representation system $\mathcal{D}\subset L^{2}(\mathbb{R}^{2})$ and $\alpha^{-1}$ -cartoon-like functions as function class $\mathcal{C}$ . Cartoon-like functions are piecewise smooth functions with only two pieces. These pieces are separated by a smooth interface. In a sense, they can be understood as a prototype of a two-dimensional classification function with two homogeneous areas corresponding to two classes. Understanding neural network approximation of this function class is hence relevant to classification tasks in machine learning. We point out that the definition of $\alpha$ -shearlets in this paper differs slightly from that in [25]. Concretely, relative to [25] our definition replaces $\alpha^{-1}$ by $\alpha$ so that $\alpha$ -shearlets are a special case of $\alpha$ -molecules, whereas in [25] $\alpha$ -shearlets are a special case of $\alpha^{-1}$ -molecules. We will need dilation and shearing matrices defined as

[TABLE]

This leads us to the following definition which is a slightly modified version of the corresponding definition in [46].

Definition 6.1 ([46]).

For $\delta\in\mathbb{R}^{+}$ , $\alpha\in[0,1]$ , and $f,g\in L^{2}(\mathbb{R}^{2})$ , the cone-adapted $\alpha$ -shearlet system $\mathcal{SH}_{\alpha}(f,g,\delta)$ generated by $f,g\in L^{2}(\mathbb{R}^{2})$ is defined as

[TABLE]

where

[TABLE]

Our interest in $\alpha$ -shearlets stems from the fact that they optimally represent $\alpha^{-1}$ -cartoon-like functions in the sense of Definition 3.3.

Definition 6.2.

Let $\beta\in[1,2)$ , and $\nu>0$ . Define

[TABLE]

where $f_{0},f_{1}\in C^{\beta}(\mathbb{R}^{2})$ , $\operatorname{supp\,}f_{0},\operatorname{supp\,}f_{1}\subset(0,1)^{2}$ , $B\subset[0,1]^{2}$ , $\partial B\in C^{\beta}$ , $\|f_{0}\|_{C^{\beta}},\|f_{1}\|_{C^{\beta}},\|\partial B\|_{C^{\beta}}<\nu$ , and $\chi_{B}$ denotes the characteristic function of $B$ . The elements of $\mathcal{E}^{\beta}(\mathbb{R}^{2};\nu)$ are called $\beta$ -cartoon-like functions.

This function class was originally introduced in [18] as a model class for functions governed by curvilinear discontinuities of prescribed regularity. In this sense, $\beta$ -cartoon-like functions provide a convenient model for images governed by edges or for the solutions of transport equations, which often exhibit curvilinear singularities.

The optimal exponent $\gamma^{\ast}(\mathcal{E}^{\beta}(\mathbb{R}^{2};\nu))$ was found in [18, 26]:

Theorem 6.3.

For $\beta\in[1,2]$ , and $\nu>0$ , we have

[TABLE]

Proof.

The proof of [18, Theorem 2] demonstrates that a general function class $\mathcal{C}$ has optimal exponent $\gamma^{*}(\mathcal{C})={(2-p)}/{2p}$ if $\mathcal{C}$ contains a copy of $\ell^{p}_{0}$ . The result now follows, since by [26], the function class $\mathcal{E}^{\beta}(\mathbb{R}^{2};\nu)$ does, indeed, contain a copy of $\ell^{p}_{0}$ for $p={2}/{(\beta+1)}$ . ∎
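For the reader's convenience, substituting $p=2/(\beta+1)$ into the exponent formula quoted in the proof can be written out as follows. This is only the arithmetic step and assumes nothing beyond the two facts cited above; the result is the value $\beta/2$ used in the discussion of approximation rates.

```latex
\gamma^{*}\bigl(\mathcal{E}^{\beta}(\mathbb{R}^{2};\nu)\bigr)
  = \frac{2-p}{2p}\bigg|_{p=\frac{2}{\beta+1}}
  = \frac{2-\frac{2}{\beta+1}}{\frac{4}{\beta+1}}
  = \frac{2(\beta+1)-2}{4}
  = \frac{\beta}{2}.
```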

Using Proposition 3.6, this result allows us to conclude that neural networks achieving uniform approximation error $\varepsilon$ over the class $\mathcal{C}$ of cartoon-like functions, with weights represented by no more than $\lceil c\log_{2}(\varepsilon^{-1})\rceil$ bits, for some constant $c>0$ , yield an effective best $M$ -edge approximation rate of at most $\beta/2$ . Theorem 6.8 below demonstrates achievability for $\beta=1/\alpha$ , with $\alpha\in[1/2,1]$ .

The following theorem states that $\alpha$ -shearlets yield optimal best $M$ -term approximation rates for $\alpha^{-1}$ -cartoon-like functions.

Theorem 6.4 ([46], Theorem 6.3 and Remark 6.4).

Let $\alpha\in[1/2,1]$ , $\nu>0$ , $f\in C^{12}(\mathbb{R}^{2})$ , $g\in C^{32}(\mathbb{R}^{2})$ , both compactly supported and such that

(i)

$\widehat{f}(\xi)\neq 0$ , for all $|\xi|\leq 1$ ,

(ii)

$\widehat{g}(\xi)\neq 0$ , for all $\xi=(\xi_{1},\xi_{2})^{T}\in\mathbb{R}^{2}$ such that $1/3\leq|\xi_{1}|\leq 3$ and $|\xi_{2}|\leq|\xi_{1}|$ ,

(iii)

$g$ has at least $7$ vanishing moments in $x_{1}$ -direction, i.e.,

[TABLE]

Then, there exists $\delta^{\ast}>0$ such that for all $\delta<\delta^{\ast}$ , the function class $\mathcal{E}^{1/\alpha}(\mathbb{R}^{2};\nu)$ is optimally represented by $\mathcal{SH}_{\alpha}(f,g,\delta)$ .

Remark 6.5.

The assumptions on the smoothness and the number of vanishing moments of $f$ and $g$ in Theorem 6.4 follow from [46, Eq. 4.9] with $s_{1}=3/2,s_{0}=0,p_{0}=q_{0}=2/3,$ and $|\beta|\leq 4$ . While these particular choices allow the statement of the theorem to be independent of $\alpha$ , it is possible to weaken the assumptions, if a fixed $\alpha$ is considered. For example, for $\alpha=1/2$ the smoothness assumptions on $f$ and $g$ reduce to $f\in C^{11},g\in C^{28}$ .

As our approximation results for neural networks pertain to bounded domains, we require a definition of cartoon-like functions on bounded domains.

Definition 6.6.

Let $(0,1)^{2}\subset\Omega\subset\mathbb{R}^{2}$ , $\alpha\in[1/2,1]$ , and $\nu>0$ . We define the set of $\alpha^{-1}$ -cartoon-like functions on $\Omega$ by

[TABLE]

Additionally, for $\delta>0$ , $f,g\in L^{2}(\mathbb{R}^{2})$ , we define an $\alpha$ -shearlet system on $\Omega$ according to

[TABLE]

Remark 6.7.

It is straightforward to check that if $\mathcal{E}^{1/\alpha}(\mathbb{R}^{2};\nu)$ is optimally represented by $\mathcal{SH}_{\alpha}(f,g,\delta)$ , then $\mathcal{E}^{1/\alpha}(\Omega;\nu)$ is optimally represented by $\mathcal{SH}_{\alpha}(f,g,\delta;\Omega)$ .

We proceed to the main statement of this section.

Theorem 6.8.

Suppose that $(0,1)^{2}\subset\Omega\subset\mathbb{R}^{2}$ is bounded and $\rho:\mathbb{R}\to\mathbb{R}$ is either strongly sigmoidal of order $k\geq 2$ (see Definition 5.1) or an admissible smooth activation function (see Definition 5.2). Then, for every $\alpha\in[1/2,1]$ , the function class $\mathcal{E}^{1/\alpha}(\Omega;\nu)$ is optimally representable by a neural network with activation function $\rho$ .

Proof.

Let $\alpha\in[1/2,1]$ and $\nu>0$ . We first consider the case of $\rho$ strongly sigmoidal of order $k\geq 2$ . Since the two-dimensional cardinal $B$ -spline of order $34$ , denoted by $N^{2}_{34}$ , is $32$ times continuously differentiable and $\widehat{N^{2}_{34}}(0)\neq 0$ by construction, we conclude that there exists $c>0$ such that $f:=N^{2}_{34}(c\cdot)$ satisfies $f\in C^{32}(\mathbb{R}^{2})$ and $\hat{f}(\xi)\neq 0$ for all $\xi\in[-3,3]^{2}$ . Application of Lemma 5.10 then yields the existence of $(c_{i})_{i=1}^{7}\subset\mathbb{R}$ , $(d_{i})_{i=1}^{7}\subset\mathbb{R}^{2}$ such that $g:=\sum_{i=1}^{7}c_{i}f(\cdot-d_{i})$ is compactly supported, has $7$ vanishing moments in $x_{1}$ -direction, and $\hat{g}(\xi)\neq 0$ for all $\xi\in[-3,3]^{2}$ such that $\xi_{1}\neq 0$ . Then, by Theorem 6.4 and Remark 6.7 there exists $\delta>0$ such that $\mathcal{SH}_{\alpha}(f,g,\delta;\Omega)$ is optimal for $\mathcal{E}^{1/\alpha}(\Omega;\nu)$ . We define

[TABLE]

where we order $(A_{j})_{j\in\mathbb{N}}$ such that $|\!\det(A_{j})|\leq|\!\det(A_{j+1})|$ , for all $j\in\mathbb{N}$ . This construction implies that the $\alpha$ -shearlet system $\mathcal{SH}_{\alpha}(f,g,\delta;\Omega)$ is an affine system with generator function $f$ . Thanks to Theorem 5.4, there exist $L,R\in\mathbb{N}$ such that for all $D,\varepsilon>0$ , there is a network $\Phi_{D,\varepsilon}\in\mathcal{N}\mathcal{N}_{L,R,d,\rho}$ with

[TABLE]

Moreover, the weights of $\Phi_{D,\varepsilon}$ are polynomially bounded in $D,\varepsilon^{-1}$ . It is not difficult to verify that (47) holds and hence Theorem 5.12 yields that $\mathcal{SH}_{\alpha}(f,g,\delta;\Omega)$ is effectively representable by neural networks with activation function $\rho$ . Finally, since $\mathcal{E}^{1/\alpha}(\Omega;\nu)$ is optimally representable by $\mathcal{SH}_{\alpha}(f,g,\delta;\Omega)$ , we conclude with Theorem 4.3 that $\mathcal{E}^{1/\alpha}(\Omega;\nu)$ is optimally representable by neural networks with activation function $\rho$ .

It remains to establish the statement for admissible smooth $\rho$ . In this case, by Theorem 5.6 there exist $M\in\mathbb{N}$ and a neural network in $\mathcal{N}\mathcal{N}_{3,M,d,\rho}$ which realizes a compactly supported $f\in C^{\infty}(\mathbb{R})$ satisfying $\hat{f}(\xi)\neq 0$ , for all $\xi\in[-3,3]^{2}$ . Lemma 5.10 applied to this $f$ then yields a function $g$ that can be realized by a neural network in $\mathcal{N}\mathcal{N}_{3,M^{\prime},d,\rho}$ , for some $M^{\prime}\in\mathbb{N}$ , has $7$ vanishing moments in $x_{1}$ -direction, is compactly supported, and satisfies $g\in C^{\infty}(\mathbb{R})$ , and $\hat{g}(\xi)\neq 0$ , for all $\xi\in[-3,3]^{2}$ such that $\xi_{1}\neq 0$ . By Theorem 6.4 and Remark 6.7, there exists $\delta>0$ such that $\mathcal{E}^{1/\alpha}(\Omega;\nu)$ is optimally representable by $\mathcal{SH}_{\alpha}(f,g,\delta;\Omega)$ . Note that $\mathcal{SH}_{\alpha}(f,g,\delta;\Omega)$ is an affine system with generator function $f$ . Since $f$ can be implemented with zero error by a neural network, Theorem 5.12 yields that $\mathcal{SH}_{\alpha}(f,g,\delta;\Omega)$ is effectively representable by neural networks with admissible smooth activation function $\rho$ . Optimality of $\mathcal{SH}_{\alpha}(f,g,\delta;\Omega)$ for $\mathcal{E}^{1/\alpha}(\Omega;\nu)$ implies, with Theorem 4.3, that $\mathcal{E}^{1/\alpha}(\Omega;\nu)$ is optimally representable by neural networks with admissible smooth activation function $\rho$ . ∎

Remark 6.9.

Theorem 6.4 requires the generators of the shearlet system guaranteeing optimal representability of $\mathcal{E}^{1/\alpha}(\Omega;\nu)$ , for $1/2\leq\alpha\leq 1,\nu>0$ , $\Omega\subset\mathbb{R}^{2}$ to be very smooth. On the other hand, Theorem 6.8 demonstrates that optimally-approximating neural networks are not required to be particularly smooth. Indeed, Theorem 6.8 holds for networks with differentiable but not necessarily twice differentiable activation functions. As the proof of Theorem 6.8 reveals, such weak assumptions suffice thanks to Theorem 5.4, which demonstrates that it is possible to approximate arbitrarily smooth $B$ -splines (in the $L^{2}$ -norm) to within error $\varepsilon$ by neural networks with a number of weights that does not depend on $\varepsilon$ as long as the activation function is strongly sigmoidal.

Remark 6.10.

We observe from the proof of Theorem 6.8 that the depth of the networks required to achieve optimal approximation depends on the activation function only. Indeed, for an admissible smooth activation function, inspection of Theorem 5.6 reveals that networks with three layers can produce optimal approximations in Theorem 6.8. On the other hand, if a sigmoidal activation function is employed, Theorem 5.4 shows that the construction in Theorem 6.8 requires a certain minimum depth depending on the order of sigmoidality.

7 Generalization to Manifolds

Frequently, a function $f$ to be approximated by a neural network models phenomena on (possibly low-dimensional) immersed submanifolds $\Gamma\subset\mathbb{R}^{d}$ of dimension $m<d$ . We next briefly outline how our main results can be extended to this situation. Since analogous results, for the case of wavelets as representation systems, appear already in [49], we will allow ourselves to be somewhat informal.

Suppose that $f:\Gamma\to\mathbb{R}$ is compactly supported. Let $(U_{i})_{i\in\mathbb{N}}\subset\Gamma$ be an open cover of $\Gamma$ such that for each $i\in\mathbb{N}$ the manifold patch $U_{i}$ can be parametrized as the graph of a function over a subset of the Euclidean coordinates, i.e., there exist coordinates $x_{d_{1}},\dots,x_{d_{m}}$ , open sets $V_{i}\subset\mathbb{R}^{m}$ , and smooth mappings

[TABLE]

such that

[TABLE]

Take a smooth partition of unity $(h_{i})_{i\in\mathbb{N}}$ , where $h_{i}:\Gamma\to\mathbb{R}$ is smooth with $\mathrm{supp}(h_{i})\subset\overline{U_{i}}$ and $\sum_{i\in\mathbb{N}}h_{i}=1$ . Define the localization of $f$ to $U_{i}$ by $f_{i}:=fh_{i}$ such that

[TABLE]

Every $f_{i}:U_{i}\to\mathbb{R}$ can be reparametrized to

[TABLE]

Suppose that there exist $L,M\in\mathbb{N}$ and neural networks $\tilde{\Phi}_{i}\in\mathcal{N}\mathcal{N}_{L,M,m,\rho}$ such that

[TABLE]

Then, we can construct a neural network $\Phi_{i}\in\mathcal{N}\mathcal{N}_{L,M+md,d,\rho}$ according to

[TABLE]

where $P_{i}$ denotes the orthogonal projection of $x$ onto the coordinates $(x_{d_{1}},\dots,x_{d_{m}})$ . Since $P_{i}$ is linear, $\Phi_{i}$ is a neural network. Moreover, since $P_{i}$ is the inverse of the diffeomorphism $\Xi_{i}$ , we get

[TABLE]

with $C>0$ depending on the curvature of $\Gamma|_{U_{i}}$ only. Now we may build a neural network $\Phi$ by setting $\Phi:=\sum_{i\in\mathbb{N}}\Phi_{i}$ . Combining (52) with the observation that, owing to the compact support of $f$ , only a finite number of summands appears in the definition of $f$ , we have constructed a neural network $\Phi$ which approximates $f$ on $\Gamma$ . In summary, we observe the following.

Whenever a function class $\mathcal{C}$ is invariant with respect to diffeomorphisms (in our construction the functions $\Xi_{i}$ ) and multiplication by smooth functions (in our construction the functions $h_{i}$ ), then approximation results on $\mathbb{R}^{m}$ can be lifted to approximation results on $m$ -dimensional submanifolds $\Gamma\subset\mathbb{R}^{d}$ .

Such invariances are, in particular, satisfied for all function classes characterized by a particular smoothness behavior, for example, the class of cartoon-functions as studied in Section 6.
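The lifting step of this section can be illustrated numerically. The following is a minimal sketch, assuming that $P_{i}$ is a coordinate-selection matrix and that it is absorbed into the first affine map of the chart network; the helper names (`make_network`, `coordinate_projection`, `lift`) and all numerical values are hypothetical and only serve the illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def make_network(weights, biases):
    """Return x -> W_L rho(... rho(W_1 x + b_1) ...) + b_L for the given layers."""
    def phi(x):
        h = np.asarray(x, dtype=float)
        for W, b in zip(weights[:-1], biases[:-1]):
            h = relu(W @ h + b)
        return weights[-1] @ h + biases[-1]
    return phi

def coordinate_projection(coords, d):
    """Matrix of the projection of R^d onto the coordinates listed in `coords`."""
    P = np.zeros((len(coords), d))
    P[np.arange(len(coords)), coords] = 1.0
    return P

def lift(weights, biases, coords, d):
    """Absorb P_i into the first layer: W_1 becomes W_1 P_i, all else is unchanged."""
    P = coordinate_projection(coords, d)
    return [weights[0] @ P] + list(weights[1:]), list(biases)

# Toy chart network on R^m with m = 2, lifted to R^d with d = 4 (arbitrary values).
rng = np.random.default_rng(0)
m, d, coords = 2, 4, [1, 3]
weights = [rng.standard_normal((5, m)), rng.standard_normal((1, 5))]
biases = [rng.standard_normal(5), rng.standard_normal(1)]

phi_tilde = make_network(weights, biases)               # network on the chart domain
phi = make_network(*lift(weights, biases, coords, d))   # network on R^d

x = rng.standard_normal(d)
lifted_value = phi(x)
chart_value = phi_tilde(x[coords])
```

Since the projection is linear, the lifted map is again a network of the same depth, and evaluating it at $x$ agrees with evaluating the chart network at the selected coordinates of $x$. Summing finitely many such lifted networks gives the global approximant.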

8 Numerical Results

Our theoretical results show that neural networks realizing uniform approximation error $\varepsilon$ over a function class $\mathcal{C}\subset L^{2}(\mathbb{R}^{d})$ , $d\in\mathbb{N}$ , must obey a fundamental lower bound on the growth rate (as $\varepsilon\rightarrow 0$ ) of the number of edges of nonzero weight. One of the most widely used learning algorithms is stochastic gradient descent with the gradient computed via backpropagation [48]. The purpose of this section is to investigate how this algorithm fares relative to our lower bound.

Interestingly, our numerical experiments below indicate that for a fixed, sparsely connected, network topology inspired by the construction of bump functions according to (33) and (34), and with the ReLU as activation function, the stochastic gradient descent algorithm generates neural networks that achieve $M$ -edge approximation rates quite close to the fundamental limit.

The network topology we prescribe is depicted in Figure 3. The rationale for choosing this topology is as follows. As mentioned before, admissible smooth activation functions consist of smooth functions which equal a ReLU outside a compact interval. For this class of activation functions, the associated $\alpha$ -shearlet generators were constructed from a function $g$ as specified in (34). Choosing $p_{1}=p_{2}=1$ and $p_{3}=2$ in (33) yields hat functions $t$ . This construction implies that six nodes are required in the first layer in each subnetwork.

In Figure 3, we see four network realizations of $g$ in parallel. The output layer realizes a linear combination of the subnetworks.
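A single subnetwork of this kind can be sketched as follows. Formulas (33) and (34) are not reproduced in this section, so the sketch assumes that for $p_{1}=p_{2}=1$ , $p_{3}=2$ the hat function reads $t(x)=\rho(x)-2\rho(x-1)+\rho(x-2)$ and that $g(x_{1},x_{2})=\rho(t(x_{1})+t(x_{2})-1)$ ; both forms, and the names `hat` and `bump`, are assumptions made for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def hat(x):
    """Hat function t on [0, 2] with peak t(1) = 1, written with three ReLU nodes."""
    return relu(x) - 2.0 * relu(x - 1.0) + relu(x - 2.0)

# First layer of one subnetwork: three ReLU nodes per input coordinate, six in total.
W1 = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)
b1 = np.array([0.0, -1.0, -2.0, 0.0, -1.0, -2.0])
# Second layer: fixed weights combining the six nodes into t(x1) + t(x2) - 1.
w2 = np.array([1.0, -2.0, 1.0, 1.0, -2.0, 1.0])
b2 = -1.0

def bump(x):
    """g(x) = relu(t(x1) + t(x2) - 1) for x in R^2 (assumed form of the generator)."""
    h = relu(W1 @ np.asarray(x, dtype=float) + b1)
    return relu(w2 @ h + b2)
```

Under these assumptions, `bump` is supported in $[0,2]^{2}$ and peaks at $(1,1)$ . In the trained network the first-layer weights `W1` and `b1` are free parameters, so each subnetwork can realize a scaled, sheared, and shifted copy of this bump.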

We now train the network using the stochastic gradient descent algorithm. Following (34), the weights of the second layer remain fixed, and only the weights in the first and third layers are trained. Training is performed for two different functions: one is a function with a line singularity (Figure 4(a)), and the other is a cartoon-like function (Figure 5(a)). Specifically, we train the network by drawing samples $(x_{1},x_{2})$ from an equispaced grid in $[-1,1]^{2}$ . The resulting error is then backpropagated through the network. We repeat this procedure for different network sizes, i.e., for different numbers of subnetworks.
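The training scheme just described can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the code used for the experiments: the fixed second-layer weights `V` and bias `B2` assume the hat-function form $t(x)=\rho(x)-2\rho(x-1)+\rho(x-2)$ with $g=\rho(t(x_{1})+t(x_{2})-1)$ , the loss is taken to be the squared error on a single sample, and the step size, grid size, and initialization are arbitrary choices.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Fixed second-layer weights of every subnetwork (assumed form, see lead-in).
V = np.array([1.0, -2.0, 1.0, 1.0, -2.0, 1.0])
B2 = -1.0

def forward(W1, b1, c, x):
    """W1: (K, 6, 2), b1: (K, 6), c: (K,). Returns output and cached activations."""
    pre = W1 @ x + b1            # (K, 6) first-layer pre-activations
    h = relu(pre)
    z = h @ V + B2               # (K,) second layer, weights are not trained
    g = relu(z)                  # (K,) subnetwork outputs
    return c @ g, (pre, z, g)

def loss(W1, b1, c, x, y):
    return 0.5 * (forward(W1, b1, c, x)[0] - y) ** 2

def sgd_step(W1, b1, c, x, y, lr):
    """One SGD step on 0.5 * (output - y)^2 for a single sample (x, y)."""
    out, (pre, z, g) = forward(W1, b1, c, x)
    err = out - y
    grad_c = err * g                                  # third (output) layer
    dz = err * c * (z > 0)                            # back through the second ReLU
    dpre = dz[:, None] * V[None, :] * (pre > 0)       # back through fixed V, first ReLU
    grad_W1 = dpre[:, :, None] * x[None, None, :]
    return W1 - lr * grad_W1, b1 - lr * dpre, c - lr * grad_c

def train(W1, b1, c, target, n_grid=33, lr=1e-2, epochs=5, seed=0):
    """Plain SGD over samples drawn from an equispaced grid in [-1, 1]^2."""
    axis = np.linspace(-1.0, 1.0, n_grid)
    grid = np.stack(np.meshgrid(axis, axis), axis=-1).reshape(-1, 2)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for idx in rng.permutation(len(grid)):
            x = grid[idx]
            W1, b1, c = sgd_step(W1, b1, c, x, target(x), lr)
    return W1, b1, c

# One subnetwork (K = 1), initialized to a hat-type bump on [0, 2]^2 (illustrative values).
W1 = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)[None]
b1 = np.array([[0.0, -1.0, -2.0, 0.0, -1.0, -2.0]])
c = np.array([0.5])
x, y = np.array([0.9, 1.2]), 1.0
before = loss(W1, b1, c, x, y)
after = loss(*sgd_step(W1, b1, c, x, y, lr=1e-2), x, y)
```

Only `W1`, `b1`, and `c` are returned by `sgd_step`, so the second layer stays fixed by construction. Larger networks correspond to a larger number `K` of subnetworks along the leading axis of `W1`, `b1`, and `c`.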

We start by discussing the results for the function with a line singularity depicted in Figure 4(a). The approximation error corresponding to the trained neural network is shown in Figure 4(b). The faster than linear decay of the approximation error in the semi-logarithmic scale indicates faster than exponential decay with respect to the number of edges. This is consistent with the best $M$ -term approximation rate that ridgelets yield for piecewise constant functions with line singularities, see [4].
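To make the reading of such error plots concrete, the following sketch estimates decay rates from pairs of edge counts and errors. It uses synthetic data only (the arrays `err_poly` and `err_exp` are made up for illustration and are not the measured errors of Figure 4(b)); the function names are hypothetical.

```python
import numpy as np

def loglog_slope(M, err):
    """Least-squares slope of log(err) against log(M); about -r for err ~ M^{-r}."""
    return np.polyfit(np.log(M), np.log(err), 1)[0]

def semilog_slope(M, err):
    """Least-squares slope of log(err) against M; about -a for err ~ exp(-a M)."""
    return np.polyfit(M, np.log(err), 1)[0]

M = np.array([50.0, 100.0, 200.0, 400.0, 800.0])
err_poly = 3.0 * M ** -1.0          # synthetic polynomial decay
err_exp = 2.0 * np.exp(-0.01 * M)   # synthetic exponential decay

rate_poly = loglog_slope(M, err_poly)
rate_exp = semilog_slope(M, err_exp)
```

Exponential decay appears as a straight line in the semi-logarithmic scale, so a curve that bends downward there decays faster than exponentially. Polynomial decay $M^{-r}$ appears as a straight line of slope $-r$ in the doubly logarithmic scale.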

It is interesting to observe that the trained subnetworks yield $\alpha$ -molecules for $\alpha=0$ (see Figures 4(c)-(e)). These functions are constant along one direction and vary along another, hence can be considered part of a ridgelet system, which is, in fact, an optimally sparsifying representation system for line singularities. Moreover, the orientation of the three learned ridge functions matches that of the original function.

In the second experiment, we draw samples from the function depicted in Figure 5(a) below, which exhibits a curvilinear singularity. Figures 5(c)-(e) show that the corresponding trained subnetworks resemble anisotropic molecules with different scales and of different orientations. We report, without showing the results, that the decay rate of the corresponding approximation error obtained when simply training with different network sizes did not come close to the rate of $M^{-1}$ predicted by our theory. However, with a slight adaptation one obtains the result of Figure 5(b), which demonstrates a decay of roughly $M^{-1}$ .

The specifics of this adaptation are as follows: We first train a large network with $\sim 10000$ edges, again by stochastic gradient descent. Then, the weights in the last layer are optimized using the Lasso [51] to obtain a sparse weight vector $c^{*}$ . We then pick the $M$ largest coefficients of $c^{*}$ and compute the corresponding weighted sum of the associated subnetworks. The resulting approximation error is shown in Figure 5(b).

Finally, we investigate whether the approximation characteristics delivered by this procedure are similar to what would be obtained by best $M$ -term approximation with standard shearlet systems. Recall that shearlet elements at high scales tend to cluster around singularities [28, 34]. Figures 5(g)-(i) depict the corresponding results. Specifically, Figure 5(g) shows the weighted sum of those subnetworks that have the largest support. In Figure 5(h), we show weighted sums of subnetworks with medium-sized support, and in Figure 5(i) we sum up only the subnetworks with the smallest supports. We observe that, indeed, subnetworks of large support approximate the smooth part of the underlying function, whereas the subnetworks associated to small supports resolve the jump singularity.
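The pruning step of the second experiment (Lasso on the last-layer weights, then keeping the $M$ largest coefficients) can be sketched as follows. This is a minimal illustration under assumptions: the Lasso is solved by plain iterative soft thresholding, which is one possible solver and not necessarily the one used for Figure 5(b); the matrix `G` is a synthetic stand-in for the outputs of the trained subnetworks; and the regularization parameter is arbitrary.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(G, y, lam, n_iter=500):
    """Minimize 0.5 * ||G c - y||^2 + lam * ||c||_1 by iterative soft thresholding."""
    step = 1.0 / np.linalg.norm(G, 2) ** 2      # 1 / Lipschitz constant of the gradient
    c = np.zeros(G.shape[1])
    for _ in range(n_iter):
        c = soft_threshold(c - step * (G.T @ (G @ c - y)), step * lam)
    return c

def keep_largest(c, M):
    """Zero out all but the M >= 1 entries of c with largest magnitude."""
    pruned = np.zeros_like(c)
    idx = np.argsort(np.abs(c))[-M:]
    pruned[idx] = c[idx]
    return pruned

# Synthetic stand-in: column k of G holds subnetwork k evaluated at the sample points.
rng = np.random.default_rng(0)
G, _ = np.linalg.qr(rng.standard_normal((200, 40)))   # orthonormal columns, toy data only
c_true = np.zeros(40)
c_true[[3, 17, 29]] = [5.0, -4.0, 3.0]
y = G @ c_true

c_star = lasso_ista(G, y, lam=0.1)     # sparse last-layer weights
c_M = keep_largest(c_star, M=3)        # M largest coefficients
approx = G @ c_M                       # weighted sum of the selected subnetworks
```

Repeating the last two lines for a range of values of `M` and recording the norm of `approx - y` yields an error curve of the kind discussed above. With real subnetwork outputs the columns of `G` are not orthonormal, so more iterations or a dedicated solver may be needed.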

Acknowledgments

The authors would like to thank J. Bruna, E. Candès, M. Genzel, S. Güntürk, Y. LeCun, K.-R. Müller, H. Rauhut, and F. Voigtländer for interesting discussions, and D. Perekrestenko for very detailed and insightful comments on the manuscript. G. K. and P. P. are grateful to the Faculty of Mathematics at the University of Vienna for the hospitality and support during their visits. Moreover, G. K. thanks the Department of Mathematics at Stanford University whose support allowed for completion of a portion of this work. G. K. acknowledges partial support by the Einstein Foundation Berlin, the Einstein Center for Mathematics Berlin (ECMath), the European Commission-Project DEDALE (contract no. 665044) within the H2020 Framework Program, DFG Grant KU 1446/18, DFG-SPP 1798 Grants KU 1446/21 and KU 1446/23, and by the DFG Research Center Matheon “Mathematics for Key Technologies”. G. K. and P. P. acknowledge support by the DFG Collaborative Research Center TRR 109 “Discretization in Geometry and Dynamics”.

Bibliography


[1] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory, 39(3):930–945, 1993.
[2] A. R. Barron. Approximation and estimation bounds for artificial neural networks. Mach. Learn., 14(1):115–133, 1994.
[3] E. J. Candès. Ridgelets: Theory and Applications, 1998. Ph.D. thesis, Stanford University.
[4] E. J. Candès. Ridgelets and the representation of mutilated Sobolev functions. SIAM J. Math. Anal., 33(2):347–368, 2001.
[5] E. J. Candès and D. L. Donoho. New tight frames of curvelets and optimal representations of objects with piecewise $C^{2}$ singularities. Comm. Pure Appl. Math., 57:219–266, 2002.
[6] C. K. Chui, X. Li, and H. N. Mhaskar. Neural networks for localized approximation. Math. Comp., 63(208):607–623, 1994.
[7] A. Cohen, W. Dahmen, I. Daubechies, and R. A. De Vore. Tree approximation and optimal encoding. Appl. Comput. Harmon. Anal., 11(2):192–226, 2001.
[8] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728, 2016.
