Depth Separations in Neural Networks: What is Actually Being Separated?

Itay Safran; Ronen Eldan; Ohad Shamir

arXiv:1904.06984·cs.LG·June 3, 2021


TL;DR

This paper investigates depth separations in neural networks for $\mathcal{O}(1)$-Lipschitz radial functions. It shows that depth 2 networks can approximate such functions with width polynomial in the dimension for any constant accuracy, contrary to the intuition suggested by existing depth separation results.

Contribution

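The paper demonstrates that depth 2 networks can approximate $\mathcal{O}(1)$-Lipschitz radial functions with size polynomial in the dimension (for constant accuracy) or in the inverse accuracy (for constant dimension), but not in both simultaneously. Existing depth separation techniques therefore do not extend to this setting.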

Findings

01

Depth 2 networks can approximate $\mathcal{O}(1)$-Lipschitz radial functions with size polynomial in the dimension $d$, for every constant accuracy $\epsilon$.

02

Approximation is also possible with size polynomial in $1/\epsilon$ for fixed dimension.

03

Simultaneous polynomial dependence on both dimension and inverse accuracy is impossible.

Abstract

Existing depth separation results for constant-depth networks essentially show that certain radial functions in $R^{d}$ , which can be easily approximated with depth $3$ networks, cannot be approximated by depth $2$ networks, even up to constant accuracy, unless their size is exponential in $d$ . However, the functions used to demonstrate this are rapidly oscillating, with a Lipschitz parameter scaling polynomially with the dimension $d$ (or equivalently, by scaling the function, the hardness result applies to $O (1)$ -Lipschitz functions only when the target accuracy $ϵ$ is at most $poly (1/ d)$ ). In this paper, we study whether such depth separations might still hold in the natural setting of $O (1)$ -Lipschitz radial functions, when $ϵ$ does not scale with $d$ . Perhaps surprisingly, we show that the answer is negative: In contrast to the…


Equations

$$\sup_{\mathbf{x}\in B_{d}}\left|n(\mathbf{x})-f(\mathbf{x})\right|\leq\epsilon,$$

$$h(x)=a+\sum_{i=1}^{w}\alpha_{i}\sigma(\beta_{i}x-\gamma_{i})$$

$$\sup_{x\in\mathbb{R}}\left|f(x)-h(x)\right|\leq\delta.$$

$$\sup_{\mathbf{x}\in B_{d}}\left|\sum_{i=1}^{w}v_{i}\sigma(\mathbf{w}_{i}^{\top}\mathbf{x}+b_{i})+b_{0}-f(\mathbf{x})\right|\leq\epsilon,$$

$$v_{f,2}=\int_{\mathbb{R}^{d}}\left|\left|\omega\right|\right|_{1}^{2}\left|\mathcal{F}(f)(\omega)\right|d\omega<\infty.$$

$$\sup_{\mathbf{x}\in D}\left|f(\mathbf{x})-f_{n}(\mathbf{x})\right|\leq c\,v_{f,2}\sqrt{d+\log n}\;n^{-1/2-1/d},$$

$$\sup_{\mathbf{x}\in B_{d}}\left|f(\mathbf{x})-N(\mathbf{x})\right|<\epsilon.$$

$$w\left(d,101\exp(2)\pi^{3}d^{3}\right)\geq 2^{c_{2}d\log d}.$$

$$w\left(d,c_{2}d^{6}\right)\geq c_{3}\exp(c_{4}d).$$

$$\sup_{\mathbf{x}\in B_{d}}\left|N(\mathbf{x})-F_{d}(\left|\left|\mathbf{x}\right|\right|)\right|\leq\epsilon,$$

$$F_{d}(z)=\sum_{k=0}^{\infty}\frac{(d-2)!!}{(2k)!!\,(d+2k-2)!!}z^{2k}.$$

$$I_{\sin^{2}\frac{\theta}{2}}\left(\frac{d-1}{2},\frac{d-1}{2}\right)$$

$$\mathbb{P}\left[W^{\top}\mathbf{x}\in[-r,t]\right]=I_{\frac{t}{2r}+\frac{1}{2}}\left(\frac{d-1}{2},\frac{d-1}{2}\right),$$

$$\mathbb{P}\left[X\in\left[0,\frac{t}{2r}+\frac{1}{2}\right]\right]=\mathbb{P}\left[W^{\top}\mathbf{x}\in[-r,t]\right]=I_{\frac{t}{2r}+\frac{1}{2}}\left(\frac{d-1}{2},\frac{d-1}{2}\right),$$

$$\mathbb{P}\left[X\in[0,x]\right]=I_{x}\left(\frac{d-1}{2},\frac{d-1}{2}\right).$$

$$F_{d}(z):=\sum_{n=0}^{\infty}\frac{(d-2)!!}{(2n)!!\,(d+2n-2)!!}z^{2n}=\exp(-z)\left(\sum_{k=0}^{\infty}\left(\prod_{j=0}^{k-1}\frac{(d-1)/2+j}{d-1+j}\right)\frac{2^{k}}{k!}z^{k}\right).$$

$$\exp(-z)\left(\sum_{k=0}^{\infty}\left(\prod_{j=0}^{k-1}\frac{(d-1)/2+j}{d-1+j}\right)\frac{2^{k}}{k!}z^{k}\right)=\dots$$

$$a_{n}(d):=\sum_{k=0}^{n}\frac{(-1)^{k}2^{k}\left(\frac{d-1}{2}\right)_{k}}{(n-k)!\,k!\,(d-1)_{k}}=\begin{cases}0&n\text{ odd}\\ \frac{(d-2)!!}{n!!\,(d+n-2)!!}&n\text{ even}\end{cases}.$$

$$a_{n}(d)=\frac{1}{n!}\sum_{k=0}^{n}(-1)^{k}\binom{n}{k}\frac{\left(\frac{d-1}{2}\right)_{k}}{(d-1)_{k}}2^{k}=\frac{1}{n!}\,{}_{2}F_{1}\left(-n,\frac{d-1}{2};d-1;2\right),$$

$${}_{2}F_{1}(a,b;c;z)=\sum_{n=0}^{\infty}\frac{(a)_{n}(b)_{n}}{(c)_{n}}\frac{z^{n}}{n!}$$

$${}_{2}F_{1}\left(-n,\frac{d-1}{2};d-1;2\right)=\frac{\Gamma(d-1)}{\Gamma\left(\frac{d-1}{2}\right)^{2}}\int_{0}^{1}t^{\frac{d-3}{2}}(1-t)^{\frac{d-3}{2}}(1-2t)^{n}dt.$$

$$\int_{0}^{1}t^{\frac{d-3}{2}}(1-t)^{\frac{d-3}{2}}(1-2t)^{n}dt=2^{n}\int_{-0.5}^{0.5}(0.25-x^{2})^{\frac{d-3}{2}}x^{n}dx.$$

$$\int_{-0.5}^{0.5}(0.25-x^{2})^{\frac{d-3}{2}}x^{n}dx=\frac{n-1}{d-1}\int_{-0.5}^{0.5}(0.25-x^{2})^{\frac{d-1}{2}}x^{n-2}dx.$$

$$\int_{-0.5}^{0.5}(0.25-x^{2})^{\frac{d-3}{2}}x^{n}dx=\frac{n-1}{d-1}\cdot\frac{n-3}{d+1}\cdot\dots\cdot\frac{1}{d+n-3}\int_{-0.5}^{0.5}(0.25-x^{2})^{\frac{d+n-3}{2}}dx.$$

$$\int_{-0.5}^{0.5}(0.25-x^{2})^{\frac{d+n-3}{2}}dx=B\left(\frac{d+n-1}{2},\frac{d+n-1}{2}\right)=\frac{\Gamma\left(\frac{d+n-1}{2}\right)^{2}}{\Gamma(d+n-1)},$$

$$a_{n}(d)=\frac{(n-1)!!}{n!!\,(n-1)!!}\cdot\frac{\Gamma(d-1)}{\Gamma\left(\frac{d-1}{2}\right)^{2}}\cdot 2^{n}\cdot\frac{1}{d-1}\cdot\frac{1}{d+1}\cdot\dots\cdot\frac{1}{d+n-3}\cdot\frac{\Gamma\left(\frac{d+n-1}{2}\right)^{2}}{\Gamma(d+n-1)}$$


Full text

Depth Separations in Neural Networks: What is Actually Being Separated?

Accepted for presentation at the Conference on Learning Theory (COLT) 2019.

Itay Safran Ronen Eldan Ohad Shamir

Weizmann Institute of Science

{itay.safran,ronen.eldan,ohad.shamir}@weizmann.ac.il

Abstract

Existing depth separation results for constant-depth networks essentially show that certain radial functions in $\mathbb{R}^{d}$ , which can be easily approximated with depth $3$ networks, cannot be approximated by depth $2$ networks, even up to constant accuracy, unless their size is exponential in $d$ . However, the functions used to demonstrate this are rapidly oscillating, with a Lipschitz parameter scaling polynomially with the dimension $d$ (or equivalently, by scaling the function, the hardness result applies to $\mathcal{O}(1)$ -Lipschitz functions only when the target accuracy $\epsilon$ is at most $\text{poly}(1/d)$ ). In this paper, we study whether such depth separations might still hold in the natural setting of $\mathcal{O}(1)$ -Lipschitz radial functions, when $\epsilon$ does not scale with $d$ . Perhaps surprisingly, we show that the answer is negative: In contrast to the intuition suggested by previous work, it is possible to approximate $\mathcal{O}(1)$ -Lipschitz radial functions with depth $2$ , size $\text{poly}(d)$ networks, for every constant $\epsilon$ . We complement it by showing that approximating such functions is also possible with depth $2$ , size $\text{poly}(1/\epsilon)$ networks, for every constant $d$ . Finally, we show that it is not possible to have polynomial dependence in both $d,1/\epsilon$ simultaneously. Overall, our results indicate that in order to show depth separations for expressing $\mathcal{O}(1)$ -Lipschitz functions with constant accuracy – if at all possible – one would need fundamentally different techniques than existing ones in the literature.

1 Introduction

In the past few years, quite a few theoretical works have explored the beneficial effect of depth on increasing the expressiveness of neural networks (e.g., Delalleau and Bengio (2011); Martens et al. (2013); Martens and Medabalimi (2014); Montufar et al. (2014); Cohen et al. (2015); Telgarsky (2016); Eldan and Shamir (2016); Liang and Srikant (2016); Poggio et al. (2016); Poole et al. (2016); Shaham et al. (2016); Yarotsky (2016); Daniely (2017); Safran and Shamir (2017)). These works mostly focus on depth separations: namely showing that there are functions which can be expressed by a small network of a given depth, but cannot be approximated by shallower networks, even if their size is much larger. Perhaps the clearest manifestation of this is in separating depth $2$ and depth $3$ networks: There are functions $f$ and distributions $\mu$ on $\mathbb{R}^{d}$ , which are

• Hard to approximate with a depth $2$ network: $\mathbb{E}_{\mathbf{x}\sim\mu}\left[\left(N_{2}(\mathbf{x})-f(\mathbf{x})\right)^{2}\right]\geq c$ for some absolute $c>0$ , using any depth $2$ , width $\text{poly}(d)$ network $N_{2}(\mathbf{x}):=\sum_{i=1}^{\text{poly}(d)}u_{i}\sigma(\mathbf{w}_{i}^{\top}\mathbf{x}+b_{i})$ (for some parameters $\{u_{i},\mathbf{w}_{i},b_{i}\}$ and univariate activation function $\sigma$ ).

• Easy to approximate with a depth $3$ network: For any $\epsilon>0$ , it holds that $\mathbb{E}_{\mathbf{x}\sim\mu}[(N_{3}(\mathbf{x})-f(\mathbf{x}))^{2}]\leq\epsilon$ (or sometimes even $\sup_{\mathbf{x}}|N_{3}(\mathbf{x})-f(\mathbf{x})|\leq\epsilon$ ) for some depth $3$ , width $\text{poly}(d,1/\epsilon)$ neural network $N_{3}(\mathbf{x}):=\sum_{i=1}^{\text{poly}(d,1/\epsilon)}u_{i}\sigma\left(N_{2}^{i}(\mathbf{x})+b_{i}\right)$ (where each $N_{2}^{i}$ is a depth $2$ , width $\text{poly}(d,1/\epsilon)$ network, and $\sigma$ is a standard activation such as a ReLU).

Eldan and Shamir (2016) (as well as a related construction in Safran and Shamir (2017)) prove such a lower bound unconditionally, whereas Daniely (2017) shows this with a simple proof, assuming that the parameters of the network cannot be too large. Moreover, these “hard” functions have a simple form: They are essentially radial functions of the form $f(\mathbf{x})=g(\left|\left|\mathbf{x}\right|\right|)$ for a univariate function $g$ (Eldan and Shamir (2016) use a radial function; Daniely (2017) uses a function which is easily reduced to a radial one, see next paragraph). Such radial functions are of interest in learning theory, since there are function classes that are essentially a mixture of radial functions (e.g. Gaussian kernels), and they are essential primitives in expressing functions which involve Euclidean distances. The intuition for the above separations is that radial functions can be easily approximated with depth $3$ networks, by first approximating the $\mathbf{x}\mapsto\left|\left|\mathbf{x}\right|\right|^{2}=\sum_{i}x_{i}^{2}$ function in the first layer, and then approximating the univariate function $g(\sqrt{\cdot})$ in the next layers. In contrast, approximating high-dimensional radial functions with depth $2$ networks appears to be difficult, since they are, in a sense, the furthest away from functions which depend on only a single direction (see Fig. 1). Overall, these results appear to provide a clear separation between the required widths of depth $2$ and depth $3$ networks, in terms of the dimension $d$ .

However, a closer inspection of the constructions above reveals that in fact, this is not so clear. The reason is that the functions which are shown to be provably hard for depth $2$ networks are rapidly oscillating, and require a Lipschitz constant (at least) polynomial in $d$ to even approximate: In Eldan and Shamir (2016), the function has the form $\mathbf{x}\mapsto\sum_{i=1}^{\Theta(d^{2})}\epsilon_{i}\mathbbm{1}(\left|\left|\mathbf{x}\right|\right|\in[a_{i},b_{i}])$ , over a distribution supported on $\mathbb{R}^{d}$ (where $\epsilon_{i}\in\{-1,+1\}$ , $\mathbbm{1}$ is the indicator function, and $[a_{i},b_{i}]$ are disjoint intervals in the range $\Theta(\sqrt{d})$ ). In Daniely (2017), the function used is easily reduced to $\sin(2\pi d^{3}\left|\left|\mathbf{x}\right|\right|^{2})$ (see proof of Thm. 4). Having such rapidly oscillating functions is not always a natural regime, since we are often interested in functions whose Lipschitz parameter is independent of the dimension. For example, in learning theory, this is actually needed to obtain dimension-free learnability results for convex functions (Shalev-Shwartz and Ben-David, 2014, Chapter 12). Moreover, there is evidence that functions which oscillate too rapidly can be computationally difficult for neural networks to learn with standard gradient-based methods (e.g., Song et al., 2017; Shalev-Shwartz et al., 2017; Shamir, 2018; Abbe and Sandon, 2018), so Lipschitz functions are arguably more interesting from a learning perspective. Overall, we are led to the following natural question:

Can we show a depth $2$ vs. $3$ separation result in terms of the dimension $d$ , even for approximating $\mathcal{O}(1)$ -Lipschitz functions?

In other words, are there $\mathcal{O}(1)$ -Lipschitz functions which cannot be approximated by depth $2$ , width- $\text{poly}(d)$ networks, but can be approximated by depth $3$ , width- $\text{poly}(d)$ networks?

To study this, we first notice that it is easy to reduce any hardness result for approximating $L$ -Lipschitz functions to accuracy $\epsilon$ , to hardness of approximating $1$ -Lipschitz functions to accuracy $\epsilon/L$ , simply by scaling the functions by $1/L$ . Moreover, we can even reduce the hardness result to a $1$ -Lipschitz function with accuracy $\epsilon$ , by dilating the measure we are using by a factor $L$ (see Appendix A for a formal statement). However, now the lower bounds require either the accuracy $\epsilon$ or the diameter of the support of the distribution to scale polynomially with $d$ . As a result, when saying that the depth $2$ networks require width super-polynomial in $d$ , it is not clear whether the hardness really comes from the dimension $d$ , or perhaps from other parameters which are being forced to scale with it, such as the accuracy $\epsilon$ . Thus, we rephrase our question as follows:

Can we show a depth $2$ vs. depth $3$ neural network separation result in terms of the dimension $d$ , for approximating $\mathcal{O}(1)$ -Lipschitz functions up to constant accuracy $\epsilon$ on a domain of bounded radius (all independent of $d$ )?

The intuition described earlier (on the difficulty of approximating radial functions in high dimensions) seems to suggest that the answer is positive.

Our Results. In this paper, we show that perhaps surprisingly, the answer to the question above is actually negative (at least for radial functions): For any constant $\epsilon$ , it is possible to approximate radial functions using $\text{poly}(d)$ -width, depth $2$ networks. More precisely, our upper bound on the required size is $\exp\left(\mathcal{O}\left(\epsilon^{-9}\log(d/\epsilon)\right)\right)$ (see Thm. 1). We also complement this by showing that for constant dimension $d$ , approximation of any $\mathcal{O}(1)$ -Lipschitz radial function is possible with $\text{poly}(1/\epsilon)$ -width, depth $2$ networks: Specifically, the bound is $\exp\left(\mathcal{O}\left(d\log(1/\epsilon)\right)\right)$ (see Thm. 3). Both bounds are $L_{\infty}$ -type approximation results, with respect to the unit ball: Namely, given a function $f$ , we show how to find a neural network $n(\cdot)$ such that

$$\sup_{\mathbf{x}\in B_{d}}\left|n(\mathbf{x})-f(\mathbf{x})\right|\leq\epsilon,$$

where $B_{d}:=\{\mathbf{x}\in\mathbb{R}^{d}:\left|\left|\mathbf{x}\right|\right|\leq 1\}$ . This is a stronger approximation guarantee than $L_{2}$ -type approximation guarantees (where we bound $\mathbb{E}_{\mathbf{x}\sim\mu}[(n(\mathbf{x})-f(\mathbf{x}))^{2}]$ for some distribution $\mu$ on $B_{d}$ ), since a bound on the former implies a bound on the latter. Furthermore, we show that any even radial monomial, namely a radial function of the form $\mathbf{x}\mapsto\left|\left|\mathbf{x}\right|\right|^{2k}$ , for any fixed natural $k$ , can be approximated to accuracy $\epsilon$ using a depth $2$ network of width polynomial in both $d$ and $1/\epsilon$ . Finally, we formally prove (using a reduction from Eldan and Shamir (2016); Daniely (2017), and using their assumptions) that it is impossible to obtain a general polynomial dependence on both $d$ and $1/\epsilon$ in our setting (see Thm. 4 and Thm. 5). Overall, these results show that to approximate radial functions with depth $2$ networks, their width can be polynomial in either $d$ or $1/\epsilon$ , but generally not in both. Putting this in the context of known depth separations, our results indicate that the difficulty in approximating the “hard” functions used in separating depth 2 from depth 3 stems from both the input dimension $d$ and the accuracy parameter $\epsilon$ simultaneously, and not from either one alone.

It is interesting to note that such trade-offs between dimension and accuracy also appear in very different areas of learning theory. For example, consider the classic problem of agnostically learning halfspaces up to excess error $\epsilon$ in $d$ dimensions (see for example Kalai et al. (2008)): It is folklore that for well-behaved input distributions, one can learn a halfspace in runtime $\text{poly}(1/\epsilon)$ for constant $d$ (simply by creating an $\epsilon$ -net of all possible halfspaces, and picking the best one on the training data). On the other hand, it is also known that one can learn in runtime $\text{poly}(d)$ for any constant $\epsilon$ , at least for certain input distributions (Kalai et al., 2008). However, there is evidence that being polynomial in both $d,1/\epsilon$ is not possible in those settings (Klivans and Kothari, 2014).

Finally, we emphasize that our results still do not fully settle the question stated above, since there might be depth separation results using functions which cannot be reduced to radial ones. However, such results do not exist at the present time, and we believe that our observations may also be relevant for more general families of functions. In any case, we hope our paper will motivate and guide further study of this question.

2 Main Results

In this section, we present our main results and the high-level proof components. The remainder of the proofs are provided in Sec. 3.

2.1 Approximation with Width $\text{poly}(d)$ Networks

We first present our formal result, implying that radial functions can be approximated with depth $2$ , width $\text{poly}(d)$ networks, to any constant accuracy $\epsilon$ . We prove this result for networks employing any activation function $\sigma(\cdot)$ which satisfies the following mild assumption (taken from Eldan and Shamir (2016)), which implies that the activation can be used to approximate univariate functions well. This assumption is satisfied for all standard activations, such as ReLU and sigmoidal functions (see reference above for further discussion):

Assumption 1.

Given the activation function $\sigma$ , there is a constant $c_{\sigma}\geq 1$ (depending only on $\sigma$ ) such that the following holds: For any $L$ -Lipschitz function $f:\mathbb{R}\to\mathbb{R}$ which is constant outside a bounded interval $[-R,R]$ , and for any $\delta$ , there exist scalars $a,\left\{\alpha_{i},\beta_{i},\gamma_{i}\right\}_{i=1}^{w}$ , where $w\leq c_{\sigma}\frac{RL}{\delta}$ , such that the function

$$h(x)=a+\sum_{i=1}^{w}\alpha_{i}\sigma(\beta_{i}x-\gamma_{i})$$

satisfies

$$\sup_{x\in\mathbb{R}}\left|f(x)-h(x)\right|\leq\delta.$$
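For the ReLU activation, the kind of construction Assumption 1 asks for can be sketched with a piecewise-linear interpolant. The snippet below is only an illustration, not the argument from Eldan and Shamir (2016): the helper name `build_relu_approximation`, the uniform grid, and the test function are all assumptions made for this example. It uses $\lceil RL/\delta\rceil+1$ neurons with $\beta_{i}=1$ and $\gamma_{i}$ equal to the grid points, and the interpolation error of an $L$ -Lipschitz function on a grid of spacing $2R/w$ is at most $LR/w\leq\delta$ .

```python
import math


def relu(z):
    return max(0.0, z)


def build_relu_approximation(f, R, L, delta):
    """Return (h, width) with h(x) = a + sum_i alpha_i * relu(x - gamma_i).

    Hypothetical helper for illustration: interpolates f linearly on a
    uniform grid over [-R, R]; f is assumed L-Lipschitz and constant
    outside [-R, R].
    """
    w = max(1, math.ceil(R * L / delta))          # number of grid cells
    knots = [-R + 2.0 * R * j / w for j in range(w + 1)]
    values = [f(t) for t in knots]
    slopes = [(values[j + 1] - values[j]) / (knots[j + 1] - knots[j])
              for j in range(w)]
    # Each neuron changes the slope of h at its knot.
    alphas, previous = [], 0.0
    for s in slopes:
        alphas.append(s - previous)
        previous = s
    alphas.append(-previous)                      # flat again to the right of R
    a = values[0]                                 # value of f to the left of -R

    def h(x):
        return a + sum(al * relu(x - t) for al, t in zip(alphas, knots))

    return h, len(alphas)


# Example target: a clipped sine, 1-Lipschitz and constant outside [-2, 2].
R, L, delta = 2.0, 1.0, 0.01
f = lambda x: math.sin(min(R, max(-R, x)))
h, width = build_relu_approximation(f, R, L, delta)
grid = [-3.0 + 6.0 * j / 5000 for j in range(5001)]
max_err = max(abs(f(x) - h(x)) for x in grid)
```

Here `max_err` measures the approximation error on a grid that extends beyond $[-R,R]$ , so it also checks that the network stays constant outside the interval.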

Our main result for this subsection is the following:

Theorem 1.

Suppose $\sigma:\mathbb{R}\to\mathbb{R}$ satisfies Assumption 1. Then for any $\epsilon>0$ and any $1$ -Lipschitz radial function $f(\mathbf{x})=\varphi(\left|\left|\mathbf{x}\right|\right|)$ , there exists a depth $2$ neural network with $\sigma$ activations and width $w\leq\exp\left(\mathcal{O}\left(\epsilon^{-9}\log(d/\epsilon)\right)\right)$ satisfying

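$$\sup_{\mathbf{x}\in B_{d}}\left|\sum_{i=1}^{w}v_{i}\sigma(\mathbf{w}_{i}^{\top}\mathbf{x}+b_{i})+b_{0}-f(\mathbf{x})\right|\leq\epsilon,$$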

where the big O notation hides a constant that depends solely on $\sigma$ .

We note that the $1/\epsilon^{9}$ exponent might be improvable to some smaller polynomial in $1/\epsilon$ (see proof for details), but overall the dependence on $1/\epsilon$ remains at least exponential.
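To see why the bound of Thm. 1 is polynomial in $d$ for constant $\epsilon$ , it may help to rewrite the exponential. In the short calculation below, $C$ is a symbol introduced only for this illustration, standing for the constant hidden in the big O notation:

```latex
\exp\left(C\epsilon^{-9}\log(d/\epsilon)\right)
  = \left(\frac{d}{\epsilon}\right)^{C\epsilon^{-9}}
```

For fixed $\epsilon$ this is a polynomial in $d$ of degree $C\epsilon^{-9}$ , whereas as a function of $1/\epsilon$ it grows faster than any polynomial.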

The proof of this theorem requires several intermediate results about the approximation capabilities of depth $2$ networks, some of which may be of independent interest. The high-level strategy is the following:

• First, we consider depth $2$ networks $\sum_{i}v_{i}\sigma(\mathbf{w}_{i}^{\top}\mathbf{x}+b_{i})$ , where $\sigma(z)=\exp(z)$ is the exponential function. Using properties of the beta distribution, we show that if the weights $\mathbf{w}_{i}$ are drawn uniformly and independently from the unit sphere (and $v_{i},b_{i}$ are fixed appropriately), then the resulting network $N$ satisfies $\mathbb{E}[N(\mathbf{x})]=F_{d}(\left|\left|\mathbf{x}\right|\right|)$ for some complicated function $F_{d}$ , which nevertheless depends only on the norm of $\mathbf{x}$ . Using concentration of measure, we show that the above implies $N(\mathbf{x})\approx F_{d}(\left|\left|\mathbf{x}\right|\right|)$ if the width is sufficiently large (Thm. 6).

• Next, we use Assumption 1 to show that we can construct a bounded-width network $N(\cdot)$ with any $\sigma$ -activation (not just an exponential one), such that $N(\mathbf{x})\approx F_{d}(\left|\left|\mathbf{x}\right|\right|)$ (Thm. 7).

• Using a Taylor series argument, we show that a careful linear combination of (not too many) scaled versions of $F_{d}(\left|\left|\mathbf{x}\right|\right|)$ allows us to approximate any even monomial $\left|\left|\mathbf{x}\right|\right|^{2k}$ in the norm of $\mathbf{x}$ . Since a linear combination of depth $2$ networks is still a depth $2$ network, this implies that we can approximate $\left|\left|\mathbf{x}\right|\right|^{2k}$ with some depth $2$ network, again with bounded width (Thm. 8).

• Finally, we use a quantitative version of Weierstrass’ approximation theorem, to show that we can approximate any Lipschitz radial function $\varphi(\left|\left|\mathbf{x}\right|\right|)$ (where $\varphi$ is defined on $[0,1]$ ) by a linear combination of even monomials (Lemma 4). Again, this implies that we can find a bounded-width depth $2$ network which approximates this radial function well.

2.2 Approximation with Width $\text{poly}(1/\epsilon)$ Networks

Having considered depth $2$ , width $\text{poly}(d)$ networks (for constant accuracy $\epsilon$ ), we now turn to consider the complementary setting, where the dimension $d$ is fixed, and we show how Lipschitz radial functions can be approximated by width $\text{poly}(1/\epsilon)$ networks. This setting is closer in spirit to universal approximation theorems for depth $2$ networks (namely, on how such networks can approximate any continuous function on a compact domain, if we allow exponential dependencies on $d$ ). Unfortunately, most such theorems are not quantitative in nature, and do not imply polynomial dependence on $\epsilon$ . A noteworthy exception is the line of work pioneered by Barron (see Barron (1993)), which provides quantitative approximation guarantees in terms of the width and moments of the Fourier transform of the target function $f$ . Our main technical contribution here is to show how we can translate such moment-based bounds to a bound applicable to any Lipschitz radial function. For concreteness, we will focus here on networks employing the common ReLU activations (i.e. $\sigma(z)=\max\{0,z\}$ ), although the technique is applicable more generally. We make use of the following recent result from Klusowski and Barron (2018, Theorem 2), which provides an $L_{\infty}$ approximation guarantee for ReLU networks:

Theorem 2 (Klusowski and Barron (2018)).

Let $D=[-1,1]^{d}$ . Suppose $f$ admits a Fourier representation $f(\mathbf{x})=\int_{\mathbb{R}^{d}}\exp(i\langle\mathbf{x},\omega\rangle)\mathcal{F}(f)(\omega)d\omega$ and

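$$v_{f,2}=\int_{\mathbb{R}^{d}}\left|\left|\omega\right|\right|_{1}^{2}\left|\mathcal{F}(f)(\omega)\right|d\omega<\infty.$$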

Then there exist depth $2$ ReLU networks $f_{n}$ , each of width $n+2$ such that for all $n$

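$$\sup_{\mathbf{x}\in D}\left|f(\mathbf{x})-f_{n}(\mathbf{x})\right|\leq c\,v_{f,2}\sqrt{d+\log n}\;n^{-1/2-1/d},$$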

for some universal constant $c>0$ .

Note that in their original theorem statement, Klusowski and Barron (2018) define the ReLU networks as having an additional linear $\langle a_{0},\mathbf{x}\rangle$ term, which for convenience we write as a difference of two ReLU neurons, $\langle a_{0},\mathbf{x}\rangle=\left[\langle a_{0},\mathbf{x}\rangle\right]_{+}-\left[\langle-a_{0},\mathbf{x}\rangle\right]_{+}$ , and thus omit from the theorem statement.
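The fact that a linear term costs only two ReLU neurons rests on the identity $z=[z]_{+}-[-z]_{+}$ . A minimal sketch of this, with the function name `linear_term` chosen purely for illustration:

```python
def relu(z):
    return max(0.0, z)


def linear_term(a0, x):
    """Evaluate <a0, x> with two ReLU neurons, one with weights a0 and one with -a0."""
    s = sum(a * v for a, v in zip(a0, x))
    return relu(s) - relu(-s)


positive_case = linear_term([1.0, -2.0], [3.0, 0.5])   # <a0, x> = 2.0
negative_case = linear_term([1.0, -2.0], [0.5, 3.0])   # <a0, x> = -5.5
```

Exactly one of the two neurons is active for any nonzero input, which is why the difference (rather than the sum) recovers the signed value.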

We now turn to formally state the main result of this subsection.

Theorem 3.

Suppose $f(\mathbf{x})=\varphi(\left|\left|\mathbf{x}\right|\right|)$ is a $1$ -Lipschitz radial function on $B_{d}$ . Then there exists a depth $2$ ReLU neural network $N$ , of width $n=\exp\left(\mathcal{O}\left(d\log\left(1/\epsilon\right)\right)\right)$ such that

$$\sup_{\mathbf{x}\in B_{d}}\left|f(\mathbf{x})-N(\mathbf{x})\right|<\epsilon.$$

The proof (in Sec. 3) utilizes Thm. 2, with the main challenge being that even for a $1$ -Lipschitz radial $f$ , the coefficient $v_{f,2}$ might be unbounded. Instead, we consider a smoothed approximation $g=f\star\gamma_{\epsilon^{2}/4d}$ , where $\star$ is the convolution operation and $\gamma_{\epsilon^{2}/4d}$ is the Gaussian pdf with mean $\mathbf{0}$ and covariance matrix $\frac{\epsilon^{2}}{4d}I$ . Since $f$ is Lipschitz, this function is $\mathcal{O}(\epsilon)$ -close to $f$ at any point $\mathbf{x}$ . Therefore, to approximate $f$ well, it is sufficient to approximate $g$ well. Moreover, since $g$ represents a convolution with a smooth function, then it is smooth, and therefore its Fourier transform has a rapidly decaying tail. This implies that the coefficient $v_{g,2}$ is bounded (in a manner exponential in $d$ but polynomial in $1/\epsilon$ ), and an application of Thm. 2 implies the result.
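The closeness of $g$ to $f$ can be checked in one line. The sketch below assumes that $f$ has been extended to a $1$ -Lipschitz function on all of $\mathbb{R}^{d}$ (this extension is an assumption of the illustration, needed for the convolution to be defined), and writes $Z$ for a Gaussian vector with density $\gamma_{\epsilon^{2}/4d}$ :

```latex
|g(\mathbf{x}) - f(\mathbf{x})|
  = \left|\mathbb{E}\left[f(\mathbf{x} - Z) - f(\mathbf{x})\right]\right|
  \leq \mathbb{E}\left[\|Z\|\right]
  \leq \sqrt{\mathbb{E}\left[\|Z\|^{2}\right]}
  = \sqrt{d \cdot \frac{\epsilon^{2}}{4d}}
  = \frac{\epsilon}{2}
```

The first inequality uses the Lipschitz property and the second is Jensen's inequality. The covariance scaling $\epsilon^{2}/4d$ is what makes the resulting bound independent of $d$ .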

2.3 Impossibility to Approximate with Width $\text{poly}(d,1/\epsilon)$ Networks

In this subsection, we complement our previous positive approximation results with negative results. Specifically, we provide two lower bounds, which imply that there are $1$ -Lipschitz radial functions, which cannot be approximated to accuracy $\epsilon$ on the unit ball $B_{d}$ , using depth $2$ , width $\text{poly}(d,1/\epsilon)$ networks (see Fig. 2). In a sense, this was already shown in Daniely (2017); Eldan and Shamir (2016), as discussed in the introduction. However, a bit of work is needed to apply them to our setting: For example, the result in Eldan and Shamir (2016) is for a radial function, but not a Lipschitz one, and the result in Daniely (2017) is not for a radial function.

Since our results are based on reductions from these papers, we need to make similar assumptions. In particular, we need to require either having an approximation on an unbounded domain, or that the approximating network’s parameters are at most exponential in $d$ . To the best of our knowledge, it remains a major open problem to prove a depth separation result without either of these two assumptions (namely, on a compact domain such as $B_{d}$ , and without restrictions on the magnitude of the parameters).

Theorem 4.

The following holds for some positive universal constants $c_{1},c_{2}$ , and any depth $2$ network employing a ReLU activation function. Consider the $1$ -Lipschitz function $f(\mathbf{x})=\frac{1}{2\pi d^{3}}\sin\left(2\pi d^{3}\left|\left|\mathbf{x}\right|\right|_{2}^{2}\right)$ on $B_{d}$ . Suppose $N$ is a depth $2$ network of width $w(d,1/\epsilon)$ , with weights bounded by $\frac{2^{d+1}}{2\pi d^{3}}$ , and satisfying $\sup_{\mathbf{x}\in B_{d}}\left|N(\mathbf{x})-f(\mathbf{x})\right|\leq\epsilon$ for any $\epsilon>0$ and any $d\geq 2$ . Then for any $d>c_{1}$ ,

$$w\left(d,101\exp(2)\pi^{3}d^{3}\right)\geq 2^{c_{2}d\log d}.$$

In particular, depth $2$ networks of width $\textnormal{poly}(d,1/\epsilon)$ cannot approximate $f$ to accuracy $\epsilon$ .

We remark that the impossibility result provided in the theorem above is in terms of $L_{\infty}$ -type approximation, namely $\sup_{\mathbf{x}}|N(\mathbf{x})-f(\mathbf{x})|$ rather than $\mathbb{E}_{\mathbf{x}}\left[\left(N(\mathbf{x})-f(\mathbf{x})\right)^{2}\right]$ . This is for simplicity and to make the setting complementary to our positive results from earlier (however, extending it to $L_{2}$ approximation results is not too difficult).

Theorem 5.

The following holds for some positive universal constants $c_{1},c_{2},c_{3},c_{4}$ , and any network employing an activation function satisfying Assumptions 1 and 2 in Eldan and Shamir (2016). Let $f(\mathbf{x})=\max\left\{0,-\left|\left|\mathbf{x}\right|\right|+1\right\}$ . For any $d>c_{1}$ , there exists a continuous probability distribution on $\mathbb{R}^{d}$ , such that for any $\epsilon>0$ , and any depth $2$ neural network $N$ satisfying $\left|\left|N(\mathbf{x})-f(\mathbf{x})\right|\right|_{L_{2}}\leq\epsilon$ and having width $w(d,1/\epsilon)$ , it must hold that

$$w\left(d,c_{2}d^{6}\right)\geq c_{3}\exp(c_{4}d).$$

In particular, depth $2$ networks of width $\textnormal{poly}(d,1/\epsilon)$ cannot approximate $f$ to accuracy $\epsilon$ .

3 Proofs

3.1 Proof of Thm. 1

We begin by stating the following theorem, which establishes the capability of exponential networks to approximate a particular radial function, which we denote by $F_{d}$ . Our construction for approximating $F_{d}$ uses random weights, resulting in a random network which is significantly easier to analyze when exponential activations are considered (basically, since the expected value of the exponential of a random variable $X$ is given by its moment generating function, which for many distributions is well-known and studied).

Theorem 6.

For an integer $k$ , define $k!!=\prod_{i=0}^{\left\lceil k/2\right\rceil-1}(k-2i)$ . For any $\epsilon>0$ and natural $d\geq 2$ there exists an exponential depth $2$ neural network $N(\mathbf{x})=\sum_{i=1}^{n}v_{i}\exp\left(\mathbf{w}_{i}^{\top}\mathbf{x}\right)$ on $\mathbb{R}^{d}$ , of width $n=\left\lceil\frac{36}{\epsilon^{2}}\right\rceil$ , hidden layer weights $\mathbf{w}_{i}$ satisfying $\mathbf{w}_{i}\in\mathbb{S}^{d-1}$ (the unit sphere), and $|v_{i}|\leq\frac{1}{n}$ , such that

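$$\sup_{\mathbf{x}\in B_{d}}\left|N(\mathbf{x})-F_{d}(\left|\left|\mathbf{x}\right|\right|)\right|\leq\epsilon,$$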

where for all $z\in[0,1]$ ,

$$F_{d}(z)=\sum_{k=0}^{\infty}\frac{(d-2)!!}{(2k)!!\,(d+2k-2)!!}z^{2k}.$$
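As a numerical illustration of the statement, the sketch below evaluates the series $F_{d}(z)=\sum_{k\geq 0}\frac{(d-2)!!}{(2k)!!\,(d+2k-2)!!}z^{2k}$ and compares it with a randomly drawn exponential network of width $\lceil 36/\epsilon^{2}\rceil$ . The helper names, the truncation of the series at 40 terms, the seed, and the test point are assumptions made for this example; the theorem itself only asserts that a suitable network exists. For $d=3$ the series reduces to $\sinh(z)/z$ , which gives a convenient sanity check.

```python
import math
import random


def double_factorial(k):
    """k!! as in Thm. 6: k * (k-2) * ... down to 1 or 2, with 0!! = 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def F_d(d, z, terms=40):
    """Truncated series for F_d(z); 40 terms is ample for z in [0, 1]."""
    total = 0.0
    for k in range(terms):
        coeff = double_factorial(d - 2) / (
            double_factorial(2 * k) * double_factorial(d + 2 * k - 2))
        total += coeff * z ** (2 * k)
    return total


def random_unit_vector(d, rng):
    """Uniform sample from the unit sphere via a normalized Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def random_exp_network(d, n, rng):
    """N(x) = (1/n) * sum_i exp(w_i^T x) with w_i uniform on the sphere."""
    weights = [random_unit_vector(d, rng) for _ in range(n)]

    def network(x):
        return sum(math.exp(sum(wi * xi for wi, xi in zip(w, x)))
                   for w in weights) / n

    return network


rng = random.Random(0)
d, eps = 5, 0.1
n = math.ceil(36 / eps ** 2)
net = random_exp_network(d, n, rng)
x = [0.6, 0.0, 0.8, 0.0, 0.0]            # a point of norm 1
gap = abs(net(x) - F_d(d, 1.0))          # deviation from F_d(||x||)
```

The value `gap` is the deviation at a single point only; the theorem concerns the supremum over the whole unit ball.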

The proof of Thm. 6 relies on the observation that by drawing $\mathbf{w}_{i}$ uniformly from the unit sphere, the neuron $\exp(\mathbf{w}_{i}^{\top}\mathbf{x})$ has an expected value equal to $F_{d}(\left|\left|\mathbf{x}\right|\right|)$ . Setting $v_{i}=\frac{1}{n}$ for all $i$ , and sampling each $\mathbf{w}_{i}$ independently, we have from concentration of measure that the resulting network $\frac{1}{n}\sum_{i=1}^{n}\exp(\mathbf{w}_{i}^{\top}\mathbf{x})$ gradually converges to this expected value, effectively approximating $F_{d}$ . Before we prove Thm. 6 however, we would need to evaluate the distribution of the dot product of such a random neuron with its input, as well as derive an equivalent representation of $F_{d}$ which we will encounter when proving the theorem. To this end, we have the following two lemmas:

Lemma 1.

Suppose $\mathbf{x}\in\mathbb{R}^{d}$ such that $\left|\left|\mathbf{x}\right|\right|=r$ , and suppose $W\in\mathbb{R}^{d}$ is distributed uniformly on the $d$ -dimensional unit sphere. Then the random variable $X=\frac{1}{2r}W^{\top}\mathbf{x}+\frac{1}{2}$ follows a $\textnormal{Beta}\left(\frac{d-1}{2},\frac{d-1}{2}\right)$ distribution.

Proof.

Since $W$ is invariant to orthogonal transformations, we may assume w.l.o.g. that $\mathbf{x}$ is of the form $\mathbf{x}=(r,0,\dots,0)$ . That is, $W^{\top}\mathbf{x}=W_{1}r$ , where $W_{1}$ is the first coordinate of $W$ . Therefore to determine the distribution of $W^{\top}\mathbf{x}$ , it suffices to compute the probability of $W_{1}r$ falling in the interval $\left[-r,t\right]$ for $t\in\left[-r,r\right]$ , or equivalently, $W_{1}$ falling in the interval $\left[-1,\frac{t}{r}\right]$ . Since $W$ is distributed uniformly on the unit sphere, this is proportional to the area of a hyperspherical cap centered at $\mathbf{a}=(-1,0,\ldots,0)$ , and defined by $\mathsf{S}^{d-1}(\mathbf{a},\theta)\coloneqq\left\{\mathbf{b}\in\mathbb{S}^{d-1}:\arccos(\langle\mathbf{a},\mathbf{b}\rangle)\leq\theta\right\}$ . This probability is given in terms of the regularized incomplete beta function as

[TABLE]

(Leopardi, 2007, Lemma 2.3.15.), where the spherical radius of the cap, $\theta$ , satisfies $\frac{t}{r}=\cos(\pi-\theta)=-\cos(\theta)$ . Elementary trigonometry reveals that under this condition, it must hold that $\sin^{2}\left(\frac{\theta}{2}\right)=\frac{t}{2r}+\frac{1}{2}$ , namely we have

[TABLE]

implying that

[TABLE]

i.e., by the change of variables $x=\frac{t}{2r}+\frac{1}{2}$ we have

[TABLE]

It follows immediately that $X$ is $\textnormal{Beta}\left(\frac{d-1}{2},\frac{d-1}{2}\right)$ distributed, concluding the proof of the lemma. ∎
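Lemma 1 can be sanity-checked numerically. The sketch below is only an illustration, not part of the proof: it samples $W$ uniformly from the unit sphere by normalizing a Gaussian vector, forms $X=\frac{1}{2r}W^{\top}\mathbf{x}+\frac{1}{2}$ , and compares the empirical mean and variance with those of a symmetric $\textnormal{Beta}(a,a)$ variable with $a=\frac{d-1}{2}$ , namely $\frac{1}{2}$ and $\frac{1}{4(2a+1)}=\frac{1}{4d}$ . The helper names (`sample_unit_sphere`, `beta_moments_check`), the dimension, the radius and the sample size are all assumptions made for the example.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Uniform sample on S^{d-1}: normalize a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]


def beta_moments_check(d, r, num_samples, seed=0):
    """Empirical mean and variance of X = W^T x / (2r) + 1/2 for ||x|| = r."""
    rng = random.Random(seed)
    x = [r] + [0.0] * (d - 1)  # w.l.o.g. x = (r, 0, ..., 0), as in the proof
    values = []
    for _ in range(num_samples):
        w = sample_unit_sphere(d, rng)
        dot = sum(wi * xi for wi, xi in zip(w, x))
        values.append(dot / (2 * r) + 0.5)
    mean = sum(values) / num_samples
    var = sum((v - mean) ** 2 for v in values) / num_samples
    return mean, var


d = 5
mean, var = beta_moments_check(d, r=0.7, num_samples=20000)
# Beta(a, a) with a = (d - 1) / 2 has mean 1/2 and variance 1 / (4d).
print(mean, var, 1 / (4 * d))
```

Matching the first two moments does not prove the distributional claim, but it is a cheap consistency check on the change of variables used in the lemma.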

Lemma 2.

We have

[TABLE]

Proof.

Letting $(x)_{k}=\prod_{j=0}^{k-1}(x+j)$ , we compute

[TABLE]

Since $(-1)^{n-k}=(-1)^{k}$ for even $n$ and $(-1)^{n-k}=-(-1)^{k}$ for odd $n$ , it suffices to show that for any natural $n$ and any integer $d\geq 2$ ,

[TABLE]

We rewrite

[TABLE]

where

[TABLE]

is the Gauss hypergeometric function. Using Euler’s integral formula for the Gauss hypergeometric function (Andrews et al., 1999, p. 65, Theorem 2.2.1) yields

[TABLE]

Simplifying the integral in Eq. (3), we substitute $t=0.5-x$ , $dt=-dx$ to get

[TABLE]

Clearly, the integrand in Eq. (4) is an odd function when $n$ is odd, therefore $a_{n}(d)=0$ for any odd $n$ . For even $n$ , integration by parts of $(0.25-x^{2})^{\frac{d-3}{2}}x$ and $x^{n-1}$ reveals that

[TABLE]

Recursively applying the relation in Eq. (5) yields

[TABLE]

Substituting $x=0.5-t$ back in the integral in Eq. (6) gives

[TABLE]

where $B(x,y)$ denotes the Beta function. Finally, substituting our calculations from Equations (7,6,4,3) in Eq. (2), and using the identities $\Gamma(z+1)=z\Gamma(z)$ which holds for any real $z\geq 0$ , and $\Gamma(z+1)=z!=z!!(z-1)!!$ which holds for any integer $z$ , we have

[TABLE]

∎

We are now ready to prove Thm. 6.

Proof of Thm. 6.

Consider a depth $2$ network of width $n$ , where $n$ is to be determined later, with exponential activations, zero bias terms in the hidden layer, equal weights of $1/n$ in the output neuron, and where the weights of each hidden neuron are sampled i.i.d. uniformly at random from the unit hypersphere $\mathbb{S}^{d-1}$ . Fix $r$ such that $\left|\left|\mathbf{x}\right|\right|=r$ , then we have from Lemma 1 that the network computes the random function

[TABLE]

where $X_{i}\sim\textnormal{Beta}(\frac{d-1}{2},\frac{d-1}{2})$ are i.i.d. Taking expectation in Eq. (8) yields

[TABLE]

Letting $t=2r$ gives

[TABLE]

Conveniently, the expectation in the right hand side of Eq. (9) is exactly the moment generating function of a $\textnormal{Beta}(\frac{d-1}{2},\frac{d-1}{2})$ random variable, given by

[TABLE]

(Gupta and Nadarajah, 2004). By virtue of Lemma 2, Eq. (9) therefore reduces to

[TABLE]

To convert the above expectation equality to a uniform convergence bound we shall use a Rademacher complexity argument. We have that the approximation error is

[TABLE]

This is equivalent to bounding the uniform convergence of the function class $\mathcal{F}:=\{W\mapsto\exp(W^{\top}\mathbf{x}):\mathbf{x}\in B_{d}\}$ , whose values are bounded in $[\exp(-1),\exp(1)]$ . By standard Rademacher complexity arguments, it is well-known that this is upper bounded by $\mathcal{O}(\sqrt{\log(1/\delta)/n})$ with probability at least $1-\delta$ . Specifically, letting $\phi(z):=\frac{\exp(z)-1}{\exp(1)-1}$ , we can rewrite Eq. (11) as

[TABLE]

Defining the function class $\mathcal{F}^{\prime}:=\{W\mapsto W^{\top}\mathbf{x}:\mathbf{x}\in B_{d}\}$ , we can upper bound the above (with probability at least $1-\delta$ over the sampling of $W_{1},\ldots,W_{n}$ ) by

[TABLE]

where $R_{n}(\phi\circ\mathcal{F}^{\prime}(W_{1},\ldots,W_{n})):=\mathbb{E}\sup_{f\in\mathcal{F}^{\prime}}\ \left|\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}\phi(f(W_{i}))\right|$ is the (empirical) Rademacher complexity of $\phi\circ\mathcal{F}^{\prime}$ , and the expectation is over $\sigma_{1},\ldots,\sigma_{n}$ which are sampled independently and uniformly from $\{-1,+1\}$ (see Boucheron et al. (2005, Theorem 3.2)). Since $W^{\top}\mathbf{x}$ takes values in $[-1,+1]$ , $\phi$ is $1$ -Lipschitz in that domain and $\phi(0)=0$ , we can upper bound the above by

[TABLE]

(see Boucheron et al. (2005, Theorem 3.3)). Finally, since $\mathcal{F}^{\prime}$ consists of $1$ -Lipschitz linear functions over the unit ball, we have that $R_{n}(\mathcal{F}^{\prime}(W_{1},\ldots,W_{n}))\leq\sqrt{1/n}$ (see Boucheron et al. (2005, Corollary 4.3)). Overall, we get that Eq. (11) is at most $(\exp(1)-1)\left(\frac{2}{\sqrt{n}}+\sqrt{\frac{2\log(2/\delta)}{n}}\right)$ . Picking $\delta=3/4$ , this can be upper bounded by $6/\sqrt{n}$ . In particular, this means that there exist some realizations of $W_{1},\ldots,W_{n}$ such that Eq. (11) is at most $6/\sqrt{n}$ . In other words, for any $\epsilon>0$ , if we set $n\geq\frac{36}{\epsilon^{2}}$ , we have a depth $2$ exponential network of width $n$ which approximates $F_{d}(\left|\left|\mathbf{x}\right|\right|)$ up to error $\epsilon$ .

∎
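The construction in the proof of Thm. 6 can be illustrated with a short simulation. The sketch below rests on stated assumptions: it fixes $d=3$ , where Lemma 1 gives $X\sim\textnormal{Beta}(1,1)$ , i.e. $W_{1}$ uniform on $[-1,1]$ , so that the expected neuron output is $\mathbb{E}[\exp(rW_{1})]=\sinh(r)/r$ ; this closed form is derived here for the example rather than quoted from the theorem. The function names and the test points are hypothetical, and a single random draw is only guaranteed to succeed with constant probability.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Uniform sample on S^{d-1}: normalize a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]


def random_exp_network(d, eps, seed=0):
    """Width n = ceil(36 / eps^2) network x -> (1/n) sum_i exp(<w_i, x>)."""
    n = math.ceil(36 / eps ** 2)
    rng = random.Random(seed)
    weights = [sample_unit_sphere(d, rng) for _ in range(n)]

    def network(x):
        total = 0.0
        for w in weights:
            total += math.exp(sum(wi * xi for wi, xi in zip(w, x)))
        return total / n

    return network, n


def expected_neuron_d3(r):
    """E[exp(r * W_1)] for d = 3, where W_1 is uniform on [-1, 1]."""
    return 1.0 if r == 0 else math.sinh(r) / r


net, width = random_exp_network(d=3, eps=0.1)
points = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.0, 0.6, 0.8], [0.3, -0.4, 0.5]]
worst = max(
    abs(net(x) - expected_neuron_d3(math.sqrt(sum(c * c for c in x))))
    for x in points
)
print(width, worst)
```

The check only inspects a handful of points, whereas the theorem bounds the supremum over $B_{d}$ ; it is meant to show the mechanism, not to verify the uniform bound.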

Albeit useful for establishing Thm. 6, exponential activations are uncommon in practice. To translate Thm. 6 to work with more commonly used activations, we utilize the universality of activations satisfying Assumption 1 to approximate an exponential function on a bounded domain to arbitrary accuracy, resulting in a network approximating $F_{d}$ for a wide family of activation functions. More formally, we have the following theorem:

Theorem 7.

Suppose $\sigma:\mathbb{R}\to\mathbb{R}$ is an activation satisfying Assumption 1. Then for any $\epsilon>0$ and natural $d\geq 2$ there exists a depth $2$ neural network $N:\mathbb{R}^{d}\to\mathbb{R}$ with $\sigma$ activations of width at most $c_{\sigma}\epsilon^{-3}$ , satisfying $\sup_{\mathbf{x}\in B_{d}}\left|N(\mathbf{x})-F_{d}\left(\left|\left|\mathbf{x}\right|\right|\right)\right|\leq\epsilon$ , where $c_{\sigma}>0$ depends solely on $\sigma$ .

Proof of Thm. 7.

First, invoke Thm. 6 to obtain a width $n=\left\lceil 144\epsilon^{-2}\right\rceil$ exponential network satisfying

[TABLE]

Next, using Assumption 1, we obtain a depth $2$ $\sigma$ network approximating the exponential $z\mapsto\exp(z)$ on the unit interval $[0,1]$ , having width at most $c_{\sigma}e/\epsilon$ . Denoting this network by $N_{\exp}$ , we construct a $\sigma$ network approximating $F_{d}$ as follows: For each hidden weight $\mathbf{w}_{i}$ of $N$ , we take a copy of $N_{\exp}$ and feed it with $\langle\mathbf{w}_{i},\mathbf{x}\rangle$ to obtain $N_{\exp}(\langle\mathbf{w}_{i},\mathbf{x}\rangle)$ . Note that $N_{\exp}(\langle\mathbf{w}_{i},\mathbf{x}\rangle)$ is a depth $2$ $\sigma$ network since the linear transformation $\langle\mathbf{w}_{i},\mathbf{x}\rangle$ can be simulated by modifying the hidden layer of $N_{\exp}$ to compute it exactly. We define the network $\tilde{N}(\mathbf{x})=\frac{1}{n}\sum_{i=1}^{n}N_{\exp}(\langle\mathbf{w}_{i},\mathbf{x}\rangle)$ , which is also a depth $2$ network, of width $c_{\sigma}\epsilon^{-3}$ , as a weighted combination of networks (absorbing any absolute constants into $c_{\sigma}$ ). We now compute using Eq. (12) for any $\mathbf{x}\in B_{d}$

[TABLE]

where we note that the boundedness of the weights of the hidden layer of $N$ and the Cauchy-Schwarz inequality guarantee that we remain in the relevant approximation domain of $\exp(\cdot)$ , as $\langle\mathbf{w}_{i},\mathbf{x}\rangle\leq\left|\left|\mathbf{w}_{i}\right|\right|\cdot\left|\left|\mathbf{x}\right|\right|\leq 1$ . ∎
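To make the role of $N_{\exp}$ concrete, here is a minimal sketch of one way a depth $2$ network can approximate $\exp$ on a bounded interval. It assumes ReLU as the activation (chosen for illustration; the proof only uses Assumption 1) and uses piecewise-linear interpolation on $[-1,1]$ , an interval containing every value $\langle\mathbf{w}_{i},\mathbf{x}\rangle$ for unit-norm $\mathbf{w}_{i}$ and $\mathbf{x}\in B_{d}$ . The names `build_relu_exp` and `n_exp` and the number of segments are hypothetical.

```python
import math


def relu(z):
    return max(z, 0.0)


def build_relu_exp(m):
    """ReLU network interpolating exp at m + 1 equally spaced knots on [-1, 1].

    Returns (knots, coeffs, bias) such that
    n_exp(z) = bias + sum_j coeffs[j] * relu(z - knots[j]).
    """
    knots = [-1.0 + 2.0 * j / m for j in range(m + 1)]
    vals = [math.exp(t) for t in knots]
    slopes = [(vals[j + 1] - vals[j]) / (knots[j + 1] - knots[j]) for j in range(m)]
    # Each hidden neuron adds the change in slope at its knot.
    coeffs = [slopes[0]] + [slopes[j] - slopes[j - 1] for j in range(1, m)]
    return knots[:m], coeffs, vals[0]


def n_exp(z, knots, coeffs, bias):
    return bias + sum(c * relu(z - t) for t, c in zip(knots, coeffs))


knots, coeffs, bias = build_relu_exp(32)
grid = [-1.0 + 2.0 * i / 1000 for i in range(1001)]
worst = max(abs(n_exp(z, knots, coeffs, bias) - math.exp(z)) for z in grid)
# Linear interpolation error is at most h^2 * max|exp''| / 8 with h = 2 / 32.
print(len(knots), worst)
```

Feeding $\langle\mathbf{w}_{i},\mathbf{x}\rangle$ into such a network only changes the hidden-layer weights from $1$ to $\mathbf{w}_{i}$ , which is why the composition stays at depth $2$ .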

Thm. 7 allows us to approximate the family of functions $F_{d}(\left|\left|\cdot\right|\right|)$ efficiently using depth $2$ networks with a variety of activations. The following theorem utilizes Thm. 7 to approximate even radial monomials, i.e., radial functions of the form $\left|\left|\mathbf{x}\right|\right|^{2k}$ for some natural $k$ .

Theorem 8.

Suppose $\sigma:\mathbb{R}\to\mathbb{R}$ satisfies Assumption 1. Then for any $\epsilon>0$ and any natural $k\geq 1$ , there exists a depth $2$ neural network with $\sigma$ activations of width $n=\exp\left(\mathcal{O}\left(k^{2}\log\left(d/\epsilon\right)\right)\right)$ satisfying

[TABLE]

where the big O notation hides a constant that depends solely on $\sigma$ .

Interestingly, apart from its role in proving Thm. 1, Thm. 8 also shows the existence of a family of functions that are approximable to accuracy $\epsilon$ using width polynomial in both $d$ and $1/\epsilon$ ; for any fixed $k$ , the radial polynomial $\left|\left|\mathbf{x}\right|\right|^{2k}$ can be approximated by a width $\exp\left(\mathcal{O}\left(k^{2}\log\left(d/\epsilon\right)\right)\right)=\textnormal{poly}(d,1/\epsilon)$ network. Before delving into the proof of Thm. 8, however, we will first need the following lemma, which will utilize the power-series representation of $F_{d}$ to approximate polynomials of even degree. Note that at this point, the question of approximation is now reduced to a one dimensional problem, since approximating a radial $\varphi(\left|\left|\mathbf{x}\right|\right|)$ using linear combinations of $F_{d}(\left|\left|\mathbf{x}\right|\right|)$ is equivalent to approximating $\varphi(z)$ using linear combinations of $F_{d}(z)$ .

Lemma 3.

Suppose $f(z)=\sum_{k=0}^{\infty}\alpha_{2k}z^{2k}$ converges uniformly for all $z\in[0,1]$ , where $\alpha_{2k}\neq 0$ is non increasing. Then for sufficiently small $\epsilon>0$ and any $n>0$ , there exist $b_{0},\dots,b_{n},c_{1},\ldots,c_{n}\in\mathbb{R}$ , and a universal constant $c>0$ such that

[TABLE]

where $\left|c_{k}\right|\leq 1$ and $\left|b_{k}\right|\leq\alpha_{2n}^{-1}\left(\frac{2}{\epsilon}\right)^{cn^{2}}$ for all $k\in\left\{0,\dots,n\right\}$ .

The proof of Lemma 3 relies on the observation that taking an appropriately chosen linear combination of the form $f(\eta z),f(\eta^{2}z),\dots,f(\eta^{n}z)$ for some $\eta>0$ and presenting it as a power-series, results in all the coefficients of $z^{2k}$ for $k<n$ being exactly zero, the coefficient of $z^{2n}$ being $1$ , and the remaining coefficients all decaying rapidly to $0$ as $\eta\to 0$ .

Proof.

Let $p(z)=\sum_{k=0}^{n}p_{2k}z^{2k}$ be some even polynomial, and consider the set of functions

[TABLE]

These have the following expansions:

[TABLE]

Equating the coefficients $t_{2i}$ , $i=1,2,\dots$ in $\sum_{i=1}^{\infty}t_{2i}z^{2i}$ , the expansion of $\sum_{k=1}^{n}b_{k}f(\eta^{k}z)$ , to the coefficients of $p(z)$ , we obtain the matrix equality

[TABLE]

where $A=\text{diag}(\boldsymbol{\alpha})$ is a diagonal matrix with the coefficients $\boldsymbol{\alpha}=\alpha_{2},\dots,\alpha_{2n}$ on its main diagonal, $V(\eta)$ is the Vandermonde matrix given by

[TABLE]

$\mathbf{b}=\left(b_{1},\dots,b_{n}\right)$ and $\mathbf{p}=\left(p_{2},\dots,p_{2n}\right)$ . Since $\alpha_{2k}\neq 0$ , and since $V(\eta)$ is invertible for small enough $\eta$ , Eq. (14) can be rearranged to

[TABLE]

Letting $b_{0}=\alpha_{0}\sum_{k=1}^{n}b_{k}$ , we have that the coefficients $t_{2i}$ up to degree $2n$ agree with $p(z)$ , thus to establish Eq. (13), it remains to bound the tail of the expansion for degrees $>2n$ . To this end, we will first bound each $b_{k}$ for $k=1,\dots,n$ . We have from Hölder’s inequality for all $b_{k}$

[TABLE]

where $V(\eta)_{k}^{-1}$ is the $k$ -th row of $V(\eta)^{-1}$ , given by

[TABLE]

(Macon and Spitzbart, 1958). Bounding $V(\eta)_{ij}^{-1}$ , we begin with the denominator to obtain for $\eta^{-1}\geq n$ that if $i<m$ then

[TABLE]

Otherwise, if $i>m$ then

[TABLE]

Therefore

[TABLE]

Thus

[TABLE]

For the numerator we have

[TABLE]

which also holds for $j=n$ . Hence we have

[TABLE]

implying the $1$ -norm of the $k$ -th row is upper bounded by

[TABLE]

Combining the above with Eq. (15), we obtain the upper bound for some $c_{2}>0$

[TABLE]

In general, the coefficient of the term $z^{2i}$ for $i>n$ in the expansion of $b_{0}+\sum_{k=1}^{n}b_{k}f(\eta^{k}z)$ is given by

[TABLE]

Taking the absolute value and combining with Eq. (3.1), we get

[TABLE]

Finally, letting $\eta=\min\left\{0.5,1/n,\epsilon/8e\right\}$ (note that this also entails $c_{k}=\eta^{k}\leq 1$ ), we can bound the tail as follows

[TABLE]

∎
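The cancellation mechanism behind Lemma 3 can be seen on a small instance. The following sketch is an illustration under assumptions, not the lemma's exact construction: it takes $f=\cosh$ (an even power series with coefficients $\alpha_{2i}=1/(2i)!$ ), $n=2$ and $\eta=0.1$ , solves the linear system that matches the coefficients of $z^{2},\dots,z^{2n}$ to those of the monomial $z^{2n}$ , and picks the constant $b_{0}$ so that the constant term of the combination vanishes. The helper names and the small Gaussian elimination routine are hypothetical.

```python
import math


def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    a = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = a[i][n] - sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / a[i][i]
    return x


def monomial_from_scalings(alpha, n, eta):
    """Return [b_0, ..., b_n] with b_0 + sum_k b_k f(eta^k z) close to z^(2n).

    alpha(i) is the coefficient of z^(2i) in the even power series f.
    """
    matrix = [[alpha(i) * eta ** (2 * i * k) for k in range(1, n + 1)]
              for i in range(1, n + 1)]
    rhs = [0.0] * (n - 1) + [1.0]
    b = solve_linear(matrix, rhs)
    b0 = -alpha(0) * sum(b)  # chosen here to cancel the constant term
    return [b0] + b


def alpha_cosh(i):
    return 1.0 / math.factorial(2 * i)


n, eta = 2, 0.1
coeffs = monomial_from_scalings(alpha_cosh, n, eta)


def approx(z):
    return coeffs[0] + sum(coeffs[k] * math.cosh(eta ** k * z) for k in range(1, n + 1))


worst = max(abs(approx(i / 100) - (i / 100) ** (2 * n)) for i in range(101))
print(coeffs, worst)
```

The coefficients $b_{k}$ come out very large in magnitude, which matches the role of the bound on $\left|b_{k}\right|$ in the lemma: accuracy is bought with large output weights.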

With the help of Lemma 3, we now turn to prove Thm. 8.

Proof of Thm. 8.

First, note that the family of functions $F_{d}(z)$ satisfy the assumptions in Lemma 3 for any $d\geq 2$ , as readily seen by their definition. Now, letting

[TABLE]

we obtain from Lemma 3 that

[TABLE]

for coefficients $\left|c_{k}\right|\leq 1$ and $b_{k}$ satisfying

[TABLE]

To bound $d(d+2)\ldots(d+2n-2)$ , observe that $(d+2n-2)\leq d^{\log_{2}n+1}$ for any $n\geq 1$ and $d\geq 2$ , thus

[TABLE]

Plugging the above in Eq. (3.1) yields

[TABLE]

It now remains to approximate the function $F(\mathbf{x})=b_{0}+\sum_{k=1}^{n}b_{k}F_{d}\left(c_{k}\left|\left|\mathbf{x}\right|\right|\right)$ to accuracy $\epsilon/2$ (note that the constant term $b_{0}$ is trivial, as it is easily simulated with a constant neuron). To this end, invoke Thm. 7 with a desired accuracy of $\frac{\epsilon}{2n\left|b_{n}\right|}$ , to obtain a network $N$ approximating $F_{d}(\left|\left|\mathbf{x}\right|\right|)$ . We stress that such approximation of $F_{d}(\left|\left|\mathbf{x}\right|\right|)$ is obtained for any $\mathbf{x}\in B_{d}$ , and since we have $\left|c_{k}\right|\leq 1$ we are guaranteed to remain in the relevant domain. Taking $n$ such copies of $N$ , we obtain a width $8c_{\sigma}n^{3}\left|b_{n}\right|^{3}\epsilon^{-3}=\exp\left(\mathcal{O}\left(n^{2}\log\left(d/\epsilon\right)\right)\right)$ network

[TABLE]

approximating $F(\mathbf{x})$ , since

[TABLE]

Combining Equations (17) and (19), we conclude that

[TABLE]

∎

Before we can prove Thm. 1, it remains to prove the following lemma, which establishes quantitative bounds on the ability of even polynomials of degree $n$ to approximate arbitrary $1$ -Lipschitz functions in $[0,1]$ , while having bounded coefficients. More formally, we have the following lemma:

Lemma 4.

Let $f:[0,1]\to\mathbb{R}$ be a $1$ -Lipschitz function. Then for any $\epsilon>0$ , there exists an even polynomial $p$ of degree $n=2\left\lceil 4\epsilon^{-3}\right\rceil$ such that

[TABLE]

and where the coefficients of $p$ , denoted $p_{2},p_{4},\ldots,p_{n}$ are upper bounded by $2^{n}$ .

We remark that the $-3$ exponent in the result can possibly be improved somewhat, but this will not change the exponential dependence on $1/\epsilon$ in our main theorem.

The following proof follows along a similar line as the proof provided by S. Bernstein for Weierstrass’ approximation theorem (see Koralov and Sinai (2007, Thm. 2.7) for the proof), albeit we also bound the magnitude of the coefficients of the approximating polynomial.

Proof.

Let $f:[0,1]\to\mathbb{R}$ be $1$ -Lipschitz. First, by approximating $f(z)-f(0.5)$ instead, we may assume w.l.o.g. that $f(0.5)=0$ (adding the zero degree polynomial $f(0.5)$ to our approximation once obtained). Extend $f$ to an even function on $[-1,1]$ given by

[TABLE]

Letting $g(z)=f(2z-1)$ , we linearly shift $f$ to the unit interval where $g(z)$ is $2$ -Lipschitz. Define the $n+1$ Bernstein basis polynomials of degree $n$ as

[TABLE]

It is a well known fact that these polynomials form a partition of unity for any $n$ :

[TABLE]

Define the $n$ -th Bernstein polynomial approximation of $g$ as

[TABLE]

We compute using Eq. (20)

[TABLE]

Since $g$ is $2$ -Lipschitz, we have that $\left|\frac{\nu}{n}-z\right|<\frac{\epsilon}{4}$ implies $\left|g(\frac{\nu}{n})-g(z)\right|<\frac{\epsilon}{2}$ , thus (21) is upper bounded by

[TABLE]

Recalling that $g(0.25)=g(0.75)=0$ , we have from Lipschitzness that $\sup_{z\in[0,1]}\left|g(z)\right|\leq 0.5$ . Therefore (22) is upper bounded by

[TABLE]

Observe that Eq. (23) is exactly $\mathbb{P}\left[\left|\frac{X_{n}}{n}-z\right|\geq\frac{\epsilon}{4}\right]$ , where $X_{n}\sim B(n,z)$ is binomially distributed. Using Chebyshev’s inequality we obtain

[TABLE]

Letting $n=2\left\lceil 4\epsilon^{-3}\right\rceil$ entails (22) is upper bounded by $\frac{\epsilon}{2}$ , yielding

[TABLE]

or equivalently by changing $z=\frac{1+t}{2}$ ,

[TABLE]

Denote $p(t)=\sum_{\nu=0}^{n}f\left(\frac{2\nu}{n}-1\right)b_{\nu,n}\left(\frac{1+t}{2}\right)$ . We shall now bound the coefficients of the approximating polynomial $p(t)$ . We have

[TABLE]

To upper bound the coefficients, observe that taking the absolute value of $f\left(\frac{2\nu}{n}-1\right)$ and substituting $1-t$ with $1+t$ will result in a polynomial with only positive coefficients, upper bounding the ones of $p(t)$ . Therefore

[TABLE]

Clearly, the coefficients of $\frac{1}{2}(1+t)^{n}$ are upper bounded by $2^{n}$ . Finally, consider the even polynomial

[TABLE]

Its even coefficients are equal to those of $p$ and are thus bounded by $2^{n}$ . Moreover, we have

[TABLE]

By virtue of $f$ being even we have $f(t)=\frac{1}{2}\left(f(t)+f(-t)\right)$ , and by Equations (24) and (25) we get for any $t\in[-1,1]$

[TABLE]

concluding the proof of the lemma. ∎
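The Bernstein construction used in the proof can be tried out directly. The sketch below is illustrative: the test function $f(t)=|t|-\frac{1}{2}$ (even, $1$ -Lipschitz, with $f(0.5)=0$ ) and the accuracy $\epsilon=0.5$ are assumptions picked for the example, and the helper names are hypothetical. It builds the degree $n=2\left\lceil 4\epsilon^{-3}\right\rceil$ Bernstein approximation of $g(z)=f(2z-1)$ on a grid and also checks the partition-of-unity property of the basis.

```python
import math


def bernstein_basis(nu, n, z):
    """Bernstein basis polynomial: C(n, nu) * z^nu * (1 - z)^(n - nu)."""
    return math.comb(n, nu) * z ** nu * (1 - z) ** (n - nu)


def bernstein_approx(g, n, z):
    """n-th Bernstein polynomial of g evaluated at z in [0, 1]."""
    return sum(g(nu / n) * bernstein_basis(nu, n, z) for nu in range(n + 1))


def f(t):
    """Illustrative even 1-Lipschitz function on [-1, 1] with f(0.5) = 0."""
    return abs(t) - 0.5


def g(z):
    """f shifted to the unit interval; g is 2-Lipschitz."""
    return f(2 * z - 1)


eps = 0.5
n = 2 * math.ceil(4 * eps ** -3)  # degree used in Lemma 4
grid = [i / 200 for i in range(201)]
worst = max(abs(bernstein_approx(g, n, z) - g(z)) for z in grid)
unity_gap = max(
    abs(sum(bernstein_basis(nu, n, z) for nu in range(n + 1)) - 1.0) for z in grid
)
print(n, worst, unity_gap)
```

On this example the observed error sits well inside the guarantee, consistent with the remark that the $-3$ exponent is not tight.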

We are finally ready to prove Thm. 1.

Proof of Thm. 1.

From Lemma 4, we have an even polynomial $p(z)=\sum_{k=0}^{n/2}p_{2k}z^{2k}$ of degree $n=2\left\lceil 32\epsilon^{-3}\right\rceil$ , such that

[TABLE]

thus also

[TABLE]

Invoke Thm. 8 $\frac{n}{2}$ times to approximate each of $\left|\left|\mathbf{x}\right|\right|^{2},\left|\left|\mathbf{x}\right|\right|^{4},\ldots,\left|\left|\mathbf{x}\right|\right|^{n}$ to accuracy $\frac{\epsilon}{n2^{n}}$ , using $\frac{n}{2}$ depth $2$ networks $N_{k}$ , $k=1,2,\ldots,n/2$ , with $\sigma$ activations of width $w\leq c_{\sigma}\left(\frac{2nd2^{n}}{\epsilon}\right)^{\mathcal{O}(n^{2})}=c_{\sigma}\left(\frac{2d}{\epsilon}\right)^{\mathcal{O}(n^{3})}$ . We thus obtain for any $k\in\left\{1,2,\ldots,n/2\right\}$

[TABLE]

Consider the depth $2$ $\sigma$ network $N$ concatenating the networks $p_{2k}\cdot N_{k}$ , having output bias of $p_{0}$ and having width

[TABLE]

We compute for any $\mathbf{x}\in B_{d}$

[TABLE]

From Equations (26) and (27), the above is upper bounded by

[TABLE]

The proof of Thm. 1 is complete. ∎

3.2 Proof of Thm. 3

Let $f(\mathbf{x})=\varphi(\left|\left|\mathbf{x}\right|\right|)$ be $1$ -Lipschitz on $B_{d}$ . By setting the bias term of the output neuron of the approximating depth $2$ network to $b_{0}=f(\mathbf{0})$ , we may assume w.l.o.g. that $f(\mathbf{0})=0$ to begin with. Moreover, since we do not care about the approximation attained on $\mathbb{R}^{d}\setminus B_{d}$ , we may set $f(\mathbf{x})=0$ for any $\mathbf{x}\in\mathbb{R}^{d}\setminus B_{d}$ .

Now, instead of uniformly approximating $f$ directly, we can approximate a smoothed $\epsilon/2$ -approximation of it attained by $g=f\star\gamma_{\epsilon^{2}/4d}$ , where $\star$ is the convolution operation and $\gamma_{\epsilon^{2}/4d}$ is the Gaussian density function with mean $\mathbf{0}$ and covariance matrix $\frac{\epsilon^{2}}{4d}I$ . Equivalently, we can define $g$ as

[TABLE]

where $\mathbf{z}$ is distributed according to $\gamma_{\epsilon^{2}/4d}$ . We note that this is a uniform $\epsilon/2$ approximation of $f$ , since

[TABLE]

Since smooth functions have well-behaved Fourier transforms, this will make the use of Thm. 2 much more convenient. We thus have that attaining a uniform $\epsilon/2$ -approximation of $g$ on $B_{d}$ will suffice to finish the proof.
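The smoothing step can be illustrated by simulation. The sketch below makes simplifying assumptions: it uses $f(\mathbf{x})=\left|\left|\mathbf{x}\right|\right|$ , treated as $1$ -Lipschitz on all of $\mathbb{R}^{d}$ (so the truncation outside $B_{d}$ is ignored), and estimates both $g(\mathbf{x})-f(\mathbf{x})$ and $\mathbb{E}\left|\left|\mathbf{z}\right|\right|$ for $\mathbf{z}\sim\gamma_{\epsilon^{2}/4d}$ . Since $\mathbb{E}\left|\left|\mathbf{z}\right|\right|\leq\sqrt{\mathbb{E}\left|\left|\mathbf{z}\right|\right|^{2}}=\epsilon/2$ , both quantities should stay below $\epsilon/2$ . The function name `smoothing_gap` and all numeric choices are hypothetical.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def smoothing_gap(f, x, eps, num_samples, seed=0):
    """Monte Carlo estimates of g(x) - f(x) and of E||z||.

    Here g(x) = E[f(x + z)] with z ~ N(0, (eps^2 / (4 d)) I).
    """
    d = len(x)
    sigma = eps / (2.0 * math.sqrt(d))  # per-coordinate standard deviation
    rng = random.Random(seed)
    fx = f(x)
    gap_sum, norm_sum = 0.0, 0.0
    for _ in range(num_samples):
        z = [rng.gauss(0.0, sigma) for _ in range(d)]
        gap_sum += f([xi + zi for xi, zi in zip(x, z)]) - fx
        norm_sum += norm(z)
    return gap_sum / num_samples, norm_sum / num_samples


eps = 0.2
x = [0.5] + [0.0] * 9  # a point in B_10
gap, mean_norm = smoothing_gap(norm, x, eps, num_samples=5000)
print(gap, mean_norm, eps / 2)
```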

We begin by upper bounding $\left|\left|f\right|\right|_{L_{1}}$ . Since $f(\mathbf{0})=0$ and $f$ is $1$ -Lipschitz, we have that $\left|\left|f\right|\right|_{L_{1}}\leq\int_{B_{d}}d\mathbf{x}=\mu_{d}(B_{d})$ , where $\mu_{d}$ denotes the $d$ -dimensional Lebesgue measure. Consequently, since an $L_{1}$ upper bound implies a similar upper bound on the $L_{\infty}$ norm of the Fourier transform, we have that $\hat{f}(\omega)\leq\mu_{d}(B_{d})$ for any $\omega\in\mathbb{R}^{d}$ . Since the Fourier transform of a Gaussian pdf is another Gaussian with inverse variance, we have from the convolution-multiplication theorem that $\hat{g}(\omega)=\hat{f}(\omega)\exp\left(-\epsilon^{2}\left|\left|\omega\right|\right|^{2}/8d\right)$ , for all $\omega\in\mathbb{R}^{d}$ . We thus compute

[TABLE]

where Eq. (28) is due to the absolute moments of a normal variable $X$ with mean $0$ and standard deviation $\sigma$ satisfying $\mathbb{E}\left[\left|X\right|^{d+1}\right]=\sigma^{d+1}\frac{2^{(d+1)/2}\Gamma\left(d/2+1\right)}{\sqrt{\pi}}$ (see Winkelbauer (2012, Eq. (18))), Eq. (29) is due to $\mu_{d}(B_{d})=\frac{\pi^{d/2}}{\Gamma(d/2+1)}$ and $\mu_{d-1}(\mathbb{S}_{1}^{d-1})=\frac{2\pi^{d/2}}{\Gamma(d/2)}$ , and Eq. (30) is due to the inequality $\Gamma(z)\geq\left(\frac{z}{e}\right)^{z-1}$ .
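The closed-form ingredients quoted above are easy to spot-check in low dimension. The sketch below evaluates the ball volume, the sphere surface area, and the absolute normal moment formula (written for a general exponent $p$ , with $p=d+1$ recovering the expression in the text) against values known independently, and tests the Gamma lower bound on a grid of arguments $z>1$ . The range of $z$ is an assumption of this example, since the text does not state the domain, and the function names are hypothetical.

```python
import math


def ball_volume(d):
    """Lebesgue measure of the unit ball B_d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def sphere_area(d):
    """(d-1)-dimensional surface measure of the unit sphere S^{d-1}."""
    return 2 * math.pi ** (d / 2) / math.gamma(d / 2)


def normal_abs_moment(p, sigma):
    """E|X|^p for X ~ N(0, sigma^2); p = d + 1 gives the formula in the text."""
    return sigma ** p * 2 ** (p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)


def gamma_lower_bound_holds(z):
    """Check Gamma(z) >= (z / e)^(z - 1) at a single point."""
    return math.gamma(z) >= (z / math.e) ** (z - 1)


print(ball_volume(2), math.pi)           # area of the unit disc
print(ball_volume(3), 4 * math.pi / 3)   # volume of the unit ball in R^3
print(sphere_area(3), 4 * math.pi)       # surface area of the unit sphere in R^3
print(normal_abs_moment(2, 1.5), 1.5 ** 2)  # second moment equals sigma^2
print(all(gamma_lower_bound_holds(i / 2) for i in range(3, 101)))
```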

We now split our analysis into two cases, depending on the value of $c$ , the constant guaranteed from Thm. 2. In both cases we will need the following:

Claim 1.

We have

[TABLE]

The claim is a straightforward result derived by computing the partial derivatives of the left hand side and showing it is monotonically decreasing for any $x$ and $d$ in its domain, and therefore its proof is omitted.

Begin with assuming $c\leq 1$ . Then substituting $n=8dv_{g,2}^{3}/\epsilon^{3}$ in Eq. (1), we get

[TABLE]

where the last inequality is due to Claim 1 and the assumption that $v_{g,2}/\epsilon\geq 1.5$ , which will always hold for small enough $\epsilon>0$ since $v_{g,2}$ is always finite and positive for a non-constant $g=f\star\gamma_{\epsilon^{2}/4d}$ .

For the second case, assume $c>1$ . Then choosing $n=8dc^{3}v_{g,2}^{3}/\epsilon^{3}$ we similarly have

[TABLE]

where likewise, the last inequality uses Claim 1 and the assumptions that $v_{g,2}/\epsilon\geq 1.5$ and $c>1$ .

We conclude using Eq. (31) that $g$ can be $\epsilon/2$ -approximated using a depth $2$ ReLU network of width

[TABLE]

completing the proof of Thm. 3.

3.3 Proof of Thm. 4

Our proof essentially reduces the assumptions in the theorem statement to those of Daniely (2017, Example 2), who showed that any depth $2$ ReLU network which approximates the non-radial function $\sin\left(\pi d^{3}\langle\mathbf{x}_{1},\mathbf{x}_{2}\rangle\right)$ to an expected accuracy of at most $\frac{1}{50\exp(2)\pi^{2}}$ with respect to the uniform distribution on $\mathbb{S}^{d-1}\times\mathbb{S}^{d-1}$ , while having weights bounded by $2^{d}$ , necessarily has width at least $2^{\Omega(d\log d)}$ .

Suppose that $f$ is approximable to accuracy $\epsilon$ using a depth $2$ network $N$ of width $w(d,1/\epsilon)$ , having weights bounded by $\frac{2^{d+1}}{2\pi d^{3}}$ , i.e., suppose that

[TABLE]

Then in particular, we can choose $\epsilon=\frac{1}{101\exp(2)\pi^{3}d^{3}}$ to have a width $w(d,101\exp(2)\pi^{3}d^{3})$ network satisfying

[TABLE]

Now, let $\tilde{f}(\mathbf{x})=2\pi d^{3}f(\mathbf{x})$ , and let $\tilde{N}(\mathbf{x})=2\pi d^{3}N(\mathbf{x})$ , which is also a depth $2$ neural network with weights bounded by $2^{d+1}$ , since the scaling factor of $2\pi d^{3}$ can be simulated by multiplying the weights of the output neuron of $N$ by $2\pi d^{3}$ . We have using Eq. (32) that

[TABLE]

By taking a network $N^{\prime}$ which is identical to $\tilde{N}$ except for having its first layer weights (excluding bias terms) halved, i.e. bounded by $2^{d}$ , we have

[TABLE]

which implies that

[TABLE]

as well as

[TABLE]

since further restricting the domain cannot increase the supremum. Now, observe that

[TABLE]

Note that approximating a function $f$ and approximating its additive inverse $-f$ using a neural network is equivalent (simply invert the weights of the output neuron), thus we can w.l.o.g. ignore the $(-1)^{d}$ term in the above. Plugging Eq. (34) in Eq. (33) we obtain

[TABLE]

Finally, let $N^{\prime\prime}:\mathbb{R}^{2d}\to\mathbb{R}$ be the network obtained from $N^{\prime}$ by duplicating its first layer weights excluding biases, i.e. $\mathbf{w}_{i}\mapsto(\mathbf{w}_{i},\mathbf{w}_{i})$ , thus we have that $N^{\prime\prime}((\mathbf{x}_{1},\mathbf{x}_{2}))=N^{\prime}(\mathbf{x}_{1}+\mathbf{x}_{2})$ for any $\mathbf{x}_{1},\mathbf{x}_{2}\in\mathbb{S}^{d-1}$ . Plugging this in Eq. (35) we obtain

[TABLE]

That is, Eq. (36) establishes the existence of a width $w(d,101\exp(2)\pi^{3}d^{3})$ , depth $2$ ReLU network having weights bounded by $2^{d}$ , which uniformly approximates $\sin\left(\pi d^{3}\langle\mathbf{x}_{1},\mathbf{x}_{2}\rangle\right)$ on $\mathbb{S}^{d-1}\times\mathbb{S}^{d-1}$ (and in particular, provides such expected accuracy with respect to the uniform distribution on $\mathbb{S}^{d-1}\times\mathbb{S}^{d-1}$ ). By Daniely (2017, Example 2), this implies that $w(d,101\exp(2)\pi^{3}d^{3})\geq 2^{\Omega(d\log d)}$ , concluding the proof of Thm. 4.
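Two elementary facts carry the reduction above: duplicating the first-layer weights turns a network on $\mathbb{R}^{d}$ into one on $\mathbb{R}^{2d}$ that evaluates the original at $\mathbf{x}_{1}+\mathbf{x}_{2}$ , and for unit vectors $\left|\left|\mathbf{x}_{1}+\mathbf{x}_{2}\right|\right|^{2}=2+2\langle\mathbf{x}_{1},\mathbf{x}_{2}\rangle$ , so a radial function of the sum is a function of the inner product. The sketch below checks both on a small random ReLU network; the network representation as `(w, b, v)` triples, the helper names and the sizes are assumptions made for the illustration.

```python
import math
import random


def relu(z):
    return max(z, 0.0)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def depth2_relu(hidden, x):
    """hidden is a list of (w, b, v); computes sum_i v_i * relu(<w_i, x> + b_i)."""
    return sum(v * relu(dot(w, x) + b) for w, b, v in hidden)


def duplicate_first_layer(hidden):
    """Map each w_i to (w_i, w_i), leaving biases and output weights unchanged."""
    return [(list(w) + list(w), b, v) for w, b, v in hidden]


def random_unit_vector(d, rng):
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    s = math.sqrt(dot(g, g))
    return [c / s for c in g]


rng = random.Random(0)
d, width = 4, 6
hidden = [
    ([rng.uniform(-1, 1) for _ in range(d)], rng.uniform(-1, 1), rng.uniform(-1, 1))
    for _ in range(width)
]
x1, x2 = random_unit_vector(d, rng), random_unit_vector(d, rng)
summed = [a + b for a, b in zip(x1, x2)]

lhs = depth2_relu(duplicate_first_layer(hidden), x1 + x2)  # x1 + x2 concatenates lists
rhs = depth2_relu(hidden, summed)
norm_sq = dot(summed, summed)
print(lhs, rhs, norm_sq, 2 + 2 * dot(x1, x2))
```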

3.4 Proof of Thm. 5

In this proof, we utilize the measure on $\mathbb{R}^{d}$ used in Eldan and Shamir (2016) for their lower bound, whose density is given by the square of

[TABLE]

where $R_{d}=\frac{1}{\sqrt{\pi}}\left(\Gamma\left(\frac{d}{2}+1\right)\right)^{1/d}$ , and $J_{\nu}(z)$ is the Bessel function of the first kind, of order $\nu$ (see reference above for further information about these functions).

Suppose we have $N$ as in the theorem statement. Define $h_{r}(z)=\mathbbm{1}\left\{z\leq r\right\}$ and

[TABLE]

for some parameter $r$ . That is, $f_{r,\delta}$ is a $\frac{1}{\delta}$ -Lipschitz approximation of the indicator $z\mapsto\mathbbm{1}\left\{z\leq r\right\}$ . To establish Thm. 5, we shall fix $C=\frac{\epsilon^{2}\sqrt{d}}{5.2}$ and consider the measure $\mu$ with density $\gamma^{\prime}(\mathbf{x})=(\beta\alpha)^{d}\varphi^{2}(\beta\alpha\mathbf{x})$ used in Safran and Shamir (2017, Thm. 1), for some $\beta\in[1,2]$ and where $\alpha\geq 1$ is the universal constant from Eldan and Shamir (2016). We will show that

[TABLE]

and that there exists a depth $2$ network $N^{\prime}$ which is based on $N$ , having width $2\cdot w\left(d,c\epsilon^{-3}\right)$ for some universal $c>0$ , and satisfying

[TABLE]

This would imply Thm. 5, since Equations (37) and (38) yield

[TABLE]

and by plugging $\epsilon=c_{0}/d^{2}$ for some universal constant $c_{0}>0$ , we have from Safran and Shamir (2017, Thm. 1, Eq. (4)) that for any such depth $2$ neural network approximation of a ball indicator of radius $r=\sqrt{d}$ w.r.t. the measure $\mu$ with density $\gamma^{\prime}$ satisfying

[TABLE]

it must hold that the width of $N$ satisfies $w(d,c_{2}d^{6})\geq c_{3}\exp(c_{4}d)$ for any $d>c_{1}$ , some $c_{2}>0$ and small enough $c_{3},c_{4}>0$ .

We begin by proving Eq. (37), following a similar approach as in Eldan and Shamir (2016, Lemma 7). We have by definition that

[TABLE]

Changing to polar coordinates, where $A_{d}=\frac{d\pi^{d/2}}{\Gamma(\frac{d}{2}+1)}$ denotes the surface area of the unit hypersphere in $\mathbb{R}^{d}$ , the above equals

[TABLE]

Using the definition of $R_{d}$ , this equals

[TABLE]

where the inequality is due to the definitions of $f_{r,C},h_{r}$ , since both functions are identical on $\left[0,\infty\right)$ except for the interval $\left[\sqrt{d},\sqrt{d}+C\right]$ , where they deviate from each other by at most $1$ . Moreover, for such $z$ in the integration interval we have that Lemma 14 from Eldan and Shamir (2016) applies; therefore, the integral is upper bounded by

[TABLE]

where Eq. (37) follows by taking the square root.

Moving to Eq. (38), suppose we have $N$ as in the theorem statement, approximating $f$ to accuracy $\frac{\epsilon}{4(C^{-1}r+1)}=\Theta(\epsilon^{3})$ . Namely, we have a depth $2$ network of width $w\left(d,c\epsilon^{-3}\right)$ such that

[TABLE]

for some constant $c>0$ and some density $\gamma$ . Let $A$ be some invertible matrix to be determined later, and consider the change of variables $\mathbf{y}=A\mathbf{x}\iff\mathbf{x}=A^{-1}\mathbf{y}$ , $d\mathbf{x}=\left|\det\left(A^{-1}\right)\right|\cdot d\mathbf{y}$ , which yields

[TABLE]

In particular, we may choose $\gamma(\mathbf{z})=\left|\det\left(A\right)\right|\cdot\gamma^{\prime}(A\mathbf{z})$ (note that this indeed defines a measure as readily seen by the change of variables $\mathbf{x}=A\mathbf{z}$ , $d\mathbf{x}=\left|\det\left(A\right)\right|d\mathbf{z}$ , yielding $\int_{\mathbf{z}}\gamma(\mathbf{z})d\mathbf{z}=\int_{\mathbf{x}}\gamma^{\prime}(\mathbf{x})d\mathbf{x}=1$ ). Plugging the chosen $\gamma$ in Eq. (39) we obtain

[TABLE]

Observing that for any $A$ , $N(A^{-1}\mathbf{y})$ expresses a linear transformation of the input which can be simulated by an appropriate modification of the weights in the hidden layer of $N$ , we choose $A=(r+C)\cdot I_{d}$ and $A=r\cdot I_{d}$ , where $I_{d}$ is the $d\times d$ identity matrix, to obtain

[TABLE]

and

[TABLE]

Now, consider the network $N^{\prime}$ given by

[TABLE]

Note that this is indeed a depth $2$ network of width $2\cdot w\left(d,c\epsilon^{-3}\right)$ as a linear combination of depth $2$ networks. We will show that this network approximates

[TABLE]

Taking the square roots of Equations (41) and (42), we obtain

[TABLE]

implying Eq. (38), and concluding the proof of Thm. 5.

Acknowledgements

This research is supported in part by European Research Council (ERC) Grant 754705.

Appendix A Trading-Off $L,\epsilon$ and Radius of Support

In this appendix, we formally show that given an inapproximability result for neural networks, using an $L$ -Lipschitz function, w.r.t. some distribution with support of radius $r$ and accuracy $\epsilon$ , it is easy to get an inapproximability result even for $1$ -Lipschitz functions, at the cost of scaling either $\epsilon$ or $r$ polynomially in $L$ :

Theorem 9.

Let $f$ be an $L$ -Lipschitz function on $\mathbb{R}^{d}$ , and $\mu$ a measure over $\mathbb{R}^{d}$ with support bounded in $\{\mathbf{x}:\left|\left|\mathbf{x}\right|\right|\leq r\}$ for some $r\leq\infty$ . Suppose that

[TABLE]

where $\mathcal{N}$ is some class of functions closed under scaling (namely, if $n\in\mathcal{N}$ , then $\mathbf{x}\mapsto a\cdot n(b\mathbf{x})$ for any $a,b>0$ is also in $\mathcal{N}$ ).

1. Define the $1$ -Lipschitz function $\tilde{f}(\mathbf{x}):=\frac{1}{L}f(\mathbf{x})$ . Then it holds that

[TABLE]

2. Define the $1$ -Lipschitz function $\hat{f}(\mathbf{x}):=f\left(\frac{1}{L}\mathbf{x}\right)$ , and the measure $\hat{\mu}$ by $\hat{\mu}(A):=\mu\left(\frac{1}{L}A\right)$ for any set $A$ in the $\sigma$ -algebra of $\mu$ (where $\frac{1}{L}A:=\{\frac{1}{L}\mathbf{x}:\mathbf{x}\in A\}$ and assuming this set is also in the $\sigma$ -algebra of $\mu$ ). Then $\hat{\mu}$ has a support bounded in $\{\mathbf{x}:\left|\left|\mathbf{x}\right|\right|\leq rL\}$ , and

[TABLE]

Proof.

By the assumptions, we have $\inf_{n\in\mathcal{N}}\mathbb{E}_{\mathbf{x}\sim\mu}\left[\left(\frac{1}{L}n(\mathbf{x})-\frac{1}{L}f(\mathbf{x})\right)^{2}\right]\geq\frac{\epsilon}{L^{2}}$ , so the first part follows from definition of $\tilde{f}$ and the fact that $\mathcal{N}$ is closed under scaling. As to the second part, the assertion on the support of $\hat{\mu}$ is immediate, and we have

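$$\inf_{n\in\mathcal{N}}\mathbb{E}_{\mathbf{x}\sim\hat{\mu}}\left[\left(n(\mathbf{x})-\hat{f}(\mathbf{x})\right)^{2}\right]=\inf_{n\in\mathcal{N}}\mathbb{E}_{\mathbf{x}\sim\hat{\mu}}\left[\left(n(\mathbf{x})-f\left(\tfrac{1}{L}\mathbf{x}\right)\right)^{2}\right]=\inf_{n\in\mathcal{N}}\mathbb{E}_{\mathbf{x}\sim\mu}\left[\left(n(L\mathbf{x})-f(\mathbf{x})\right)^{2}\right],$$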

which is at least $\epsilon$ by our assumptions and the fact that $\mathcal{N}$ is closed under scaling. ∎
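The two rescaling identities behind Thm. 9 can be checked numerically. The sketch below is only an illustration under assumptions that do not come from the paper: $\mu$ is taken to be uniform on the ball of radius $r$, the target is the hypothetical $L$-Lipschitz radial function $\sin(L\left|\left|\mathbf{x}\right|\right|)$, and `net` is a depth 2 ReLU network with random weights. It checks the identities for one fixed network on a finite sample, which is the step the proof combines with closure under scaling; it does not compute the infimum over $\mathcal{N}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, r, num_samples = 5, 10.0, 1.0, 10_000

# Illustrative choice of mu (an assumption): uniform on the ball of radius r.
directions = rng.normal(size=(num_samples, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = r * rng.uniform(size=(num_samples, 1)) ** (1.0 / d)
x = directions * radii


def f(points):
    """Hypothetical L-Lipschitz radial target: sin(L * ||x||)."""
    return np.sin(L * np.linalg.norm(points, axis=1))


# Hypothetical depth 2 ReLU network of width 16 with random weights.
W = rng.normal(size=(d, 16))
b = rng.normal(size=16)
v = rng.normal(size=16)


def net(points):
    return np.maximum(points @ W + b, 0.0) @ v


def sq_err(pred, target):
    """Empirical squared error, standing in for the expectation."""
    return float(np.mean((pred - target) ** 2))


base_err = sq_err(net(x), f(x))

# Part 1: scaling both the network and the target by 1/L
# scales the squared error by 1/L^2.
part1_err = sq_err(net(x) / L, f(x) / L)

# Part 2: samples from mu_hat are L times samples from mu,
# and f_hat(x) = f(x / L) is 1-Lipschitz.
x_hat = L * x


def f_hat(points):
    return f(points / L)


part2_err = sq_err(net(x_hat), f_hat(x_hat))
# Same quantity written over mu: the error of x -> net(L x) against f.
pulled_back_err = sq_err(net(L * x), f(x))

print(base_err / L**2, part1_err)
print(pulled_back_err, part2_err)
```

In each printed pair the two numbers should agree up to floating-point rounding. Since $\mathbf{x}\mapsto\frac{1}{L}n(\mathbf{x})$ and $\mathbf{x}\mapsto n(L\mathbf{x})$ range over all of $\mathcal{N}$ as $n$ does, taking the infimum on both sides gives the two parts of the theorem.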

References

Abbe and Sandon [2018] E. Abbe and C. Sandon. Provable limitations of deep learning. arXiv preprint arXiv:1812.06369, 2018.
Andrews et al. [1999] G. E. Andrews, R. Askey, and R. Roy. Special functions, volume 71 of Encyclopedia of Mathematics and its Applications, 1999.
Barron [1993] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
Boucheron et al. [2005] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
Cohen et al. [2015] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: a tensor analysis. arXiv preprint arXiv:1509.05009, 556, 2015.
Daniely [2017] A. Daniely. Depth separation for neural networks. arXiv preprint arXiv:1702.08489, 2017.
Delalleau and Bengio [2011] O. Delalleau and Y. Bengio. Shallow vs. deep sum-product networks. In NIPS, pages 666–674, 2011.
Eldan and Shamir [2016] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In 29th Annual Conference on Learning Theory, pages 907–940, 2016.
