Depth Separations in Neural Networks: What is Actually Being Separated?
Itay Safran, Ronen Eldan, Ohad Shamir

TL;DR
This paper investigates depth separation in neural networks for Lipschitz radial functions, showing that depth 2 networks can approximate such functions efficiently, challenging previous assumptions about the necessity of greater depth.
Contribution
The paper demonstrates that depth 2 networks can approximate $O(1)$-Lipschitz radial functions with size polynomial in either the dimension (for constant accuracy) or the inverse accuracy (for constant dimension), showing that prior depth separation results do not extend to this regime.
Findings
Depth 2 networks can approximate $O(1)$-Lipschitz radial functions with size polynomial in the dimension, for any constant accuracy.
Approximation is also possible with size polynomial in $1/\epsilon$ for fixed dimension.
Simultaneous polynomial dependence on both dimension and inverse accuracy is impossible.
Depth Separations in Neural Networks: What is Actually Being Separated?
Accepted for presentation at the Conference on Learning Theory (COLT) 2019.
Itay Safran Ronen Eldan Ohad Shamir
Weizmann Institute of Science
{itay.safran,ronen.eldan,ohad.shamir}@weizmann.ac.il
Abstract
Existing depth separation results for constant-depth networks essentially show that certain radial functions in $\mathbb{R}^d$, which can be easily approximated with depth 3 networks, cannot be approximated by depth 2 networks, even up to constant accuracy, unless their size is exponential in $d$. However, the functions used to demonstrate this are rapidly oscillating, with a Lipschitz parameter scaling polynomially with the dimension $d$ (or equivalently, by scaling the function, the hardness result applies to $O(1)$-Lipschitz functions only when the target accuracy $\epsilon$ is at most $\mathrm{poly}(1/d)$). In this paper, we study whether such depth separations might still hold in the natural setting of $O(1)$-Lipschitz radial functions, when $\epsilon$ does not scale with $d$. Perhaps surprisingly, we show that the answer is negative: In contrast to the intuition suggested by previous work, it is possible to approximate $O(1)$-Lipschitz radial functions with depth 2, size $\mathrm{poly}(d)$ networks, for every constant $\epsilon$. We complement it by showing that approximating such functions is also possible with depth 2, size $\mathrm{poly}(1/\epsilon)$ networks, for every constant $d$. Finally, we show that it is not possible to have polynomial dependence in both $d$ and $1/\epsilon$ simultaneously. Overall, our results indicate that in order to show depth separations for expressing $O(1)$-Lipschitz functions with constant accuracy (if at all possible), one would need fundamentally different techniques than existing ones in the literature.
1 Introduction
In the past few years, quite a few theoretical works have explored the beneficial effect of depth on increasing the expressiveness of neural networks (e.g., Delalleau and Bengio (2011); Martens et al. (2013); Martens and Medabalimi (2014); Montufar et al. (2014); Cohen et al. (2015); Telgarsky (2016); Eldan and Shamir (2016); Liang and Srikant (2016); Poggio et al. (2016); Poole et al. (2016); Shaham et al. (2016); Yarotsky (2016); Daniely (2017); Safran and Shamir (2017)). These works mostly focus on depth separations: namely, showing that there are functions which can be expressed by a small network of a given depth, but cannot be approximated by shallower networks, even if their size is much larger. Perhaps the clearest manifestation of this is in separating depth 2 and depth 3 networks: There are functions $f$ and distributions $\mu$ on $\mathbb{R}^d$, which are
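- Hard to approximate with a depth 2 network: $\mathbb{E}_{x\sim\mu}\left[(N_2(x)-f(x))^2\right]\geq c$ for some absolute $c>0$, using any depth 2, width $\mathrm{poly}(d)$ network $N_2(x)=\sum_i u_i\sigma(w_i^\top x+b_i)$ (for some parameters $\{u_i,w_i,b_i\}$ and univariate activation function $\sigma$).
- Easy to approximate with a depth 3 network: For any $\epsilon>0$, it holds that $\mathbb{E}_{x\sim\mu}\left[(N_3(x)-f(x))^2\right]\leq\epsilon$ (or sometimes even $\sup_x|N_3(x)-f(x)|\leq\epsilon$) for some depth 3, width $\mathrm{poly}(d,1/\epsilon)$ neural network $N_3(x)=\sum_i u_i\sigma(N_{2,i}(x))$ (where each $N_{2,i}$ is a depth 2, width $\mathrm{poly}(d,1/\epsilon)$ network, and $\sigma$ is a standard activation such as a ReLU).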
Eldan and Shamir (2016) (as well as a related construction in Safran and Shamir (2017)) prove such a lower bound unconditionally, whereas Daniely (2017) shows this with a simple proof, assuming that the parameters of the network cannot be too large. Moreover, these “hard” functions have a simple form: They are essentially radial functions of the form $x\mapsto f(\|x\|)$ for a univariate function $f$ (Eldan and Shamir (2016) use a radial function; Daniely (2017) uses a function which is easily reduced to a radial one, see next paragraph). Such radial functions are of interest in learning theory, since there are function classes that are essentially a mixture of radial functions (e.g. Gaussian kernels), and they are essential primitives in expressing functions which involve Euclidean distances. The intuition for the above separations is that radial functions can be easily approximated with depth 3 networks, by first approximating the function $x\mapsto\|x\|$ in the first layer, and then approximating the univariate function $f$ in the next layers. In contrast, approximating high-dimensional radial functions with depth 2 networks appears to be difficult, since they are, in a sense, the furthest away from functions which depend on only a single direction (see Fig. 1). Overall, these results appear to provide a clear separation between the required widths of depth 2 and depth 3 networks, in terms of the dimension $d$.
However, a closer inspection of the constructions above reveals that in fact, this is not so clear. The reason is that the functions which are shown to be provably hard for depth 2 networks are rapidly oscillating, and require a Lipschitz constant (at least) polynomial in $d$ to even approximate: In Eldan and Shamir (2016), the function is a weighted sum of indicator functions of $\|x\|$ falling in disjoint intervals, and in Daniely (2017), the function used is easily reduced to a rapidly oscillating radial function (see the proof of Thm. 4). Having such rapidly oscillating functions is not always a natural regime, since we are often interested in functions whose Lipschitz parameter is independent of the dimension. For example, in learning theory, this is actually needed to obtain dimension-free learnability results for convex functions (Shalev-Shwartz and Ben-David, 2014, Chapter 12). Moreover, there is evidence that functions which oscillate too rapidly can be computationally difficult for neural networks to learn with standard gradient-based methods (e.g., Song et al., 2017; Shalev-Shwartz et al., 2017; Shamir, 2018; Abbe and Sandon, 2018), so Lipschitz functions are arguably more interesting from a learning perspective. Overall, we are led to the following natural question:
Can we show a depth 2 vs. depth 3 separation result in terms of the dimension $d$, even for approximating $O(1)$-Lipschitz functions?
In other words, are there $O(1)$-Lipschitz functions which cannot be approximated by depth 2, width-$\mathrm{poly}(d)$ networks, but can be approximated by depth 3, width-$\mathrm{poly}(d)$ networks?
To study this, we first notice that it is easy to reduce any hardness result for approximating $L$-Lipschitz functions to accuracy $\epsilon$, to hardness of approximating $1$-Lipschitz functions to accuracy $\epsilon/L$, simply by scaling the functions by $1/L$. Moreover, we can even reduce the hardness result to a $1$-Lipschitz function with accuracy $\epsilon$, by dilating the measure we are using by a factor $L$ (see Appendix A for a formal statement). However, now the lower bounds require either the accuracy $\epsilon$ or the diameter of the support of the distribution to scale polynomially with $d$. As a result, when saying that depth 2 networks require width super-polynomial in $d$, it is not clear whether the hardness really comes from the dimension $d$, or perhaps from other parameters which are being forced to scale with it, such as the accuracy $\epsilon$. Thus, we rephrase our question as follows:
Can we show a depth 2 vs. depth 3 neural network separation result in terms of the dimension $d$, for approximating $O(1)$-Lipschitz functions up to constant accuracy $\epsilon$ on a domain of bounded radius (all independent of $d$)?
The intuition described earlier (on the difficulty of approximating radial functions in high dimensions) seems to suggest that the answer is positive.
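The two scaling reductions mentioned above can be written out in a few lines. The following is a worked sketch under assumed notation ($L$ for the Lipschitz constant, $\epsilon$ for the accuracy, $N$ for an arbitrary network, $\mu$ for the input distribution); the formal statement remains the one in Appendix A.

```latex
% Assumed notation: f is L-Lipschitz, N is any network, \mu is the input distribution.
% (1) Scaling the output by 1/L: f/L is 1-Lipschitz, N/L is a network of the same width.
\[
  \mathbb{E}_{x\sim\mu}\big[(N(x)-f(x))^2\big] \ge \epsilon^2
  \iff
  \mathbb{E}_{x\sim\mu}\Big[\big(\tfrac{1}{L}N(x)-\tfrac{1}{L}f(x)\big)^2\Big]
  \ge \big(\tfrac{\epsilon}{L}\big)^2 .
\]
% (2) Dilating the input by L: g(x) := f(x/L) is 1-Lipschitz, since
\[
  |g(x)-g(y)| = |f(x/L)-f(y/L)| \le L\,\|x/L-y/L\| = \|x-y\| ,
\]
% and with M(x) := N(x/L) (same width) and \mu_L the law of Lx for x \sim \mu,
\[
  \mathbb{E}_{x\sim\mu_L}\big[(M(x)-g(x))^2\big]
  = \mathbb{E}_{x\sim\mu}\big[(N(x)-f(x))^2\big] .
\]
```

In the first reduction the accuracy shrinks by a factor $L$; in the second the accuracy is preserved but the support of the distribution grows by a factor $L$, which is exactly the trade-off described in the text.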
Our Results. In this paper, we show that perhaps surprisingly, the answer to the question above is actually negative (at least for radial functions): For any constant $\epsilon$, it is possible to approximate radial functions using $\mathrm{poly}(d)$-width, depth 2 networks; the precise upper bound on the required size is given in Thm. 1. We also complement this by showing that for constant dimension $d$, approximation of any $O(1)$-Lipschitz radial function is possible with $\mathrm{poly}(1/\epsilon)$-width, depth 2 networks; the precise bound is given in Thm. 3. Both bounds are $L_\infty$-type approximation results, with respect to the unit ball: Namely, given a function $f$, we show how to find a neural network $N$ such that
$$\sup_{x\in\mathbb{B}_d}|N(x)-f(x)|\leq\epsilon,$$
where $\mathbb{B}_d=\{x\in\mathbb{R}^d:\|x\|\leq 1\}$ is the unit ball. This is a stronger approximation guarantee than $L_2$-type approximation guarantees (where we bound $\mathbb{E}_{x\sim\mu}\left[(N(x)-f(x))^2\right]$ for some distribution $\mu$ on the unit ball), since a bound on the former implies a bound on the latter. Furthermore, we show that any even radial monomial, namely a radial function of the form $x\mapsto\|x\|^{2k}$, for any fixed natural $k$, can be approximated to accuracy $\epsilon$ using a depth 2 network of width polynomial in both $d$ and $1/\epsilon$. Finally, we formally prove (using a reduction from Eldan and Shamir (2016); Daniely (2017), and using their assumptions) that it is impossible to obtain a general polynomial dependence on both $d$ and $1/\epsilon$ in our setting (see Thm. 4 and Thm. 5). Overall, these results show that to approximate radial functions with depth 2 networks, their width can be polynomial in either $d$ or $1/\epsilon$, but generally not in both. Putting this in the context of known depth separations, our results indicate that the difficulty in approximating the “hard” functions used in separating depth 2 from depth 3 stems from both the input dimension and the accuracy parameter simultaneously, and not from either one alone.
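The implication from the $L_\infty$ guarantee to the $L_2$ guarantee is a one-line calculation. It is sketched here for any distribution $\mu$ supported on the unit ball; the symbols $N$, $f$, $\mu$ and $\epsilon$ are notation assumed for this sketch.

```latex
% Assumption: \mu is supported on the unit ball \{x : \|x\| \le 1\}.
\[
  \mathbb{E}_{x\sim\mu}\big[(N(x)-f(x))^2\big]
  \;\le\; \sup_{\|x\|\le 1}\,(N(x)-f(x))^2
  \;=\; \Big(\sup_{\|x\|\le 1}|N(x)-f(x)|\Big)^2
  \;\le\; \epsilon^2 .
\]
```

The converse fails in general, since a network may be accurate on average while being far from the target on a set of small measure.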
It is interesting to note that such trade-offs between dimension and accuracy also appear in very different areas of learning theory. For example, consider the classic problem of agnostically learning halfspaces up to excess error $\epsilon$ in $d$ dimensions (see for example Kalai et al. (2008)): It is folklore that for well-behaved input distributions, one can learn a halfspace in runtime $\mathrm{poly}(1/\epsilon)$ for constant $d$ (simply by creating an $\epsilon$-net of all possible halfspaces, and picking the best one on the training data). On the other hand, it is also known that one can learn in runtime $\mathrm{poly}(d)$ for any constant $\epsilon$, at least for certain input distributions (Kalai et al., 2008). However, there is evidence that being polynomial in both is not possible in those settings (Klivans and Kothari, 2014).
Finally, we emphasize that our results still do not fully settle the question stated above, since there might be depth separation results using functions which cannot be reduced to radial ones. However, such results do not exist at the present time, and we believe that our observations may also be relevant for more general families of functions. In any case, we hope our paper would motivate and guide further study of this question.
2 Main Results
In this section, we present our main results and the high-level proof components. The remainder of the proofs are provided in Sec. 3.
2.1 Approximation with Width $\mathrm{poly}(d)$ Networks
We first present our formal result, implying that radial functions can be approximated with depth 2, width $\mathrm{poly}(d)$ networks, to any constant accuracy $\epsilon$. We prove this result for networks employing any activation function $\sigma$ which satisfies the following mild assumption (taken from Eldan and Shamir (2016)), which implies that the activation can be used to approximate univariate functions well. This assumption is satisfied for all standard activations, such as ReLU and sigmoidal functions (see the reference above for further discussion):
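Assumption 1.
Given the activation function $\sigma$, there is a constant $c_\sigma\geq 1$ (depending only on $\sigma$) such that the following holds: For any $L$-Lipschitz function $f:\mathbb{R}\to\mathbb{R}$ which is constant outside a bounded interval $[-R,R]$, and for any $\delta$, there exist scalars $a,\{\alpha_i,\beta_i,\gamma_i\}_{i=1}^{w}$, where $w\leq c_\sigma\frac{RL}{\delta}$, such that the function
$$h(x)=a+\sum_{i=1}^{w}\alpha_i\,\sigma(\beta_i x-\gamma_i)$$
satisfies
$$\sup_{x\in\mathbb{R}}|f(x)-h(x)|\leq\delta.$$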
Our main result for this subsection is the following:
Theorem 1.
Suppose satisfies Assumption 1. Then for any and any -Lipschitz radial function , there exists a depth neural network with activations and width satisfying
[TABLE]
where the big O notation hides a constant that depends solely on .
We note that the exponent might be improvable to some smaller polynomial in $1/\epsilon$ (see proof for details), but overall the dependence on $1/\epsilon$ remains at least exponential.
The proof of this theorem requires several intermediate results about the approximation capabilities of depth 2 networks, some of which may be of independent interest. The high-level strategy is the following:
- First, we consider depth 2 networks with exponential activations. Using properties of the beta distribution, we show that if the hidden weights are drawn uniformly and independently from the unit sphere (and the output weights are fixed appropriately), then the expectation of the resulting network is a complicated function, which nevertheless depends only on the norm of the input $x$. Using concentration of measure, we show that the network itself is uniformly close to this function if the width is sufficiently large (Thm. 6).
- Next, we use Assumption 1 to show that we can construct a bounded-width network with any $\sigma$-activation satisfying the assumption (not just an exponential one), which approximates the same function (Thm. 7).
- Using a Taylor series argument, we show that a careful linear combination of (not too many) scaled versions of this function allows us to approximate any even monomial in the norm of $x$. Since a linear combination of depth 2 networks is still a depth 2 network, this implies that we can approximate $\|x\|^{2k}$ with some depth 2 network, again with bounded width (Thm. 8).
- Finally, we use a quantitative version of Weierstrass’ approximation theorem, to show that we can approximate any Lipschitz radial function on the unit ball by a linear combination of even monomials (Lemma 4). Again, this implies that we can find a bounded-width depth 2 network which approximates this radial function well.
2.2 Approximation with Width $\mathrm{poly}(1/\epsilon)$ Networks
Having considered depth 2, width $\mathrm{poly}(d)$ networks (for constant accuracy $\epsilon$), we now turn to consider the complementary setting, where the dimension $d$ is fixed, and we show how Lipschitz radial functions can be approximated by width $\mathrm{poly}(1/\epsilon)$ networks. This setting is closer in spirit to universal approximation theorems for depth 2 networks (namely, on how such networks can approximate any continuous function on a compact domain, if we allow exponential dependencies on $d$). Unfortunately, most such theorems are not quantitative in nature, and do not imply polynomial dependence on $1/\epsilon$. A noteworthy exception is the line of work pioneered by Barron (see Barron (1993)), which provides quantitative approximation guarantees in terms of the width and moments of the Fourier transform of the target function $f$. Our main technical contribution here is to show how we can translate such moment-based bounds to a bound applicable to any Lipschitz radial function. For concreteness, we will focus here on networks employing the common ReLU activations (i.e. $\sigma(z)=\max\{0,z\}$), although the technique is applicable more generally. We make use of the following recent result from Klusowski and Barron (2018, Theorem 2), which provides an approximation guarantee for ReLU networks:
Theorem 2 (Klusowski and Barron (2018)).
Let . Suppose admits a Fourier representation and
[TABLE]
Then there exist depth ReLU networks , each of width such that for all
[TABLE]
for some universal constant .
Note that in their original theorem statement, Klusowski and Barron (2018) define the ReLU networks as having an additional linear term. For convenience, we write this term as a sum of two ReLU neurons, and thus omit it from the theorem statement.
We now turn to formally state the main result of this subsection.
Theorem 3.
Suppose is a -Lipschitz radial function on . Then there exists a depth ReLU neural network , of width such that
[TABLE]
The proof (in Sec. 3) utilizes Thm. 2, with the main challenge being that even for a -Lipschitz radial , the coefficient might be unbounded. Instead, we consider a smoothed approximation , where is the convolution operation and is the Gaussian pdf with mean and covariance matrix . Since is Lipschitz, this function is -close to at any point . Therefore, to approximate well, it is sufficient to approximate well. Moreover, since represents a convolution with a smooth function, then it is smooth, and therefore its Fourier transform has a rapidly decaying tail. This implies that the coefficient is bounded (in a manner exponential in but polynomial in ), and an application of Thm. 2 implies the result.
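The smoothing step can be checked numerically. The sketch below is only an illustration under stated assumptions: the Gaussian is taken to have covariance $\sigma^2 I$ (the symbol `sigma` and the target `radial_profile` are hypothetical choices, not taken from the paper), and the convolution is estimated by Monte Carlo. For a $1$-Lipschitz $f$, the pointwise gap between $f$ and its smoothed version is at most $\sigma\,\mathbb{E}\|Z\|\leq\sigma\sqrt{d}$ for a standard Gaussian $Z$.

```python
import numpy as np


def smoothed(f, x, sigma, num_samples, rng):
    """Monte Carlo estimate of (f * gamma)(x) = E[f(x + sigma * Z)], Z ~ N(0, I)."""
    z = rng.standard_normal((num_samples, x.shape[0]))
    return float(np.mean(f(x + sigma * z)))


def radial_profile(points):
    """Hypothetical 1-Lipschitz radial target: | ||x|| - 1/2 |."""
    return np.abs(np.linalg.norm(points, axis=-1) - 0.5)


rng = np.random.default_rng(0)
d, sigma = 5, 0.05  # assumed dimension and smoothing width

# A fixed point inside the unit ball.
x = rng.standard_normal(d)
x *= 0.7 / np.linalg.norm(x)

# Gap between the target and its Gaussian-smoothed version at x.
gap = abs(smoothed(radial_profile, x, sigma, 20000, rng) - float(radial_profile(x)))

# Lipschitz bound: |E f(x + sigma Z) - f(x)| <= sigma * E||Z|| <= sigma * sqrt(d).
bound = sigma * np.sqrt(d)
print(f"gap = {gap:.5f}, bound = {bound:.5f}")
```

Shrinking `sigma` tightens the gap linearly, at the price of a less smooth function and hence a larger Fourier moment in the bound of Thm. 2.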
2.3 Impossibility to Approximate with Width $\mathrm{poly}(d,1/\epsilon)$ Networks
In this subsection, we complement our previous positive approximation results with negative results. Specifically, we provide two lower bounds, which imply that there are -Lipschitz radial functions, which cannot be approximated to accuracy on the unit ball , using depth , width networks (see Fig. 2). In a sense, this was already shown in Daniely (2017); Eldan and Shamir (2016), as discussed in the introduction. However, a bit of work is needed to apply them to our setting: For example, the result in Eldan and Shamir (2016) is for a radial function, but not a Lipschitz one, and the result in Daniely (2017) is not for a radial function.
Since our results are based on reductions from these papers, we need to make similar assumptions. In particular, we need to require either having an approximation on an unbounded domain, or that the approximating network’s parameters are at most exponential in . To the best of our knowledge, it remains a major open problem to prove a depth separation result without either of these two assumptions (namely, on a compact domain such as , and without restrictions on the magnitude of the parameters).
Theorem 4.
The following holds for some positive universal constants , and any depth network employing a ReLU activation function. Consider the -Lipschitz function on . Suppose is a depth network of width , with weights bounded by , and satisfying for any and any . Then for any ,
[TABLE]
In particular, depth networks of width cannot approximate to accuracy .
We remark that the impossibility result provided in the theorem above is in terms of -type approximation, namely rather than . This is for simplicity and to make the setting complementary to our positive results from earlier (however, extending it to approximation results is not too difficult).
Theorem 5.
The following holds for some positive universal constants , and any network employing an activation function satisfying Assumptions 1 and 2 in Eldan and Shamir (2016). Let . For any , there exists a continuous probability distribution on , such that for any , and any depth neural network satisfying and having width , it must hold that
[TABLE]
In particular, depth networks of width cannot approximate to accuracy .
3 Proofs
3.1 Proof of Thm. 1
We begin by stating the following theorem, which establishes the capability of exponential networks to approximate a particular radial function, which we denote by . Our construction for approximating uses random weights, resulting in a random network which is significantly easier to analyze when exponential activations are considered (basically, since the exponent of a random variable is its moment generating function, which for many distributions is well-known and studied).
Theorem 6.
For an integer , define . For any and natural there exists an exponential depth neural network on , of width , hidden layer weights satisfying (the unit sphere), and , such that
[TABLE]
where for all ,
[TABLE]
The proof of Thm. 6 relies on the observation that by drawing uniformly from the unit sphere, the neuron has an expected value equal to . Setting for all , and sampling each independently, we have from concentration of measure that the resulting network gradually converges to this expected value, effectively approximating . Before we prove Thm. 6 however, we would need to evaluate the distribution of the dot product of such a random neuron with its input, as well as derive an equivalent representation of which we will encounter when proving the theorem. To this end, we have the following two lemmas:
Lemma 1.
Suppose such that , and suppose is distributed uniformly on the -dimensional unit sphere. Then the random variable follows a distribution.
Proof.
Since is invariant to orthogonal transformations, we may assume w.l.o.g. that is of the form . That is, , where is the first coordinate of . Therefore to determine the distribution of , it suffices to compute the probability of falling in the interval for , or equivalently, falling in the interval . Since is distributed uniformly on the unit sphere, this is proportional to the area of a hyperspherical cap centered at , and defined by . This probability is given in terms of the regularized incomplete beta function as
[TABLE]
(Leopardi, 2007, Lemma 2.3.15.), where the spherical radius of the cap, , satisfies . Elementary trigonometry reveals that under this condition, it must hold that , namely we have
[TABLE]
implying that
[TABLE]
i.e., by the change of variables we have
[TABLE]
It follows immediately that is distributed, concluding the proof of the lemma. ∎
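Lemma 1 is easy to sanity-check by simulation. The sketch below assumes the standard form of the statement, namely that for a unit vector $x$ and $w$ uniform on the unit sphere of $\mathbb{R}^d$, the variable $(\langle w,x\rangle+1)/2$ follows a $\mathrm{Beta}\left(\frac{d-1}{2},\frac{d-1}{2}\right)$ distribution; the parameters are written out here as an assumption, since they are not legible in the statement above. The function name `sample_sphere` is hypothetical.

```python
import numpy as np


def sample_sphere(num_samples, d, rng):
    """Draw points uniformly from the unit sphere in R^d (normalized Gaussians)."""
    g = rng.standard_normal((num_samples, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)


rng = np.random.default_rng(0)
d, num_samples = 8, 200000

w = sample_sphere(num_samples, d, rng)
x = np.zeros(d)
x[0] = 1.0  # any unit vector works, by rotation invariance

# Shift and scale the dot product from [-1, 1] to [0, 1].
t = (w @ x + 1.0) / 2.0

# Assumed Beta(a, a) parameters with a = (d - 1) / 2.
a = (d - 1) / 2.0
beta_mean = 0.5
beta_var = 1.0 / (4.0 * (2.0 * a + 1.0))  # equals 1 / (4 d)

emp_mean, emp_var = float(t.mean()), float(t.var())

# Compare one CDF value against direct Beta samples.
ref = rng.beta(a, a, size=num_samples)
emp_cdf = float(np.mean(t < 0.3))
ref_cdf = float(np.mean(ref < 0.3))
print(emp_mean, emp_var, beta_var, emp_cdf, ref_cdf)
```

The variance $1/(4d)$ reflects the concentration of $\langle w,x\rangle$ around zero in high dimension, which is what the later concentration argument exploits.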
Lemma 2.
We have
[TABLE]
Proof.
Letting , we compute
[TABLE]
Since for even and for odd , it suffices to show that for any natural and any integer ,
[TABLE]
We rewrite
[TABLE]
where
[TABLE]
is the Gauss hypergeometric function. Using Euler’s integral formula for the Gauss hypergeometric function (Andrews et al., 1999, p. 65, Theorem 2.2.1) yields
[TABLE]
Simplifying the integral in Eq. (3), we substitute , to get
[TABLE]
Clearly, the integrand in Eq. (4) is an odd function when is odd, therefore for any odd . For even , integration by parts of and reveals that
[TABLE]
Recursively applying the relation in Eq. (5) yields
[TABLE]
Substituting back in the integral in Eq. (6) gives
[TABLE]
where denotes the Beta function. Finally, substituting our calculations from Equations (7,6,4,3) in Eq. (2), and using the identities which holds for any real , and which holds for any integer , we have
[TABLE]
∎
We are now ready to prove Thm. 6.
Proof of Thm. 6.
Consider a depth network of width , where is to be determined later, with exponential activations, [math] bias terms in the hidden layer, equal weights of in the output neuron, and where the weights of each hidden neuron are sampled i.i.d. uniformly at random from the unit hypersphere . Fix such that , then we have from Lemma 1 that the network computes the random function
[TABLE]
where are i.i.d. Taking expectation in Eq. (8) yields
[TABLE]
Letting gives
[TABLE]
Conveniently, the expectation in the right hand side of Eq. (9) is exactly the moment generating function of a random variable, given by
[TABLE]
(Gupta and Nadarajah, 2004). By virtue of Lemma 2, Eq. (9) therefore reduces to
[TABLE]
To convert the above expectation equality to a uniform convergence bound we shall use a Rademacher complexity argument. We have that the approximation error is
[TABLE]
This is equivalent to bounding the uniform convergence of the function class , whose values are bounded in . By standard Rademacher complexity arguments, it is well-known that this is upper bounded by with probability at least . Specifically, letting , we can rewrite Eq. (11) as
[TABLE]
Defining the function class , we can upper bound the above (with probability at least over the sampling of ) by
[TABLE]
where is the (empirical) Rademacher complexity of , and the expectation is over which are sampled independently and uniformly from (see Boucheron et al. (2005, Theorem 3.2)). Since takes values in , is -Lipschitz in that domain and , we can upper bound the above by
[TABLE]
(see Boucheron et al. (2005, Theorem 3.3)). Finally, since consists of -Lipschitz linear functions over the unit ball, we have that (see Boucheron et al. (2005, Corollary 4.3)). Overall, we get that Eq. (11) is at most . Picking , this can be upper bounded by . In particular, this means that there exist some realizations of such that Eq. (11) is at most . In other words, for any , if we set , we have a depth 2 exponential network of width which approximates up to error .
∎
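The construction in this proof can be simulated directly. The sketch below is an illustration under simplifying assumptions: unscaled unit-sphere weights, zero biases and output weights $1/n$ (the paper's exact scaling may differ). It compares a random exponential network at two inputs of equal norm with the closed-form expectation $\mathbb{E}[e^{r w_1}]=\sum_{k\geq 0}\frac{(r^2/4)^k}{k!\,(d/2)_k}$, a standard identity for $w$ uniform on the unit sphere, where $(\cdot)_k$ is the rising factorial. The names `exp_network` and `expected_value` are hypothetical.

```python
import numpy as np


def sample_sphere(num_samples, d, rng):
    """Draw points uniformly from the unit sphere in R^d."""
    g = rng.standard_normal((num_samples, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)


def exp_network(x, weights):
    """Depth 2 network: exponential activations, zero biases, output weights 1/n."""
    return float(np.mean(np.exp(weights @ x)))


def expected_value(r, d, terms=60):
    """Series sum_k (r^2/4)^k / (k! (d/2)_k) for E[exp(r * w_1)], w uniform on the sphere."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (r * r / 4.0) / ((k + 1) * (d / 2.0 + k))
    return total


rng = np.random.default_rng(0)
d, width, r = 10, 200000, 0.8
weights = sample_sphere(width, d, rng)

# Two inputs with the same norm r but unrelated directions.
x1 = r * sample_sphere(1, d, rng)[0]
x2 = r * sample_sphere(1, d, rng)[0]

out1, out2 = exp_network(x1, weights), exp_network(x2, weights)
target = expected_value(r, d)
print(out1, out2, target)
```

Both outputs agree with the radial target up to the sampling error, which shrinks like the inverse square root of the width, in line with the Rademacher complexity bound used in the proof.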
Albeit useful for establishing Thm. 6, exponential activations are uncommon in practice. To translate Thm. 6 to work with more commonly used activations, we utilize the universality of activations satisfying Assumption 1 to approximate an exponential function on a bounded domain to arbitrary accuracy, resulting in a network approximating for a wide family of activation functions. More formally, we have the following theorem:
Theorem 7.
Suppose is an activation satisfying Assumption 1. Then for any and natural there exists a depth neural network with activations of width at most , satisfying , where depends solely on .
Proof of Thm. 7.
First, invoke Thm. 6 to obtain a width exponential network satisfying
[TABLE]
Next, using Assumption 1, we obtain a depth 2 network approximating the exponential on the unit interval , having width at most . Denoting this network by , we construct a network approximating as follows: For each hidden weight of , we take a copy of and feed it with to obtain . Note that is a depth 2 network, since the linear transformation can be simulated by modifying the hidden layer of to compute it exactly. We then define the network , which is also a depth 2 network of width , as a weighted combination of the networks (absorbing any absolute constants into ). We now compute using Eq. (12) for any
[TABLE]
Here we note that the boundedness of the weights of the hidden layer of and the Cauchy-Schwarz inequality guarantee that we remain in the relevant approximation domain of , as . ∎
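To make the role of Assumption 1 concrete, the following sketch builds a depth 2 ReLU network that approximates the exponential on $[-1,1]$ by piecewise-linear interpolation. This is one possible instance for the ReLU activation only, not the construction used in the paper; the function name `relu_interpolant` and the choice of 50 pieces are assumptions made for illustration. The standard interpolation bound gives sup error at most $h^2 e/8$ for knot spacing $h$.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def relu_interpolant(f, lo, hi, num_pieces):
    """Depth 2 ReLU network interpolating f at equally spaced knots on [lo, hi]."""
    knots = np.linspace(lo, hi, num_pieces + 1)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)
    # Output weights: the initial slope, then the slope change at each interior knot.
    coeffs = np.concatenate(([slopes[0]], np.diff(slopes)))

    def network(t):
        t = np.asarray(t, dtype=float)
        hidden = relu(t[..., None] - knots[:-1])  # one ReLU neuron per piece
        return values[0] + hidden @ coeffs

    return network


num_pieces = 50
net = relu_interpolant(np.exp, -1.0, 1.0, num_pieces)

grid = np.linspace(-1.0, 1.0, 2001)
max_err = float(np.max(np.abs(net(grid) - np.exp(grid))))

# Interpolation bound: h^2 / 8 * max |f''|, with f'' = exp <= e on [-1, 1].
h = 2.0 / num_pieces
bound = h * h * np.e / 8.0
print(f"max error = {max_err:.2e}, bound = {bound:.2e}")
```

Feeding each copy of this network with $\langle w_i,x\rangle$ only changes the hidden-layer weights, so the composed network stays at depth 2, as noted in the proof.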
Thm. 7 allows us to approximate the family of functions efficiently using depth 2 networks with a variety of activations. The following theorem utilizes Thm. 7 to approximate even radial monomials, i.e., radial functions of the form $x\mapsto\|x\|^{2k}$ for some natural $k$.
Theorem 8.
Suppose satisfies Assumption 1. Then for any and any natural , there exists a depth neural network with activations of width satisfying
[TABLE]
where the big O notation hides a constant that depends solely on .
Interestingly, apart from its role in proving Thm. 1, Thm. 8 also shows the existence of a family of functions that are approximable to accuracy using width polynomial in both and ; for any fixed , the radial polynomial can be approximated by a width network. Before delving into the proof of Thm. 8, however, we will first need the following lemma, which will utilize the power-series representation of to approximate polynomials of even degree. Note that at this point, the question of approximation is now reduced to a one dimensional problem, since approximating a radial using linear combinations of is equivalent to approximating using linear combinations of .
Lemma 3.
Suppose converges uniformly for all , where is non increasing. Then for sufficiently small and any , there exist , and a universal constant such that
[TABLE]
where and for all .
The proof of Lemma 3 relies on the observation that taking an appropriately chosen linear combination of the form for some and presenting it as a power-series, results in all the coefficients of for being exactly zero, the coefficient of being , and the remaining coefficients all decaying rapidly to $0$ as .
Proof.
Let be some even polynomial, and consider the set of functions
[TABLE]
These have the following expansions:
[TABLE]
Equating the coefficients , in , the expansion of , to the coefficients of , we obtain the matrix equality
[TABLE]
where is a diagonal matrix with the coefficients on its main diagonal, is the Vandermonde matrix given by
[TABLE]
and . Since , and since is invertible for small enough , Eq. (14) can be rearranged to
[TABLE]
Letting , we have that the coefficients up to degree agree with , thus to establish Eq. (13), it remains to bound the tail of the expansion for degrees . To this end, we will first bound each for . We have from Hölder’s inequality for all
[TABLE]
where is the -th row of , given by
[TABLE]
(Macon and Spitzbart, 1958). Bounding , we begin with the denominator to obtain for that if then
[TABLE]
Otherwise, if then
[TABLE]
Therefore
[TABLE]
Thus
[TABLE]
For the numerator we have
[TABLE]
which also holds for . Hence we have
[TABLE]
implying the -norm of the -th row is upper bounded by
[TABLE]
Combining the above with Eq. (15), we obtain the upper bound for some
[TABLE]
In general, the coefficient of the term for in the expansion of is given by
[TABLE]
Taking the absolute value and combining with Eq. (3.1), we get
[TABLE]
Finally, letting (note that this also entails ), we can bound the tail as follows
[TABLE]
∎
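The cancellation idea behind Lemma 3 can be illustrated on a toy case. The sketch below is a simplification under stated assumptions: it uses the exponential $g(t)=e^t$ as a stand-in power series with coefficients $a_j=1/j!$, scales $\delta,2\delta,\dots,(k+1)\delta$, and a single variable $t$ in place of the norm. The function name `monomial_combination` is hypothetical. Solving a scaled Vandermonde system forces the coefficients of $t^0,\dots,t^{k-1}$ to vanish and the coefficient of $t^k$ to equal $1$, leaving a tail of order $\delta$.

```python
import math

import numpy as np


def monomial_combination(k, delta, taylor_coeff):
    """Find v with sum_i v[i] * g(scales[i] * t) = t^k + O(delta) for g = sum_j a_j t^j."""
    scales = delta * np.arange(1, k + 2)  # k + 1 distinct scales
    powers = np.arange(k + 1)
    coeffs = np.array([taylor_coeff(j) for j in powers])
    # system[j, i] = a_j * scales[i]**j: coefficient of t^j contributed by g(scales[i] * t).
    system = coeffs[:, None] * scales[None, :] ** powers[:, None]
    target = np.zeros(k + 1)
    target[k] = 1.0  # keep only the t^k coefficient
    return scales, np.linalg.solve(system, target)


k, delta = 2, 0.01
scales, v = monomial_combination(k, delta, lambda j: 1.0 / math.factorial(j))

t = np.linspace(0.0, 1.0, 201)
approx = sum(v_i * np.exp(s_i * t) for v_i, s_i in zip(v, scales))
max_err = float(np.max(np.abs(approx - t ** k)))
print(f"max error on [0, 1]: {max_err:.4f}")
```

The price of a small `delta` is large combination weights (here of order $1/\delta^2$), which is why the proof tracks the magnitude of the coefficients as well as the approximation error.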
With the help of Lemma 3, we now turn to prove Thm. 8.
Proof of Thm. 8.
First, note that the family of functions satisfy the assumptions in Lemma 3 for any , as readily seen by their definition. Now, letting
[TABLE]
we obtain from Lemma 3 that
[TABLE]
for coefficients and satisfying
[TABLE]
To bound , observe that for any and , thus
[TABLE]
Plugging the above in Eq. (3.1) yields
[TABLE]
It now remains to approximate the function to accuracy (note that the is trivial, as it can be easy to simulate with a constant neuron). To this end, invoke Thm. 7 with a desired accuracy of , to obtain a network approximating . We stress that such approximation of is obtained for any , and since we have we are guaranteed to remain in the relevant domain. Taking such copies of , we obtain a width network
[TABLE]
approximating , since
[TABLE]
Combining Equations (17) and (19), we conclude that
[TABLE]
∎
Before we can prove Thm. 1, it only remains to prove the following lemma, which establishes quantitative bounds on the ability of even polynomials of bounded degree to approximate arbitrary Lipschitz functions, while having bounded coefficients:
Lemma 4.
Let be a -Lipschitz function. Then for any , there exists an even polynomial of degree such that
[TABLE]
and where the coefficients of , denoted are upper bounded by .
We remark that the exponent in the result can possibly be improved somewhat, but this will not change the exponential dependence on in our main theorem.
The following proof follows along a similar line as the proof provided by S. Bernstein for Weierstrass’ approximation theorem (see Koralov and Sinai (2007, Thm. 2.7) for the proof), albeit we also bound the magnitude of the coefficients of the approximating polynomial.
Proof.
Let be -Lipschitz. First, by approximating instead, we may assume w.l.o.g. that (adding the zero degree polynomial to our approximation once obtained). Extend to an even function on given by
[TABLE]
Letting , we linearly shift to the unit interval where is -Lipschitz. Define the Bernstein basis polynomials of degree as
[TABLE]
It is a well known fact that these polynomials form a partition of unity for any :
[TABLE]
Define the -th Bernstein polynomial approximation of as
[TABLE]
We compute using Eq. (20)
[TABLE]
Since is -Lipschitz, we have that implies , thus (21) is upper bounded by
[TABLE]
Recalling that , we have from Lipschitzness that . Therefore (22) is upper bounded by
[TABLE]
Observe that Eq. (23) is exactly , where is binomially distributed. Using Chebyshev’s inequality, we obtain
[TABLE]
Letting entails (22) is upper bounded by , yielding
[TABLE]
or equivalently by changing ,
[TABLE]
Denote . We shall now bound the coefficients of the approximating polynomial . We have
[TABLE]
To upper bound the coefficients, observe that taking the absolute value of and substituting with will result in a polynomial with only positive coefficients, upper bounding the ones of . Therefore
[TABLE]
Clearly, the coefficients of are upper bounded by . Finally, consider the even polynomial
[TABLE]
Its even coefficients are equal to those of and are thus bounded by . Moreover, we have
[TABLE]
By virtue of being even we have , and by Equations (24) and (25) we get for any
[TABLE]
concluding the proof of the lemma. ∎
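The Bernstein construction used in the proof can be illustrated numerically. The following is a minimal sketch, not the authors' code: the target `g` (a 1-Lipschitz function on the unit interval), the degree `n`, and all function names are assumptions chosen for illustration. It compares the observed uniform error with the standard bound L / (2 sqrt(n)), which follows from the Cauchy-Schwarz inequality rather than from the Chebyshev-based argument used above.

```python
from math import comb, sqrt

def bernstein_basis(k, n, x):
    # k-th Bernstein basis polynomial of degree n, evaluated at x in [0, 1].
    return comb(n, k) * x**k * (1 - x)**(n - k)

def bernstein_approx(g, n, x):
    # n-th Bernstein polynomial of g: sum over k of g(k/n) * basis_k(x).
    return sum(g(k / n) * bernstein_basis(k, n, x) for k in range(n + 1))

def sup_error(g, n, grid_size=201):
    # Largest deviation between g and its Bernstein polynomial on a uniform grid.
    xs = [i / (grid_size - 1) for i in range(grid_size)]
    return max(abs(bernstein_approx(g, n, x) - g(x)) for x in xs)

def g(x):
    # Illustrative 1-Lipschitz target with a kink at 1/2 (hypothetical choice).
    return abs(x - 0.5)

if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, sup_error(g, n), 1 / (2 * sqrt(n)))
```

The error decays only like the inverse square root of the degree for this non-smooth target, which is consistent with the polynomial degree in Lemma 4 needing to grow as the accuracy improves.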
We are finally ready to prove Thm. 1.
Proof of Thm. 1.
From Lemma 4, we have an even polynomial of degree , such that
[TABLE]
thus also
[TABLE]
Invoke Thm. 8 times to approximate each of to accuracy , using depth networks , , with activations of width , thus obtaining for any
[TABLE]
Consider the depth network concatenating the networks , having output bias of and having width
[TABLE]
We compute for any
[TABLE]
From Equations (26) and (27), the above is upper bounded by
[TABLE]
The proof of Thm. 1 is complete. ∎
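The step of concatenating several depth 2 networks into a single one can be sketched as follows. This is only an illustration under assumed conventions: a network is represented by a hypothetical tuple `(W, b, v, b0)` of hidden weights, hidden biases, output weights and output bias, with ReLU activations. Stacking the hidden units and rescaling the output weights gives a depth 2 network whose width is the sum of the individual widths and whose output is the corresponding linear combination.

```python
def relu(t):
    return max(0.0, t)

def evaluate(net, x):
    # net = (W, b, v, b0): output is b0 + sum_i v[i] * relu(<W[i], x> + b[i]).
    W, b, v, b0 = net
    hidden = [relu(sum(w * t for w, t in zip(row, x)) + bias)
              for row, bias in zip(W, b)]
    return b0 + sum(vi * h for vi, h in zip(v, hidden))

def combine(nets, coeffs, out_bias=0.0):
    # Build one depth 2 network computing out_bias + sum_j coeffs[j] * nets[j](x).
    W, b, v = [], [], []
    b0 = out_bias
    for (Wj, bj, vj, b0j), a in zip(nets, coeffs):
        W.extend(Wj)                   # hidden units are simply stacked
        b.extend(bj)
        v.extend(a * t for t in vj)    # output weights absorb the coefficient
        b0 += a * b0j                  # output biases are merged into one
    return (W, b, v, b0)

# Two small hypothetical networks on inputs in R^2.
net1 = ([[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0], [1.0, -2.0], 0.3)
net2 = ([[2.0, 0.0]], [0.5], [3.0], -1.0)
coeffs = [0.5, -2.0]
combined = combine([net1, net2], coeffs, out_bias=0.25)
x = [0.7, -0.2]
expected = 0.25 + 0.5 * evaluate(net1, x) - 2.0 * evaluate(net2, x)
```

The same construction applies wherever the text forms a linear combination of depth 2 networks.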
3.2 Proof of Thm. 3
Let be -Lipschitz on . By setting the bias term of the output neuron of the approximating depth network to , we may assume w.l.o.g. that to begin with. Moreover, since we do not care about the approximation attained on , we may set for any .
Now, instead of uniformly approximating directly, we can approximate a smoothed -approximation of it attained by , where is the convolution operation and is the Gaussian density function with mean and covariance matrix . Equivalently, we can define as
[TABLE]
where is distributed according to . We note that this is a uniform approximation of , since
[TABLE]
Since smooth functions have well-behaved Fourier transforms, this will make the use of Thm. 2 much more convenient. We thus have that attaining a uniform -approximation of on will suffice to finish the proof.
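The claim that Gaussian smoothing gives a uniform approximation can be checked numerically. The sketch below is illustrative only: the radial target `f`, the dimension `d`, the smoothing scale `sigma`, and the sample count are all assumptions, since the corresponding symbols are not recoverable from the text. For an L-Lipschitz function, the gap between the smoothed function and the original at any point is at most L times the expected norm of the Gaussian noise, which is at most L * sigma * sqrt(d).

```python
import math
import random

def f(x):
    # Illustrative 1-Lipschitz radial function: max(0, 1 - ||x||).
    return max(0.0, 1.0 - math.sqrt(sum(t * t for t in x)))

def smoothed(f, x, sigma, num_samples=5000, seed=0):
    # Monte Carlo estimate of E[f(x + sigma * Z)], Z a standard Gaussian vector.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += f([t + sigma * rng.gauss(0.0, 1.0) for t in x])
    return total / num_samples

d, sigma = 5, 0.05
points = [[0.0] * d, [0.5] + [0.0] * (d - 1), [1.0] + [0.0] * (d - 1)]
gaps = [abs(smoothed(f, x, sigma) - f(x)) for x in points]
bound = sigma * math.sqrt(d)  # Lipschitz constant 1 times sigma * sqrt(d)

if __name__ == "__main__":
    print(gaps, bound)
```

The largest gaps occur at the kinks of `f` (the origin and the unit sphere), where smoothing has the most effect.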
We begin by upper bounding . Since and is -Lipschitz, we have that , where denotes the -dimensional Lebesgue measure. Consequently, since an upper bound implies a similar upper bound on the norm of the Fourier transform, we have that for any . Since the Fourier transform of a Gaussian pdf is another Gaussian with inverse variance, we have from the convolution-multiplication theorem that , for all . We thus compute
[TABLE]
where Eq. (28) is due to the absolute moments of a normal variable with mean [math] and standard deviation satisfying (see Winkelbauer (2012, Eq. (18))), Eq. (29) is due to and , and Eq. (30) is due to the inequality .
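For reference, the absolute-moment identity for a normal variable can be evaluated directly. The sketch below assumes a zero-mean variable (the mean in the sentence above is not recoverable from the text, so this is an assumption) and uses the standard closed form E|X|^nu = sigma^nu * 2^(nu/2) * Gamma((nu+1)/2) / sqrt(pi); the function name `abs_moment` is hypothetical.

```python
import math

def abs_moment(nu, sigma):
    # Central absolute moment E|X|^nu of X ~ N(0, sigma^2), for nu > -1.
    return sigma**nu * 2**(nu / 2) * math.gamma((nu + 1) / 2) / math.sqrt(math.pi)

if __name__ == "__main__":
    # Sanity checks against familiar special cases.
    print(abs_moment(1, 2.0), 2.0 * math.sqrt(2 / math.pi))  # mean absolute value
    print(abs_moment(2, 1.5), 1.5**2)                        # variance
    print(abs_moment(4, 1.0), 3.0)                           # fourth moment
```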
We now split our analysis into two cases, depending on the value of , the constant guaranteed from Thm. 2. In both cases we will need the following:
Claim 1.
We have
[TABLE]
The claim is a straightforward result derived by computing the partial derivatives of the left hand side and showing it is monotonically decreasing for any and in its domain, and therefore its proof is omitted.
We begin by assuming . Then, substituting in Eq. (1), we get
[TABLE]
where the last inequality is due to Claim 1 and the assumption that , which will always hold for small enough since is always finite and positive for a non-constant .
For the second case, assume . Then choosing we similarly have
[TABLE]
where likewise, the last inequality uses Claim 1 and the assumptions that and .
We conclude using Eq. (31) that can be -approximated using a depth ReLU network of width
[TABLE]
completing the proof of Thm. 3.
3.3 Proof of Thm. 4
Our proof essentially reduces the assumptions in the theorem statement to those of Daniely (2017, Example 2), who showed that any depth ReLU network which approximates the non-radial function to an expected accuracy of at most with respect to the uniform distribution on , while having weights bounded by , necessarily has width at least .
Suppose that is approximable to accuracy using a depth network of width , having weights bounded by , i.e., suppose that
[TABLE]
Then in particular, we can choose to have a width network satisfying
[TABLE]
Now, let , and let , which is also a depth neural network with weights bounded by , since the scaling factor of can be simulated by multiplying the weights of the output neuron of by . We have using Eq. (32) that
[TABLE]
By taking a network which is identical to except for having its first layer weights (excluding bias terms) halved, i.e. bounded by , we have
[TABLE]
which implies that
[TABLE]
as well as
[TABLE]
since further restricting the domain cannot increase the supremum. Now, observe that
[TABLE]
Note that approximating a function and approximating its additive inverse using a neural network is equivalent (simply invert the weights of the output neuron), thus we can w.l.o.g. ignore the term in the above. Plugging Eq. (34) in Eq. (33) we obtain
[TABLE]
Finally, let be the network obtained from by duplicating its first layer weights excluding biases, i.e. , thus we have that for any . Plugging this in Eq. (35) we obtain
[TABLE]
That is, Eq. (36) establishes the existence of a width , depth ReLU network having weights bounded by , which uniformly approximates on (and in particular, provides such expected accuracy with respect to the uniform distribution on ). By Daniely (2017, Example 2), this implies that , concluding the proof of Thm. 4.
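The weight manipulations used in this proof rely on two elementary facts about depth 2 ReLU networks: scaling the first-layer weights (excluding biases) by a factor c is the same as evaluating the original network at c times the input, and negating the output layer negates the function. The sketch below illustrates both; the network parameters and function names are hypothetical and chosen only for illustration.

```python
def relu(t):
    return max(0.0, t)

def net(x, W, b, v, b0):
    # Depth 2 ReLU network: b0 + sum_i v[i] * relu(<W[i], x> + b[i]).
    return b0 + sum(
        v_i * relu(sum(w * t for w, t in zip(w_i, x)) + b_i)
        for w_i, b_i, v_i in zip(W, b, v)
    )

def scale_first_layer(W, c):
    # Multiply every first-layer weight by c, leaving the biases untouched.
    return [[c * w for w in w_i] for w_i in W]

# Hypothetical parameters for a width 3 network on R^2.
W = [[1.0, -2.0], [0.5, 0.5], [-1.5, 3.0]]
b = [0.1, -0.4, 0.2]
v = [2.0, -1.0, 0.5]
b0 = 0.3
x = [0.8, -0.6]

halved = net(x, scale_first_layer(W, 0.5), b, v, b0)   # weights halved
shrunk_input = net([0.5 * t for t in x], W, b, v, b0)  # input halved instead
negated = net(x, W, b, [-t for t in v], -b0)           # output layer negated
```

Here `halved` and `shrunk_input` agree up to floating-point error, and `negated` equals minus the original output.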
3.4 Proof of Thm. 5
In this proof, we utilize the measure on used in Eldan and Shamir (2016) for their lower bound, whose density is given by the square of
[TABLE]
where , and is the Bessel function of the first kind, of order (see reference above for further information about these functions).
Suppose we have as in the theorem statement. Define and
[TABLE]
for some parameter . That is, is a -Lipschitz approximation of the indicator . To establish Thm. 5, we shall fix and consider the measure with density used in Safran and Shamir (2017, Thm. 1), for some and where is the universal constant from Eldan and Shamir (2016). We will show that
[TABLE]
and that there exists a depth network which is based on , having width for some universal , and satisfying
[TABLE]
This would imply Thm. 5, since Equations (37) and (38) yield
[TABLE]
and by plugging for some universal constant , we have from Safran and Shamir (2017, Thm. 1, Eq. (4)) that for any such depth neural network approximation of a ball indicator of radius w.r.t. the measure with density satisfying
[TABLE]
it must hold that the width of satisfies for any , some and small enough .
We begin by proving Eq. (37), following a similar approach to that of Eldan and Shamir (2016, Lemma 7). We have by definition that
[TABLE]
Changing to polar coordinates, where denotes the volume of the unit hypersphere in , the above equals
[TABLE]
Using the definition of , this equals
[TABLE]
where the inequality is due to the definitions of , since both functions are identical on except for the interval , where they deviate from each other by at most . Moreover, for such in the integration interval we have that Lemma 14 from Eldan and Shamir (2016) applies; therefore, the integral is upper bounded by
[TABLE]
where Eq. (37) follows by taking the square root.
Moving to Eq. (38), suppose we have as in the theorem statement, approximating to accuracy . Namely, we have a depth network of width such that
[TABLE]
for some constant and some density . Let be some invertible matrix to be determined later, and consider the change of variables , , which yields
[TABLE]
In particular, we may choose (note that this indeed defines a measure as readily seen by the change of variables , , yielding ). Plugging the chosen in Eq. (39) we obtain
[TABLE]
Observing that for any , expresses a linear transformation of the input which can be simulated by an appropriate modification of the weights in the hidden layer of , we choose and , where is the identity matrix, to obtain
[TABLE]
and
[TABLE]
Now, consider the network given by
[TABLE]
Note that this is indeed a depth network of width as a linear combination of depth networks. We will show that this network approximates
[TABLE]
Taking the square roots of Equations (41) and (42), we obtain
[TABLE]
implying Eq. (38), and concluding the proof of Thm. 5.
Acknowledgements
This research is supported in part by European Research Council (ERC) Grant 754705.
Appendix A Trading-Off and Radius of Support
In this appendix, we formally show that given an inapproximability result for neural networks, using an -Lipschitz function, w.r.t. some distribution with support of radius and accuracy , it is easy to get an inapproximability result even for -Lipschitz functions, at the cost of scaling either or polynomially in :
Theorem 9.
Let be an -Lipschitz function on , and a measure over with support bounded in for some . Suppose that
[TABLE]
where is some class of functions closed under scaling (namely, if , then for any is also in ).
1. Define the -Lipschitz function . Then it holds that
[TABLE]
2. Define the -Lipschitz function , and the measure by for any set in the -algebra of (where and assuming this set is also in the -algebra of ). Then has a support bounded in , and
[TABLE]
Proof.
By the assumptions, we have , so the first part follows from definition of and the fact that is closed under scaling. As to the second part, the assertion on the support of is immediate, and we have
[TABLE]
which is at least by our assumptions and the fact that is closed under scaling. ∎
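The two rescalings behind Theorem 9 can be illustrated in one dimension. The formulas of the theorem are not recoverable from the text, so the sketch below assumes that the first rescaling divides the function by its Lipschitz constant L and the second evaluates it at x / L on a domain stretched by L; the target `f`, the values of `L` and `R`, and all function names are hypothetical.

```python
import math

def lipschitz_estimate(h, lo, hi, num_points=400):
    # Largest slope between neighbouring grid points; on a one-dimensional grid
    # this equals the largest slope over all pairs of grid points.
    xs = [lo + (hi - lo) * i / (num_points - 1) for i in range(num_points)]
    return max(
        abs(h(xs[i + 1]) - h(xs[i])) / (xs[i + 1] - xs[i])
        for i in range(num_points - 1)
    )

L, R = 20.0, 1.0

def f(x):
    # Illustrative L-Lipschitz function on [-R, R].
    return math.sin(L * x)

def f_scaled_output(x):
    # Assumed first rescaling: divide the output by L (domain unchanged).
    return f(x) / L

def f_scaled_input(x):
    # Assumed second rescaling: stretch the domain to [-L*R, L*R].
    return f(x / L)

est_f = lipschitz_estimate(f, -R, R)
est_out = lipschitz_estimate(f_scaled_output, -R, R)
est_in = lipschitz_estimate(f_scaled_input, -L * R, L * R)
```

In the first case the function values shrink by a factor of L, so the accuracy scale shrinks with them; in the second the values are unchanged but the support radius grows by a factor of L.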
- [1] Abbe and Sandon [2018] E. Abbe and C. Sandon. Provable limitations of deep learning. arXiv preprint arXiv:1812.06369, 2018.
- [2] Andrews et al. [1999] G. E. Andrews, R. Askey, and R. Roy. Special Functions, volume 71 of Encyclopedia of Mathematics and its Applications, 1999.
- [3] Barron [1993] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
- [4] Boucheron et al. [2005] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
- [5] Cohen et al. [2015] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: a tensor analysis. arXiv preprint arXiv:1509.05009, 556, 2015.
- [6] Daniely [2017] A. Daniely. Depth separation for neural networks. arXiv preprint arXiv:1702.08489, 2017.
- [7] Delalleau and Bengio [2011] O. Delalleau and Y. Bengio. Shallow vs. deep sum-product networks. In NIPS, pages 666–674, 2011.
- [8] Eldan and Shamir [2016] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In 29th Annual Conference on Learning Theory, pages 907–940, 2016.
