Linearized two-layers neural networks in high dimension
Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari

TL;DR
This paper analyzes the approximation capabilities of linearized two-layer neural networks in high-dimensional settings, revealing how they fit polynomial functions and relate to kernel methods under different regimes.
Contribution
It provides a rigorous characterization of the polynomial approximation limits of random feature and neural tangent kernel models in high dimensions.
Findings
RF fits degree-ℓ polynomials in the approximation-limited regime
NT fits degree-(ℓ+1) polynomials in the approximation-limited regime
Kernel methods are limited to degree-ℓ polynomials in the sample size-limited regime
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari
Department of Electrical Engineering, Stanford University; Institute for Computational and Mathematical Engineering, Stanford University; Department of Statistics, Stanford University; Department of Electrical Engineering and Department of Statistics, Stanford University
Abstract
We consider the problem of learning an unknown function $f_\star$ on the $d$-dimensional sphere with respect to the square loss, given i.i.d. samples $\{(y_i, x_i)\}_{i \le n}$ where $x_i$ is a feature vector uniformly distributed on the sphere and $y_i = f_\star(x_i)$. We study two popular classes of models that can be regarded as linearizations of two-layers neural networks around a random initialization: the random features model of Rahimi-Recht (RF); the neural tangent kernel model of Jacot-Gabriel-Hongler (NT). Both these approaches can also be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and enjoy universal approximation properties when the number of neurons $N$ diverges, for a fixed dimension $d$.
We consider two specific regimes: the approximation-limited regime, in which $n = \infty$ while $d$ and $N$ are large but finite; and the sample size-limited regime in which $N = \infty$ while $d$ and $n$ are large but finite. In the first regime, we prove that if $d^{\ell+\delta} \le N \le d^{\ell+1-\delta}$ for small $\delta > 0$, then RF effectively fits a degree-$\ell$ polynomial in the raw features, and NT fits a degree-$(\ell+1)$ polynomial. In the second regime, both RF and NT reduce to kernel methods with rotationally invariant kernels. We prove that, if the number of samples is $d^{\ell+\delta} \le n \le d^{\ell+1-\delta}$, then kernel methods can fit at most a degree-$\ell$ polynomial in the raw features. This lower bound is achieved by kernel ridge regression. Optimal prediction error is achieved for vanishing ridge regularization.
1 Introduction and main results
In the canonical statistical learning problem, we are given independent and identically distributed (i.i.d.) pairs $(y_i, x_i)$, $i \le n$, where $x_i \in \mathbb{R}^d$ is a feature vector and $y_i \in \mathbb{R}$ is a label or response variable. We would like to construct a function $f : \mathbb{R}^d \to \mathbb{R}$ which allows us to predict future responses. Throughout this paper, we will measure the quality of a predictor $f$ via its square prediction error (risk): $R(f) \equiv \mathbb{E}\{(y - f(x))^2\}$.
1.1 Background
For a number of important applications, state-of-the-art performance is obtained by representing the function $f$ as a multi-layers neural network. The simplest model in this class is given by two-layers networks (NN):
[TABLE]
Here $N$ is the number of neurons and $\sigma : \mathbb{R} \to \mathbb{R}$ is an activation function.
Two-layers neural networks were extensively studied in the nineties, with a focus on two goals: establishing approximation guarantees over classical function spaces; and controlling the generalization error via Rademacher complexity arguments. We refer to [Pin99, AB09] for surveys of these results.
Computational aspects were notably under-represented within these early theoretical contributions. On the contrary, it is nowadays increasingly clear that computational and statistical aspects cannot be separated in the analysis of neural networks (see, e.g. [SHN+18, MMN18, CB18]). Indeed, the optimization algorithm does not simply compute the unique minimizer of a regularized empirical risk: it instead selects one among many possible near-minimizers, whose generalization properties can vary significantly. Therefore, the specific optimization algorithm is an integral part of the definition of the regularization method.
A concrete scenario in which this interplay can be understood precisely is the so-called ‘neural tangent kernel’ regime. First explicitly described in [JGH18], this regime has attracted a considerable amount of work. The basic idea is that, for highly overparametrized networks, the network weights barely change from their random initialization. We can therefore replace the nonlinear function class by its first order Taylor expansion around this initialization.
Denoting by the weights at initialization, a first order Taylor expansion yields
[TABLE]
where is the neural network at initialization. In other words, is a function in the direct sum , where we defined
[TABLE]
Here is a matrix whose -th row is the vector , and is the derivative of the activation function with respect to its argument (if has a density, only needs to be weakly differentiable).
We will refer to $\mathcal{F}_{\rm RF}(W)$ as the ‘random features’ (RF) model: it amounts to fixing the first layer, and only optimizing the coefficients in the second layer. Equivalently, $\mathcal{F}_{\rm RF}(W)$ corresponds to the first order Taylor expansion of the network with respect to the second layer weights. This model can be traced back to the work of Neal [Nea96], and was successfully developed by Rahimi and Recht [RR08] as a randomized approximation to kernel methods.
The second function class $\mathcal{F}_{\rm NT}(W)$ corresponds to the first order Taylor expansion of the network with respect to the first layer weights [JGH18]. We will refer to $\mathcal{F}_{\rm NT}(W)$ as the neural tangent class. (Footnote 1: Often the term ‘neural tangent’ is reserved for the direct sum $\mathcal{F}_{\rm RF}(W) \oplus \mathcal{F}_{\rm NT}(W)$. We find it more convenient to give distinct names to each of the two terms, especially since $\mathcal{F}_{\rm RF}(W)$ has much smaller dimension than $\mathcal{F}_{\rm NT}(W)$ for large $d$.)
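As a concrete illustration of the two linearized classes, the following minimal sketch builds the design matrices one would regress on. It assumes the RF class consists of functions $\sum_i a_i \sigma(\langle w_i, x\rangle)$ with scalar $a_i$, and the NT class of functions $\sum_i \langle a_i, x\rangle \sigma'(\langle w_i, x\rangle)$ with $a_i \in \mathbb{R}^d$. The function names, the shifted ReLU and the shift value `b0 = 0.5` are hypothetical choices made for illustration only.

```python
import numpy as np

def shifted_relu(u, b0=0.5):
    """sigma(u) = max(u - b0, 0); the shift b0 is an illustrative assumption."""
    return np.maximum(u - b0, 0.0)

def shifted_relu_grad(u, b0=0.5):
    """Weak derivative of the shifted ReLU: indicator of u > b0."""
    return (u > b0).astype(float)

def rf_features(X, W):
    """RF design matrix of shape (n, N): entry (i, j) is sigma(<w_j, x_i>)."""
    return shifted_relu(X @ W.T)

def nt_features(X, W):
    """NT design matrix of shape (n, N*d): block j of row i is sigma'(<w_j, x_i>) * x_i."""
    gates = shifted_relu_grad(X @ W.T)           # (n, N)
    blocks = gates[:, :, None] * X[:, None, :]   # (n, N, d)
    return blocks.reshape(X.shape[0], -1)

rng = np.random.default_rng(0)
n, d, N = 5, 4, 3
X = rng.standard_normal((n, d))
W = rng.standard_normal((N, d))
A = rng.standard_normal((N, d))  # NT coefficients a_j in R^d

# A linear model on the NT features equals sum_j <a_j, x> * sigma'(<w_j, x>).
nt_pred = nt_features(X, W) @ A.reshape(-1)
direct = np.sum((X @ A.T) * shifted_relu_grad(X @ W.T), axis=1)
```

The RF model has $N$ free parameters while the NT model has $Nd$, which is why the NT design matrix is $d$ times wider.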
A sequence of recent papers proves that, in a certain overparametrized regime, gradient descent (GD) applied to the nonlinear neural network class effectively converges to a model in . Namely, if the number of neurons is larger than a threshold , and training is initialized with where , then gradient descent converges exponentially fast to weights such that is well approximated by a function in . The specific value of the threshold for the onset of this NT regime has been steadily pushed down over the last year [DZPS18, DLL+18, AZLS18, ZCZG18, ADH+19].
Does the NT regime explain the power of multi-layers neural networks, when trained by gradient descent methods? From an empirical point of view, the evidence is not univocal [LXS+19, GSJW19, COB19]. From a theoretical point of view, while the expressivity of neural networks is superior to that of NT models, this hypothesis is not easy to dismiss for at least two reasons. First, neural networks learned by gradient descent algorithms form a significantly smaller class than general networks. Second, the answer depends on the data distribution, the target function and the sample size.
In order to clarify this question, we explore the behavior of RF and NT models in the high-dimensional setting. More precisely, we consider two specific asymptotic regimes:
1. The infinite sample size case in which $n = \infty$, and $N$, $d$ diverge while being polynomially related. In this case the prediction error reduces to the approximation error, for either model (RF or NT).
2. The infinite width regime in which $N = \infty$, and $n$, $d$ diverge while being polynomially related. In this case (and under a suitable bound on the norm of the coefficients) both classes $\mathcal{F}_{\rm RF}$, $\mathcal{F}_{\rm NT}$ reduce to certain reproducing kernel Hilbert spaces (RKHS).
In both cases we obtain sharp results, up to errors vanishing as . Crucially, our results hold pointwise, i.e. they provide a characterization of approximation and generalization error which hold for a given function . This allows us to derive precise separation results between NN and NT models.
1.2 A parenthesis
The approximation properties of neural networks have been studied for over three decades [DHM89, Cyb89, Hor91, Bar93, MM94, GJP95, Mha96, Pet98, Mai99, Pin99]. It is useful to discuss the relation between the questions outlined above and existing literature.
A number of results are available on the approximation of functions in certain smoothness classes by two-layers neural networks. In particular [Bar93] controls smoothness by the average frequency content in the Fourier transform (the ‘Barron norm’), while [Mha96, Pet98, Mai99] use classical Sobolev norms. For instance [Mai99] proves that -neurons NN approximate functions in the Sobolev ball with worst case error
[TABLE]
for some unspecified functions . (Similar results are found in [Pet98].) These results cannot be used for our purposes.
First of all, we are interested in the NT class which is potentially much less powerful than NN.
Second, bounds of the type (1) make it hard to prove separation results between NN and NT. In order to prove such a separation, we would have to prove that neural networks trained by gradient descent have good approximation properties, uniformly over Sobolev balls. This objective is currently out of reach. Our pointwise approximation results make it much easier to prove separation statements.
Third, earlier work neglects polynomial dependencies in . Bounds of the type (1) have weak implications when both and are large, say , . We will instead prove sharp asymptotic results that are valid in this regime. As illustrated in the next section, our analysis captures the actual behavior in a quantitative manner, already when .
Quantitative results in the high-dimensional regime have been proved only recently. In particular, Bach [Bac17b] established quantitative upper and lower bounds for the approximation error in the RF model. However, these results do not have direct implications on the NT model which is our main interest here. Further, lower bounds in [Bac17b] are, as before, worst case over a certain RKHS. (See also [Bac13, AM15, RR17] for related work.)
Similar considerations apply to the generalization error of kernel methods. While this is a classical topic [CST+00, CDV07, RR17, LR18], earlier work proves minimax upper and lower bounds. Establishing pointwise lower bounds is instead important in order to understand precisely the separation between neural networks and their linearized counterparts. We refer to Section 4 for further discussion of related work.
1.3 A numerical experiment
In order to illustrate the approximation behavior of RF and NT models, we present a simple simulation study. We consider feature vectors normalized so that , and otherwise uniformly random, and responses , for a certain function . Indeed, this will be the setting throughout the paper: (where denotes the sphere with radius in dimensions) and . We draw random weights . We use samples to fit a model in or . We learn the model parameters using least squares. If the model is overparametrized, we select the minimum -norm solution. (We refer to Appendix A for simulations using ridge regression instead.) We estimate the risk (test error) using fresh samples, and normalize it by the risk of the trivial model .
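The procedure just described can be sketched in a few lines for the RF model. This is only an illustration under stated assumptions, not the authors' code: features are taken uniform on the sphere of radius $\sqrt{d}$, weights uniform on the unit sphere, the activation is a shifted ReLU with a hypothetical shift `b0 = 0.5`, the trivial model is taken to be $f(x) = 0$, and the quadratic target, the dimensions and the sample sizes are all made up for the example.

```python
import numpy as np

def sample_sphere(num, d, radius, rng):
    """Draw `num` points uniformly on the sphere of the given radius in R^d."""
    g = rng.standard_normal((num, d))
    return radius * g / np.linalg.norm(g, axis=1, keepdims=True)

def normalized_rf_risk(f_star, d, N, n_train, n_test=2000, b0=0.5, seed=0):
    """Test risk of least-squares RF regression, divided by the risk of f(x) = 0."""
    rng = np.random.default_rng(seed)
    W = sample_sphere(N, d, 1.0, rng)                  # random first-layer weights
    X_train = sample_sphere(n_train, d, np.sqrt(d), rng)
    X_test = sample_sphere(n_test, d, np.sqrt(d), rng)
    Z_train = np.maximum(X_train @ W.T - b0, 0.0)      # shifted ReLU features
    Z_test = np.maximum(X_test @ W.T - b0, 0.0)
    # lstsq returns the minimum l2-norm solution when the system is underdetermined.
    coef, *_ = np.linalg.lstsq(Z_train, f_star(X_train), rcond=None)
    y_test = f_star(X_test)
    return np.mean((y_test - Z_test @ coef) ** 2) / np.mean(y_test ** 2)

def quadratic_target(X):
    """Hypothetical zero-mean quadratic target, orthogonal to linear functions."""
    return X[:, 0] ** 2 - 1.0

risk = normalized_rf_risk(quadratic_target, d=20, N=10, n_train=2000)
```

A normalized risk close to one means the fitted model does no better than the trivial predictor. Sweeping `n_train` for fixed `d` and `N` gives curves of the kind discussed in this section.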
Figures 1, 2, 3 report the results of such a simulation using RF –for Figure 1– and NT –for Figures 2 and 3. We use shifted ReLU activations , . The choice of is not essential: (Lebesgue-)almost every has similar behavior. In contrast, the case is degenerate because is equal to a linear function plus an even function.
The target functions in these examples are quite simple. Figures 1 and 2 use a quadratic function . In Figure 3, the target function is a third-order polynomial .
The results are somewhat disappointing: in two cases (first and third figures) RF and NT models do not beat the trivial predictor. In one case (the second one), the NT model surpasses the trivial baseline, and its risk appears to decrease to $0$ as the number of samples increases. We also note that the risk shows a cusp when $n \approx p$, with $p$ the number of parameters ($p = N$ for RF, and $p = Nd$ for NT). This phenomenon is related to overparametrization, and will not be discussed further in this paper (see [BHMM18, BHX19, HMRT19, MM19] for relevant work). We will instead focus on the population behavior $n \to \infty$.
In other words, the RF model does not appear to be able to learn a simple quadratic function, and the NT model does not appear to be able to learn a third order polynomial. Our main theorems (presented in the next sections) capture in a precise manner this behavior. In particular,
- We will prove that for , RF does not outperform the trivial predictor on any function that has vanishing projection on linear functions. Similarly, NT does not outperform the trivial predictor on any function that has vanishing projection on linear and quadratic functions.
- In contrast, there exist neural networks in with neurons, and a small approximation error both for and (see, e.g., [Bac17b], or [MMN18, Proposition 1]).
These two points illustrate the gap in approximation power between NT (or RF) and NN.
We demonstrate the second point empirically in Fig. 4 by choosing weight vectors , where are i.i.d. uniformly random indices, and the scaling factor is . Fixing these random first-layer weights, we fit the second-layer weights by least squares. The risk achieved is an upper bound on the minimum risk in the model, namely , and is significantly smaller than the baseline . (The risk reported in Fig. 4 can also be interpreted as a ‘random features’ risk. However, the specific distribution of the vectors is tailored to the function , and hence not achievable within the RF model.)
1.4 Summary of main results
Approximation error of RF models.
If for some , then the approximation error of RF is asymptotically equivalent to the approximation error of fitting a linear function in the raw covariates (i.e. least squares with the model , , ). More generally, if , then RF is equivalent to fitting a linear function over all monomials of degree at most in .
The equivalence between RF regression and polynomial regression holds pointwise for target function .
Approximation error of NT models.
If , then the approximation error of NT is asymptotically equivalent to the approximation error of fitting a linear function over monomials of degree at most two in (i.e. least squares with the model , , , ). More generally, if , then NT is equivalent to fitting a linear function over all monomials of degree at most in .
Again, this result holds pointwise over the choice of .
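The summary above can be turned into a small bookkeeping helper. The sketch below assumes the scaling stated in the abstract (RF fits degree $\ell$ and NT fits degree $\ell+1$ when $N$ lies between $d^{\ell}$ and $d^{\ell+1}$). The function names are hypothetical, and values of $N$ close to an exact power of $d$ fall outside the regime covered by the results, so the output there should not be trusted.

```python
from math import comb

def effective_degree(N, d, model="RF"):
    """Polynomial degree effectively fitted with N neurons in dimension d.

    Finds the integer ell with d**ell <= N < d**(ell + 1) and returns
    ell for RF and ell + 1 for NT.
    """
    ell = 0
    while d ** (ell + 1) <= N:
        ell += 1
    if model == "RF":
        return ell
    if model == "NT":
        return ell + 1
    raise ValueError("model must be 'RF' or 'NT'")

def num_monomials(d, degree):
    """Number of monomials of degree at most `degree` in d variables."""
    return comb(d + degree, degree)

# Example: d = 100 and N = 5000 neurons lies between d and d**2.
rf_degree = effective_degree(5000, 100, "RF")
nt_degree = effective_degree(5000, 100, "NT")
```

Here `num_monomials` counts monomials on $\mathbb{R}^d$. On the sphere some of them are linearly dependent, so it is an upper bound on the dimension of the corresponding polynomial space.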
Generalization error of kernel methods.
We study the generalization error of kernel methods under the same data distribution described above, for any rotationally invariant kernel on the sphere . We prove two results:
1. If the sample size is , then the generalization error of any kernel method is lower bounded by the approximation error of linear regression over monomials of degree at most in .
2. If the sample size satisfies , then the generalization error of Kernel Ridge Regression (KRR) is given by the approximation error of linear regression over monomials of degree at most in .
It is worth emphasizing two aspects of this last result. The first one is its generality. The NT kernel associated to an infinitely wide multi-layers fully connected neural network is always rotationally invariant (assuming an i.i.d. Gaussian initialization of weights, which is common in practice). Therefore, in the NT regime, multi-layers neural networks cannot outperform the trivial predictor on a target function that has vanishing projection onto degree- polynomials, unless the sample size satisfies . (For instance, they cannot outperform the trivial predictor for unless .)
The second aspect can be summarized as follows.
Optimality of near interpolators.
For , the ideal behavior of KRR is achieved for all regularization values , with depending on and the activation function. In particular, it is achieved by ‘near interpolators’ (corresponding to ) i.e. functions that have negligible training error.
2 Approximation error of linearized neural networks
In this section, we state formally our results about the approximation error of RF and NT models. We define the minimum population error for any of the models by
[TABLE]
Notice that this is a random variable because of the random features encoded in the matrix . Also, it depends implicitly on , but we will make this dependence explicit only when necessary.
For , we denote by the orthogonal projector onto the subspace of polynomials of degree at most . (We also let .) In other words, is the function obtained by linear regression of onto monomials of degree at most . Throughout this paper ‘with high probability’ means ‘with probability converging to one as ’. The notations , , , mean, respectively, , , , . Given random variables , and deterministic quantities , we write (and so on) if the above holds in probability.
2.1 Approximation error of random features models
Assumption 1 (Assumptions for the RF model at level $\ell$).
Let be a sequence of functions .
- (a)
, where is the distribution of for , and .
- (b)
We have
[TABLE]
where , and is the -th Gegenbauer polynomial (see Section 5).
Theorem 1 (Risk of the RF model).
Let be a sequence of functions. Let with independently. Then the following hold.
- (a)
Assume for a fixed integer and any sequence such that (in particular, is sufficient for any fixed ). Let satisfy Assumption 1.(a). Then, for any , the following holds with high probability:
[TABLE]
- (b)
Assume for some integer , and satisfy Assumption 1.(b) at level . Then for any , the following holds with high probability:
[TABLE]
See Section 6 for the proof of lower bound, and Section 7 for the proof of upper bound.
In words, Eq. (3) amounts to saying that when , the risk of the random feature model can be approximately decomposed in two parts, each non-negative, and each with a simple interpretation:
[TABLE]
The second contribution, is simply the risk achieved by linear regression with respect to polynomials of degree at most . The first contribution is the risk of the RF model when applied to the low-degree component of . Equation (4) implies that when , the first contribution vanishes asymptotically.
If both Assumptions 1.(a) and 1.(b) hold and for some integer , we thus obtain
[TABLE]
In particular, this shows that RF fits a linear function over polynomials of maximum degree .
Remark 2.1.
Note that Theorem 1.(a) holds under very weak conditions on the activation function, which may depend on the dimension $d$. The condition can also be rewritten as , where is the one-dimensional projection of the uniform measure over . In particular:
1. is supported on . It is therefore sufficient that .
2. By an explicit calculation, the density of . Since this density is bounded, it is sufficient that is square integrable with respect to the Lebesgue measure on .
Remark 2.2.
If the activation is independent of $d$, Assumption 1.(b) is satisfied as long as for , where is the -th Hermite coefficient of (see Section 5 for definitions).
Remark 2.3.
The conclusion of Theorem 1. can be established by a somewhat simpler proof if the activation function is independent of and satisfies the following regularity conditions: for some ; is not a polynomial of degree smaller than . Under these conditions, the conclusion holds for . (Footnote 2: A first version of this manuscript, posted on arXiv, assumed such conditions.)
Note that Assumption 1.(b) requires in particular that is not a polynomial of degree strictly smaller than . This is easily seen to be a necessary condition, since any linear combination of polynomials of degree is a polynomial of degree . For the same reason, this condition also arises in the approximation theory of neural networks [Pin99].
2.2 Approximation error of neural tangent models
For the NT model, the proof, while following the same scheme as for RF, is more challenging. We restrict our setting to a fixed activation function (independent of dimensions) which is weakly differentiable, with weak derivative that does not grow too fast (in particular, exponential growth is fine). We further require the Hermite decomposition of to satisfy a mild ‘genericity’ condition. Recall that the -th Hermite coefficient of a function can be defined as , where is the -th Hermite polynomial (see Section 5 for further background).
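Hermite coefficients of a given activation are easy to evaluate numerically, which helps when checking genericity conditions of this kind. The sketch below assumes the convention $\mu_k(h) = \mathbb{E}\{h(G)\,\mathrm{He}_k(G)\}$ with $G \sim \mathsf{N}(0,1)$ and $\mathrm{He}_k$ the probabilists' Hermite polynomial (unnormalized). Other normalizations differ by a factor depending only on $k$. The function name and the number of quadrature points are illustrative choices.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

def hermite_coefficient(h, k, quad_points=100):
    """Approximate mu_k(h) = E[h(G) He_k(G)] for G ~ N(0, 1) by Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(quad_points)  # weight function exp(-x^2 / 2)
    basis = np.zeros(k + 1)
    basis[k] = 1.0                            # coefficient vector selecting He_k
    he_k = hermeval(nodes, basis)
    # The weights integrate against exp(-x^2/2); divide by sqrt(2*pi) for the Gaussian density.
    return np.sum(weights * h(nodes) * he_k) / np.sqrt(2.0 * np.pi)

def relu(x):
    return np.maximum(x, 0.0)

# First few Hermite coefficients of the (unshifted) ReLU under this convention.
mu = [hermite_coefficient(relu, k) for k in range(4)]
```

Because the ReLU has a kink at the origin, quadrature is only approximate for some coefficients. For smooth activations the convergence is much faster, and for polynomials of low degree it is exact.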
Assumption 2 (Assumptions for the NT model at level $\ell$).
Let be an activation function .
- (a)
The function is weakly differentiable, with weak derivative such that for some constants , with .
- (b)
The Hermite coefficients are such that there exist such that and
[TABLE]
- (c)
The Hermite coefficients of satisfy for any .
Theorem 2 (Risk of the NT model).
Let be a sequence of functions. Let with independently. We have the following results.
- (a)
Assume for a fixed integer , and let satisfy Assumptions 2.(a) and 2.(b) at level . Then, for any , the following holds with high probability:
[TABLE]
- (b)
Assume for some integer , and let satisfy Assumptions 2.(a) and 2.(c) at level . Then for any , the following holds with high probability:
[TABLE]
See Section 8 for the proof of lower bound, and Section 9 for the proof of upper bound.
Remark 2.4.
It is easy to check that Assumptions 2.(a) and 2.(b) hold for all , for all commonly used activations.
For instance the ReLU activation and its weak derivative have subexponential growth. Further its Hermite coefficients are and
[TABLE]
which satisfy the required condition of Theorem 2. for each . (In checking the condition, it might be useful to notice the relation .)
Assumption 2.(c) does not hold for ReLU activation , since for even. However it holds for shifted ReLU , for a generic value of the shift .
Theorems 1 and 2 can be illustrated by a cartoon, which we show as Figure 5. In words, the approximation error plotted as a function of follows a staircase: it drops close to integer values of this ratio, with each drop corresponding to the projection onto homogeneous polynomials of that degree. We can extract three useful statistical insights from these findings:
1. There is no difference between plain RF and the more recent NT approach in terms of approximation error, once we compare them at fixed number of parameters $p$. All that changes is the relation between number of parameters and number of neurons: $p = N$ for RF, and $p = Nd$ for NT. The recent work [GMMM19] actually shows some advantage for the RF model, although in a special case. It is worth mentioning that the same equivalence holds when we consider the dependence on the sample size $n$, at $N = \infty$, see Section 3.
   We notice however an important computational advantage for NT, at a constant number of parameters. Indeed, the complexity at prediction time is $O(p)$ for NT, while it is $O(pd)$ for RF.
2. RF or NT models behave similarly to expansions into orthogonal monomial basis. Also in that case, if only basis elements are included, for a ‘typical’ function , the approximation error is . (Footnote 3: Here by ‘typical’ function we mean the following. Choose a function , draw a Haar distributed orthogonal matrix , and set .)
3. Our results also suggest interesting directions to improve random feature expansions. First, if is known to depend primarily on a small subset of directions in , there will be a significant advantage in choosing the random features along that -dimensional subspace. Second, if the data points lie close to such a subspace , , one might hope that, even if the are sampled isotropically in , random feature methods will be sensitive to rather than . We plan to report on these topics in a future publication [GMMM20].
2.3 Separation between NN and RF, NT
Theorems 1 and 2 imply a separation of approximation power between two-layers neural networks and their linearization. As a simple example, consider the target function , for . This can be represented exactly by a neural network with , i.e. by a single neuron. On the other hand, the above results imply that any RF or NT model is bound to have a non-vanishing population error, if . Provided satisfies the Assumptions 1, 2, we get
[TABLE]
Here is the projection of orthogonal to the subspace of polynomials of maximum degree , in , where is the standard Gaussian measure.
Crucially, as proven in [MBM16], running gradient descent over the space of neural networks consisting of a single neuron allows one to learn the target function efficiently. In other words, we do not have simply a separation between the function classes and or , but a separation between linearized neural networks, and neural networks trained by gradient descent.
Essentially the same example was independently considered by Yehudai and Shamir in concurrent work [YS19]. These authors prove that there exist finite constants such that, if and the coefficients have magnitude at most , then there exists a vector such that, setting , then . An important difference with respect to our separation result is in the fact that Eq. (10) holds, once again, pointwise, i.e. for any fixed , while in [YS19] is chosen by an adversary who has knowledge of the vectors . Let us emphasize there are other important differences between our setting and the one of [YS19], and neither of the two analyses implies the other.
The same blueprint can be followed to prove further separation results. For instance, consider , for an orthogonal matrix and a bounded smooth function, which is not a polynomial. If is kept constant as , Theorems 1 and 2 can be used to show that are bounded away from zero and to compute their limits. On the other hand, classical results [Mai99] can be used to show that such can be approximated arbitrarily well by neural networks with neurons (with first layer weights in the span of columns of ). Unfortunately, we are not aware of general results implying that such neural networks can be learnt by gradient descent, although we expect this to be the case for certain choices of . Whenever such a result is available, it implies a separation between RF, NT, and practical neural networks.
3 Generalization error of kernel methods
We consider next the limit of very wide networks. Namely, we let before . It is known since the work of Rahimi and Recht [RR08] that ridge regression over the function class converges in this limit to kernel ridge regression (KRR) with respect to the kernel (here expectation is with respect to )
[TABLE]
Analogously, ridge regression in can be shown to converge to KRR with respect to the kernel
[TABLE]
We will denote the corresponding RKHS by and . Quantitative estimates on the relation between and are obtained in [Bac17b], which shows that the unit ball of is well approximated by the unit ball of (endowed with the norm of the coefficients ), for large enough.
Notice that both kernels , are rotationally invariant, namely for and any orthogonal matrix . Any rotationally invariant kernel on the sphere takes the form
[TABLE]
for some function . (The scaling factor is introduced here to make contact with the normalization used in previous sections, and is not necessary: indeed, can depend itself on .)
Our results apply to general rotationally invariant kernels under very weak conditions on the function . In particular, they apply to multilayer neural networks in the neural tangent regime. Namely consider a -layers network with matrix weights , , …, . As long as all the weights are initialized as independent centered Gaussians, with variance dependent only on the layer, the resulting NT kernel is rotationally invariant. The recent papers [DZPS18, DLL+18, AZLS18, ZCZG18, ADH+19] provide conditions under which the NT approximation is accurate for SGD-trained multilayer neural networks.
Section 3.1 presents a lower bound on the prediction error of general kernel methods, and Section 3.2 derives an upper bound for kernel ridge regression.
Throughout this section, we consider the same data model as in the previous sections: we observe pairs , with , and , and independently.
3.1 Lower bound for general kernel methods
Consider any regression method of the form
[TABLE]
where is the reproducing kernel Hilbert space (RKHS) norm with respect to the kernel of the form (13). By the representer theorem [BTA11] there exist coefficients such that
[TABLE]
We are therefore led to define the following data-dependent prediction risk function for kernel methods
[TABLE]
The next theorem provides a decomposition of this generalization error that is analogous to the one given in Theorem 1.(a). Notice however that the controlling factor is not the number of neurons , but instead the sample size .
Theorem 3.
Assume for a fixed integer and any sequence such that (in particular, is sufficient for any fixed ). Let be a sequence of functions, with . Assume . Then for any , with high probability as , we have
[TABLE]
Proof.
This follows immediately from Theorem 1.(a). Indeed, setting and , we obtain , whence the claim follows by applying Eq. (3). ∎
3.2 Upper bound for kernel ridge regression
Kernel ridge regression is one specific way of selecting the coefficients in Eq. (15), namely by using in Eq. (14). Solving for the coefficients yields
[TABLE]
where the kernel matrix is given by
[TABLE]
and . The prediction function at location is given by
[TABLE]
where
[TABLE]
The test error of empirical kernel ridge regression is defined as
[TABLE]
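The kernel ridge regression predictor described above can be sketched as follows. The sketch assumes the standard closed form: coefficients $(H + \lambda I_n)^{-1} y$ with kernel matrix $H_{ij} = h(\langle x_i, x_j\rangle/d)$, and prediction $h(x)^{\top}(H + \lambda I_n)^{-1} y$ where $h(x)_i = h(\langle x, x_i\rangle/d)$. The exact normalization of $\lambda$ used in the paper may differ. The kernel function `np.exp`, the target and the problem sizes are hypothetical choices for illustration.

```python
import numpy as np

def krr_predict(X, y, X_new, h, lam):
    """Kernel ridge regression with the inner-product kernel h(<x, x'> / d)."""
    n, d = X.shape
    H = h(X @ X.T / d)                               # (n, n) kernel matrix
    coef = np.linalg.solve(H + lam * np.eye(n), y)   # (H + lam I)^{-1} y
    return h(X_new @ X.T / d) @ coef                 # predictions at the rows of X_new

rng = np.random.default_rng(0)
n, d = 30, 20
G = rng.standard_normal((n, d))
X = np.sqrt(d) * G / np.linalg.norm(G, axis=1, keepdims=True)  # sphere of radius sqrt(d)
y = X[:, 0] + 0.5 * X[:, 1]                                     # illustrative target

# With a very small ridge the predictor nearly interpolates the training data.
train_fit = krr_predict(X, y, X, np.exp, lam=1e-8)

# Rotating all inputs by the same orthogonal matrix leaves the predictions unchanged.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
rotated_fit = krr_predict(X @ Q, y, X @ Q, np.exp, lam=1e-8)
```

The second check illustrates rotational invariance: the predictor depends on the data only through inner products, which an orthogonal transformation preserves.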
We assume that are positive-definite kernels, and we consider the associated eigenvalues:
[TABLE]
where we recall that is the -th Gegenbauer polynomial.
Assumption 3 (Assumption for KRR at level $\ell$).
Let be a sequence of functions , such that is a positive semidefinite kernel.
- (a)
, where is the distribution of for , where .
- (b)
There exists a constant such that
[TABLE]
Theorem 4.
Assume for some integer and . Let be a sequence of functions. Let be a sequence of kernels satisfying Assumption 3 at level . Further define
[TABLE]
If has zero mean (i.e. ) further assume that is centered (i.e. ).
Let with independently, and and . Then for any , and any regularization parameter with high probability we have
[TABLE]
See Section 10 for the proof of this theorem.
Remark 3.1.
Assume as , uniformly over , together with its derivatives, and further assume for some , . We expect this to be the case for many kernels of interest, and in particular it can be shown to be the case for and under mild conditions on the activation . Using Rodrigues’ formula described in Section 5.2, by an application of integration by parts followed by dominated convergence, we get
[TABLE]
where is the -th derivative of . Notice further that for all since is positive semidefinite by definition. Therefore, as long as for all , Assumption 3 is satisfied, and is bounded away from $0$.
Remark 3.2.
For and if the activation is independent of , we have , and therefore Assumption 3 is satisfied as soon as for all .
Notice that the setting of Theorem 4 is the same as in classical nonparametric regression. However, classical theory typically establishes minimax consistency rates of the form [Tsy08, GKKW06]. In order to guarantee a fixed (small) error, these bounds require . Modern machine learning applications typically have and between and , and it is therefore unrealistic to consider exponential in . This regime motivates a new type of question: assuming , what is the minimum prediction error that can be achieved? This question is addressed by Theorem 4.
3.3 Separation between kernel methods and neural networks
Repeating the same argument of Section 2.3, we see that Theorems 3 and 4 imply a separation between kernel methods, with rotationally invariant kernels, and gradient-descent trained neural networks.
Namely, consider again the target function , for . As proven in [MBM16], can be learnt efficiently by minimizing the following empirical risk via gradient descent:
[TABLE]
Namely, if samples are used (and under some technical conditions on ), gradient descent reaches prediction error of order
In contrast, Theorems 3 and 4 imply that, for any integer , and any , any kernel method has test error bounded away from zero. Namely
[TABLE]
This test error is achieved by kernel ridge regression.
3.4 Near-optimality of interpolators
Let us emphasize some important statistical aspects of Theorem 4. KRR is proved to achieve near optimal prediction error (matching the lower bound of Theorem 3) pointwise, i.e. per given function . What is the nature of the predictor ? Theorems 3 and 4 imply that, in sense, must be close to a low-degree approximation of , namely .
Optimal test error is achieved for any . In particular, by taking , we obtain an interpolator, i.e. a predictor that interpolates the data . This remark is made quantitative in the following bound on the empirical risk
[TABLE]
**Theorem 5.**
Assume for some integer and . Under the same assumptions of Theorem 4, if , then
[TABLE]
where .
Proof of Theorem 5.
Recall that the empirical risk of KRR is given by Eq. (24), where can be rewritten as
[TABLE]
Therefore,
[TABLE]
From the proof of Theorem 4, we have the following lower bound on the eigenvalues . We deduce that with high probability
[TABLE]
where we simply used the law of large numbers . ∎
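The interpolation phenomenon behind Theorem 5 can be observed numerically. The following is a minimal sketch, not the setting of the theorem: it assumes an inner-product kernel of the form h(⟨x, x'⟩/d) with the hypothetical choice h = exp, a hypothetical noisy linear target, and small values of n and d chosen only so that the script runs quickly. It compares the empirical risk of kernel ridge regression for a moderate and for a very small regularization parameter.

```python
import numpy as np

def sample_sphere(n, d, rng):
    """Draw n points uniformly from the sphere of radius sqrt(d) in R^d."""
    g = rng.standard_normal((n, d))
    return np.sqrt(d) * g / np.linalg.norm(g, axis=1, keepdims=True)

def krr_predict(X, y, X_new, h, lam):
    """Kernel ridge regression with the inner-product kernel h(<x, x'>/d)."""
    n, d = X.shape
    H = h(X @ X.T / d)                               # n x n kernel matrix
    coef = np.linalg.solve(H + lam * np.eye(n), y)   # (H + lam I)^{-1} y
    return h(X_new @ X.T / d) @ coef

rng = np.random.default_rng(0)
n, d = 200, 20                                       # illustrative sizes only
X = sample_sphere(n, d, rng)
y = X[:, 0] + 0.1 * rng.standard_normal(n)           # hypothetical noisy linear target

empirical_risk = {}
for lam in (1.0, 1e-8):
    residual = y - krr_predict(X, y, X, np.exp, lam)
    empirical_risk[lam] = float(np.mean(residual ** 2))
print(empirical_risk)
```

With the tiny regularization the fitted function passes (numerically) through the training labels, so the empirical risk is close to zero, while the larger regularization leaves a visible residual.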
3.5 A conjecture for generalization error of random features model
Consider random features regression with finite sample size and a finite number of neurons. We fit data using ridge regression in the random features () model, with (where )
[TABLE]
Under the same data model of the previous sections, we are interested in the test prediction error
[TABLE]
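As an illustration of the estimator discussed here, the sketch below fits the second-layer coefficients of a random features model by ridge regression and estimates the test error on fresh samples. Everything specific in it is an assumption made for the example: the ReLU activation, the linear target, the noise level, the regularization value, the unnormalized penalty, and the convention that covariates lie on the sphere of radius √d while first-layer weights lie on the unit sphere.

```python
import numpy as np

def sample_sphere(n, d, radius, rng):
    """Draw n points uniformly from the sphere of the given radius in R^d."""
    g = rng.standard_normal((n, d))
    return radius * g / np.linalg.norm(g, axis=1, keepdims=True)

def rf_ridge(X, y, W, sigma, lam):
    """Ridge regression on the coefficients a of f(x) = sum_i a_i sigma(<w_i, x>)."""
    Z = sigma(X @ W.T)                               # n x N random feature matrix
    N = W.shape[0]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(N), Z.T @ y)

rng = np.random.default_rng(1)
n, N, d = 300, 100, 15                               # illustrative sizes only
relu = lambda t: np.maximum(t, 0.0)                  # hypothetical activation
f_star = lambda X: X[:, 0]                           # hypothetical target function

X = sample_sphere(n, d, np.sqrt(d), rng)
W = sample_sphere(N, d, 1.0, rng)                    # frozen random first layer
y = f_star(X) + 0.1 * rng.standard_normal(n)
lam = 0.1
a_hat = rf_ridge(X, y, W, relu, lam)

# Monte Carlo estimate of the test prediction error on fresh covariates.
X_test = sample_sphere(5000, d, np.sqrt(d), rng)
test_error = float(np.mean((f_star(X_test) - relu(X_test @ W.T) @ a_hat) ** 2))
print(test_error)
```

Only the second-layer coefficients are trained, which is what makes the problem a linear (ridge) regression in the random features.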
Theorem 1 characterized the test error in the population limit , whereas Theorems 3 and 4 characterize the same quantity in the case when .
What happens when both and are finite? In the proportional regime and , the precise asymptotics of were calculated in [MM19].
What happens beyond the proportional asymptotics? We conjecture that the limiting factor is given by the smallest of and . Namely, if for some positive , then the prediction error is the same as the one of fitting a degree- polynomial, i.e. . We leave this conjecture to future work.
4 Further related work
Donoho and Johnstone [DJ89] study an approximation problem analogous to the one we considered in Section 2, although in dimensions. Their problem essentially reduces to determining rates of approximation on the unit circle, with the technical difference that the ’s are equi-spaced along the circle instead of being random. As for other references mentioned in Section 1.2, the lower bounds of [DJ89] are worst case over differentiable functions.
The limitations of kernel methods in high-dimension are studied by El Karoui in [EK10b] (see also [EK10a]), which analyzes kernel random matrices of the form . The analysis of [EK10b] is limited to the proportional asymptotics , and establishes that in this regime is well approximated by the Gram matrix of raw feature vectors plus a diagonal term: , where . This result is related to our Theorems 3 and 4, which deal with kernel methods. However our results analyze general polynomial scalings , while [EK10b] assumes . Also [EK10b] analyzes the spectrum of but not the prediction error of kernel methods. Finally, a large part of our technical work is devoted to RF and NT models, cf. Theorems 1 and 2, which are not touched upon by [EK10b].
Recent work of Vempala and Wilmes [VW18] analyzes what amounts to an RF model. These authors prove that RF can learn a degree- polynomial from samples using neurons, and that at least queries are needed within the statistical query model. While related, our setting is not directly comparable to theirs. Notice further that we obtain a sharper tradeoff, since we obtain the precise exponents of .
After the present paper appeared as a preprint, several authors presented important contributions to the same line of work. In particular, Liang, Rakhlin, and Zhai [LRZ19] study kernel ridge regression in dimension using samples. Assuming the target function has bounded RKHS norm, they derive upper and lower bounds on the rate of convergence of the generalization error. This result is related to our Theorem 3. The most important difference is that we do not assume that the target function has bounded RKHS norm. Instead we obtain the precise asymptotics of the generalization error in a regime in which it is non-vanishing. As illustrated in Section 1.3, this asymptotic analysis indeed captures the actual behavior in practically reasonable settings.
From a technical viewpoint, several of our calculations make use of harmonic analysis over the -dimensional sphere, as is natural given that ’s are uniform over the sphere. Spherical harmonics expansions appear in related contexts, e.g. in [DJ89, Bac17a, VW18].
Let us finally mention that an alternative approach to the analysis of two-layers neural networks in the wide limit was developed in [MMN18, RVE18, SS18, CB18, MMM19] using mean field theory. Unlike in the neural tangent approach, the evolution of network weights is described beyond the linear regime in this theory.
5 Technical background
In this section we introduce some notation and technical background which will be useful for the proofs in the next sections. In particular, we will use decompositions in (hyper-)spherical harmonics on the sphere and in orthogonal polynomials on the real line. All of the properties listed below are classical: we will however prove a few facts that are slightly less standard. We refer the reader to [EF14, Sze39, Chi11] for further information on these topics. As mentioned above, expansions in spherical harmonics were used in the past in the statistics literature, for instance in [DJ89, Bac17a].
5.1 Functional spaces over the sphere
For , we let denote the sphere with radius in . We will mostly work with the sphere of radius , and will denote by the uniform probability measure on . All functions in the following are assumed to be elements of , with scalar product and norm denoted as and :
[TABLE]
For , let be the space of homogeneous harmonic polynomials of degree on (i.e. homogeneous polynomials satisfying ), and denote by the linear space of functions obtained by restricting the polynomials in to . With these definitions, we have the following orthogonal decomposition
[TABLE]
The dimension of each subspace is given by
[TABLE]
For each , the spherical harmonics form an orthonormal basis of :
[TABLE]
Note that our convention is different from the more standard one, that defines the spherical harmonics as functions on . It is immediate to pass from one convention to the other by a simple scaling. We will drop the superscript and write whenever clear from the context.
We denote by the orthogonal projections to in . This can be written in terms of spherical harmonics as
[TABLE]
We also define , , and , .
5.2 Gegenbauer polynomials
The -th Gegenbauer polynomial is a polynomial of degree . Consistently with our convention for spherical harmonics, we view as a function . The set forms an orthogonal basis on , where is the distribution of when , satisfying the normalization condition:
[TABLE]
In particular, these polynomials are normalized so that . As above, we will omit the superscript when clear from the context.
Gegenbauer polynomials are directly related to spherical harmonics as follows. Fix and consider the subspace of formed by all functions that are invariant under rotations in that keep unchanged. It is not hard to see that this subspace has dimension one, and coincides with the span of the function .
We will use the following properties of Gegenbauer polynomials
1. For
[TABLE]
2. For
[TABLE]
3. Recurrence formula
[TABLE]
4. Rodrigues’ formula
[TABLE]
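The recurrence formula gives a convenient way to evaluate these polynomials numerically. The sketch below assumes the standard three-term recurrence for Gegenbauer polynomials, rescaled to the convention used here (domain [-d, d] and value 1 at t = d); the displayed formula in the text is taken to have this form, which is an assumption of the example. It then checks orthogonality and the normalization by quadrature for d = 5, using the classical fact that the cosine of the angle between two uniform points on the sphere has density proportional to (1 - u²)^((d-3)/2). The function names are hypothetical.

```python
def gegenbauer_q(k, d, t):
    """Evaluate Q_k^{(d)}(t) on [-d, d], normalized so that Q_k^{(d)}(d) = 1.

    Assumed recurrence (standard three-term recurrence, rescaled):
    (j + d - 2) Q_{j+1} = (2j + d - 2) (t/d) Q_j - j Q_{j-1}.
    """
    q_prev, q = 1.0, t / d
    if k == 0:
        return q_prev
    for j in range(1, k):
        q_prev, q = q, ((2 * j + d - 2) * (t / d) * q - j * q_prev) / (j + d - 2)
    return q

def inner_product(j, k, d, n_grid=20000):
    """Midpoint-rule approximation of E[Q_j(t) Q_k(t)] with t = d * u,
    where u has density proportional to (1 - u^2)^((d - 3) / 2) on [-1, 1]."""
    h = 2.0 / n_grid
    num = den = 0.0
    for i in range(n_grid):
        u = -1.0 + (i + 0.5) * h
        w = (1.0 - u * u) ** ((d - 3) / 2)
        num += w * gegenbauer_q(j, d, d * u) * gegenbauer_q(k, d, d * u)
        den += w
    return num / den

d = 5
print(gegenbauer_q(3, d, d))       # value at the endpoint t = d
print(inner_product(1, 2, d))      # orthogonality of distinct degrees
print(inner_product(2, 2, d))      # squared norm of Q_2; compare with 1/14
```

For d = 5 there are 14 independent spherical harmonics of degree 2, so the last printed value should be close to 1/14, in line with the normalization condition stated above.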
Note in particular that property 2 implies that –up to a constant– is a representation of the projector onto the subspace of degree - spherical harmonics
[TABLE]
For a function (where is the distribution of when ), denote its spherical harmonics coefficients by
[TABLE]
Then the following equation holds in sense
[TABLE]
To any rotationally invariant kernel , with , we can associate a self adjoint operator via
[TABLE]
By rotational invariance, the space of homogeneous polynomials of degree is an eigenspace of , and we will denote the corresponding eigenvalue by . In other words . The eigenvalues can be computed via
[TABLE]
5.3 Hermite polynomials
The Hermite polynomials form an orthogonal basis of , where is the standard Gaussian measure, and has degree . We will follow the classical normalization (here and below, expectation is with respect to ):
[TABLE]
As a consequence, for any function , we have the decomposition
[TABLE]
The Hermite polynomials can be obtained as high-dimensional limits of the Gegenbauer polynomials introduced in the previous section. Indeed, the Gegenbauer polynomials (up to a scaling in domain) are constructed by Gram-Schmidt orthogonalization of the monomials with respect to the measure , while Hermite polynomials are obtained by Gram-Schmidt orthogonalization with respect to . Since (here denotes weak convergence), it is immediate to show that, for any fixed integer ,
[TABLE]
Here and below, for a polynomial, is the vector of the coefficients of . As a consequence, for any fixed integer , we have
[TABLE]
where and are given in Eq. (42) and (38).
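The convergence of Gegenbauer to Hermite polynomials can be checked by hand for the first two degrees. The computation below is a sketch under two assumptions that are standard but not spelled out in the surviving text: the dimension counts $B(d,1) = d$ and $B(d,2) = (d+2)(d-1)/2$, and the explicit forms $Q_1^{(d)}(t) = t/d$ and $Q_2^{(d)}(t) = (t^2/d - 1)/(d-1)$ obtained from the three-term recurrence with the normalization $Q_k^{(d)}(d) = 1$.

```latex
\begin{align*}
B(d,1)^{1/2}\, Q_1^{(d)}(\sqrt{d}\,x)
  &= \sqrt{d}\cdot\frac{x}{\sqrt{d}} = x = \mathrm{He}_1(x), \\
B(d,2)^{1/2}\, Q_2^{(d)}(\sqrt{d}\,x)
  &= \sqrt{\frac{(d+2)(d-1)}{2}}\cdot\frac{x^2-1}{d-1}
   = \sqrt{\frac{d+2}{2(d-1)}}\,\bigl(x^2-1\bigr)
   \;\xrightarrow[d\to\infty]{}\; \frac{1}{\sqrt{2!}}\,\mathrm{He}_2(x).
\end{align*}
```

In both cases the rescaled Gegenbauer polynomial converges coefficient by coefficient to $\mathrm{He}_k(x)/\sqrt{k!}$, and for degree one the identity is exact at every $d$.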
5.4 Notations
Throughout the proofs, (resp. ) denotes the standard big-O (resp. little-o) notation, where the subscript emphasizes the asymptotic variable. We denote (resp. ) the big-O (resp. little-o) in probability notation: if for any , there exists and , such that
[TABLE]
and respectively: , if converges to 0 in probability.
We will occasionally hide logarithmic factors using the notation (resp. ): if there exists a constant such that . Similarly, we will denote (resp. ) when considering the big-O in probability notation up to a logarithmic factor.
6 Proof of Theorem 1.(a): RF model lower bound
6.1 Proof of Theorem 1.(a): Outline
Recall that independently. We define for , so that independently. Let , and . We denote to be the expectation operator with respect to , to be the expectation operator with respect to , and to be the expectation operator with respect to .
Define the random vectors , , , with
[TABLE]
Define the random matrix , with
[TABLE]
In what follows, we write for the random features risk, omitting the dependence on the weights . By the definition and a simple calculation, we have
[TABLE]
By orthogonality, we have
[TABLE]
which gives
[TABLE]
where the last inequality used the fact that
[TABLE]
so that
[TABLE]
We claim that we have
[TABLE]
This follows from Propositions 1 and 2, stated below.
We will denote below by , , the coefficients of in the basis of Gegenbauer polynomials. Explicitly, since , we can expand as
[TABLE]
where
[TABLE]
**Proposition 1 (Expected norm of ).**
Let be a sequence of activation functions with . Define by
[TABLE]
Then
[TABLE]
**Proposition 2 (Lower bound on the kernel matrix).**
Assume for a fixed integer and any (in particular, is sufficient for any fixed ). Let independently, and be a sequence of activation functions with . Let be the kernel matrix defined by Eq. (48). Then for any ,
[TABLE]
with high probability as .
The proof of Proposition 2 relies on the following tight bound on the operator norm of the Gegenbauer polynomials of the Gram matrix:
**Proposition 3 (Bound on the Gram matrix).**
Let for a fixed integer and any . Let independently, and be the ’th Gegenbauer polynomial with domain . Consider the random matrix , with . Then we have
[TABLE]
The proofs of these three propositions are provided in the next sections. Proposition 1 implies
[TABLE]
From Proposition 2, we have with high probability
[TABLE]
Then by Markov's inequality, we have with high probability
[TABLE]
Equation (50) follows by noting that is non-decreasing in (see Lemma 1 below) and , and recalling . Combining with Eq. (49), the theorem holds.
**Lemma 1.**
The number of independent degree- spherical harmonics on is non-decreasing in for any fixed .
Proof of Lemma 1.
By [EF14, Section 4.1], we have
[TABLE]
and
[TABLE]
where is non-negative for . This immediately shows that is non-decreasing in . ∎
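The monotonicity stated in Lemma 1 is easy to confirm numerically for small values. The sketch below uses the classical closed form for the number of independent degree-k spherical harmonics in dimension d, namely (2k + d - 2)/k times the binomial coefficient C(k + d - 3, k - 1) for k ≥ 1; this closed form is assumed to agree with the expression from [EF14] used in the proof, and the function name is hypothetical.

```python
from math import comb

def num_harmonics(d, k):
    """Number of independent degree-k spherical harmonics in dimension d (d >= 2)."""
    if k == 0:
        return 1
    # The product is always divisible by k, so integer division is exact.
    return (2 * k + d - 2) * comb(k + d - 3, k - 1) // k

# Rows are indexed by the degree k, columns by the dimension d = 3, ..., 11.
table = {k: [num_harmonics(d, k) for d in range(3, 12)] for k in range(6)}
is_monotone = all(
    row[i] <= row[i + 1] for row in table.values() for i in range(len(row) - 1)
)
for k, row in table.items():
    print(k, row)
print("non-decreasing in d:", is_monotone)
```

For example, the function returns 5 for (d, k) = (3, 2), the familiar count of degree-2 harmonics on the ordinary sphere.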
6.2 Proof of Proposition 1
The quantity can be rewritten as
[TABLE]
First we calculate . Note the spherical harmonics expansion of gives
[TABLE]
and the Gegenbauer expansion of gives
[TABLE]
By the fact that
[TABLE]
we have
[TABLE]
We deduce that
[TABLE]
This proves the proposition.
6.3 Proof of Proposition 2
Recall the expansion of in terms of Gegenbauer polynomials, see Eqs. (51) and (52). From the properties of Gegenbauer polynomials, we have
[TABLE]
We can therefore decompose :
[TABLE]
where with .
Define
[TABLE]
Note that
[TABLE]
where is given by
[TABLE]
As a result, we have , and hence
[TABLE]
In the following, we give a lower bound for . Note we have
[TABLE]
By Proposition 3, we have
[TABLE]
Further we have
[TABLE]
For sufficiently large, there exists such that for any :
[TABLE]
Hence, there exists constant , such that for large , we have
[TABLE]
Recalling that , and , we deduce
[TABLE]
Combining Eq. (54) and (55) we get
[TABLE]
Plugging Eq. (56) into Eq. (53), we get with high probability
[TABLE]
Hence the proposition follows.
6.4 Proof of Proposition 3
Step 1. Bounding operator norm by moments.
We define . Then we have
[TABLE]
For any sequence of integers , we have
[TABLE]
To prove the proposition, it suffices to show that for any sequence , we have
[TABLE]
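For orientation, the moment method used in this step rests on the following elementary inequalities, written here for a generic symmetric random matrix $A$ and an integer $p \ge 1$ (the symbol $A$ is a placeholder introduced for this remark, not the notation of the proof).

```latex
\begin{align*}
\|A\|_{\mathrm{op}}^{2p}
  &= \max_{i} \lambda_i(A)^{2p}
  \;\le\; \sum_{i} \lambda_i(A)^{2p}
  \;=\; \mathrm{Tr}\bigl(A^{2p}\bigr), \\
\mathbb{E}\,\|A\|_{\mathrm{op}}
  &\le \Bigl(\mathbb{E}\,\|A\|_{\mathrm{op}}^{2p}\Bigr)^{1/(2p)}
  \;\le\; \Bigl(\mathbb{E}\,\mathrm{Tr}\bigl(A^{2p}\bigr)\Bigr)^{1/(2p)}.
\end{align*}
```

The first line uses that the eigenvalues of a symmetric matrix are real, so even powers are non-negative; the second line is Jensen's inequality. Letting $p$ grow slowly with the dimension makes the loss in the first inequality negligible, which is why the proof works with sequences of integers $p$.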
In the following, we calculate . We have
[TABLE]
To calculate this quantity, we will apply repeatedly the following identity, which is an immediate consequence of Eq. (33). For any distinct, we have
[TABLE]
Throughout the proof, we will denote by constants that may depend on but not on . The value of these constants is allowed to change from line to line.
**Step 2. The induced graph and equivalence of index sequences.**
For any index sequence , we define an undirected multigraph associated to index sequence . The vertex set is the set of distinct elements in . The edge set is formed as follows: for any we add an edge between and (with convention ). Notice that this could be a self-edge, or a repeated edge: will be –in general– a multigraph. We denote by the number of vertices of , and by the number of edges (counting multiplicities). In particular, for . We define
[TABLE]
For any two index sequences , we say they are equivalent , if the two graphs and are isomorphic, i.e. there exists an edge-preserving bijection of their vertices (ignoring vertex labels). We denote the equivalence class of to be
[TABLE]
We define the quotient set by
[TABLE]
For any integer and , we define
[TABLE]
**Lemma 2.**
The following properties hold for all sufficiently large and :
For any equivalent index sequences , we have .
For any index sequence , we have .
For any index sequence , the degree of any vertex in must be even.
The number of equivalence classes .
Recall that denotes the number of distinct elements in . Then, for any , the number of elements in the corresponding equivalence class satisfies .
Proof.
Properties , and are straightforward. Note that for any . For property , notice that to each distinct equivalence class we can associate, in an injective manner, a string of length over an alphabet of size (simply follow the elements in in order, and replace the labels by some canonical ones, e.g. in order of appearance). Therefore the number of classes is bounded as
[TABLE]
For property , we need to bound the number of elements in for representative with degree . Define a mapping as follows. For , is a vector of the distinct elements in , listed in increasing order. For any , the pre-image contains at most elements. As a result, we have
[TABLE]
This proves property . ∎
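The counting argument for the number of equivalence classes can be made concrete. The sketch below relabels an index sequence by order of first appearance, which is one possible choice of canonical labels. Two sequences with the same relabeled string induce isomorphic graphs, but the converse can fail (for instance under cyclic shifts), so the number of distinct relabeled strings only upper-bounds the number of equivalence classes. The function name and the small enumeration are illustrative assumptions.

```python
from itertools import product

def canonical_form(seq):
    """Relabel the entries of seq by order of first appearance: 1, 2, 3, ..."""
    labels = {}
    return tuple(labels.setdefault(v, len(labels) + 1) for v in seq)

print(canonical_form((7, 3, 7, 9)))   # relabeled by first appearance

# Enumerate all cyclic sequences of length p over {1, ..., p} without self-edges
# and count the distinct canonical strings.
p = 4
sequences = [
    s for s in product(range(1, p + 1), repeat=p)
    if all(s[i] != s[(i + 1) % p] for i in range(p))
]
n_canonical = len({canonical_form(s) for s in sequences})
print(n_canonical, "canonical strings, compared with the crude bound", p ** p)
```

Each canonical string has length p over an alphabet of size at most p, which is the source of the crude bound used in the proof.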
In view of property in the last lemma, given an equivalence class , we will write for the corresponding value common to the equivalence class .
**Step 3. The skeletonization process.**
For multi-graph , we say that one of its vertices is redundant, if it has degree 2. For any index sequence (i.e. such that does not have self-edges), we denote by to be the redundancy of , and by to be the skeleton of , both defined by the following skeletonization process. Let . For any integer , if has no redundant vertices then stop and set . Otherwise, select a redundant vertex arbitrarily (the -th element of ). If , then remove from the graph (and from the sequence), together with its adjacent edges, and connect and with an edge, and denote to be the resulting index sequence, i.e., . If , then remove from the graph (and from the sequence), together with its adjacent edges, and denote to be the resulting index sequence, i.e., . (Here , and have to be interpreted modulo , the length of .) The redundancy of , denoted by , is the number of vertices removed during the skeletonization process.
It is easy to see that the outcome of this process is independent of the order in which we select vertices.
**Example 1.**
For illustration, we give two examples of skeletonization processes:
- •
Let , and set . First notice that are redundant vertices and we can remove them in arbitrary order to get . Then notice that is redundant whence we get . Hence we have , and .
- •
Consider the skeletonization process of . Take . First notice that are redundant vertices and can be removed in arbitrary order to get . We see that there is no further redundant vertex in , so that , and .
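The skeletonization process described above can be sketched in a few lines. The code below is an illustrative implementation under stated assumptions: index sequences are treated as cyclic tuples without self-edges, a vertex is redundant exactly when it appears once in the sequence (which is equivalent to having degree 2 in that case), the first redundant vertex found is always the one removed, and the process stops once a single index remains. The two input sequences are hypothetical examples, not the ones from Example 1.

```python
from collections import Counter

def skeletonize(seq):
    """Return (skeleton, redundancy) of a cyclic index sequence without self-edges."""
    seq = list(seq)
    removed = 0
    while len(seq) > 1:
        p = len(seq)
        counts = Counter(seq)
        # A vertex has degree 2 exactly when it appears once in the cyclic sequence.
        j = next((i for i, v in enumerate(seq) if counts[v] == 1), None)
        if j is None:
            break                                # no redundant vertex left
        prev, nxt = seq[(j - 1) % p], seq[(j + 1) % p]
        if prev != nxt or p == 2:
            # Distinct neighbors: drop the vertex, its neighbors become adjacent.
            # (For p == 2 the two neighbors are the same position, so only
            # the redundant vertex itself is dropped.)
            del seq[j]
        else:
            # Equal neighbors: drop the vertex and merge the two copies of the
            # neighbor, so that no self-edge is created.
            for idx in sorted({j, (j + 1) % p}, reverse=True):
                del seq[idx]
        removed += 1
    return seq, removed

print(skeletonize((1, 2, 1, 3)))       # collapses to a single index
print(skeletonize((1, 2, 3, 1, 2)))    # stops at a skeleton with two vertices
```

The first sequence reduces to a single index after two removals, so it is of type 1 in the terminology introduced below; the second loses one vertex and stops at (1, 2, 1, 2), whose two vertices both have degree 4, so it is of type 2.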
**Lemma 3.**
For the above skeletonization process, the following properties hold:
If , then . That is, the skeletons of equivalent index sequences are equivalent.
For any , define
[TABLE]
Then we have
[TABLE]
For any , its skeleton is either formed by a single element, or an index sequence whose graph has the property that every vertex has degree greater or equal to .
Proof.
Property holds by the definition of equivalence which is graph isomorphism. Property used the fact that, if and , we have
[TABLE]
so that deleting a redundant vertex will contribute a factor.
To show property , note that any intermediate index sequence in the skeletonization process is such that only has even degree vertices, is connected, and has no self-edges (by induction). Hence, only has even degree vertices, is connected, and has no self-edges. Note that cannot have degree-2 vertices, and has at least one vertex (because the last vertex is not removed). Therefore, as long as contains at least two vertices, can only contain vertices with degree greater or equal to . ∎
Given an index sequence , we say is of type 1, if contains only one index. We say is of type 2 if has more than one index (so that by Lemma 3, can only contain vertices with degree greater or equal to ). Denote the class of type 1 index sequence (respectively type 2 index sequence) by (respectively ). We also denote by , the set of equivalence classes of sequences in . This definition makes sense since the equivalence class of the skeleton of a sequence only depends on the equivalence class of the sequence itself.
**Step 4. Type 1 index sequences.**
Recall that is the number of vertices in , and is the number of edges in (which coincides with the length of ). We consider . Since for , every edge of must be at most a double edge. Indeed, if had multiplicity larger than in , neither nor could be deleted during the skeletonization process, contradicting the assumption that contains a single vertex. Therefore, we must have . According to Lemma 3., for every , we have
[TABLE]
Note by Lemma 2., the number of elements in the equivalence class of is . Hence we get
[TABLE]
Therefore
[TABLE]
where in the last step we used Lemma 2 and the fact that for some .
**Step 5. Type 2 index sequences.**
We have the following simple lemma bounding . This bound is useful when is a skeleton.
**Lemma 4.**
There exist constants and depending only on such that, for any , and any index sequence with , we have
[TABLE]
Proof.
By Hölder’s inequality, we have
[TABLE]
The lemma follows from the claim that (for )
[TABLE]
In the following, we will write for the coefficient of in the polynomial . To show the above claim, recall that we have, for any ,
[TABLE]
Therefore there exists a constant such that for all large enough
[TABLE]
As a consequence, for any integer , we have
[TABLE]
Define the random variable for . The probability distribution of is given in Eq. (78) below. Hence defining , we have (since for all large enough)
[TABLE]
where . Therefore, for all ,
[TABLE]
Combining the above two upper bounds (63) and (64), we have
[TABLE]
By noting that for some , this proves the claim. ∎
Suppose , and denote to be the number of vertices in . We have, for a sequence
[TABLE]
Here holds by Lemma 3.; by Lemma 4, and the fact that , together by ; because ; by Lemma 3., implying that for , each vertex of has degree greater or equal to , so that (notice that for we can assume ). Finally, follows since , and the definition of implying .
Note by Lemma 2., the number of elements in the equivalence class . Since depends only on the equivalence class of , we will write, with a slight abuse of notation . Notice that the number of equivalence classes with is upper bounded by the number of multigraphs with vertices and edges, which is at most . Hence we get
[TABLE]
Define . We will assume hereafter that is selected such that
[TABLE]
By calculus and condition (68), the function is maximized over at , whence
[TABLE]
**Step 6. Concluding the proof.**
Using Eqs. (61) and (69), for any satisfying Eq. (68), we have
[TABLE]
From Eq. (57), we obtain
[TABLE]
Finally setting and , this yields
[TABLE]
Therefore, as long as , we have . It is immediate to check that the above choice of satisfies the required conditions and Eq. (68) for all large enough.
7 Proof of Theorem 1.(b): RF model upper bound
Recall that independently. We define for , so that independently. Let , and . We denote to be the expectation operator with respect to , to be the expectation operator with respect to , and to be the expectation operator with respect to .
Without loss of generality, assume that are polynomials of degree at most , i.e. . We denote the expansion of in terms of Gegenbauer polynomials by (for )
[TABLE]
where
[TABLE]
Denote . We introduce the operator , such that for any
[TABLE]
In particular, for any and , we have
[TABLE]
It is easy to check that (the adjoint operator) has the same expression as with and swapped. We define the operator as . For any , we have
[TABLE]
where
[TABLE]
We will restrict ourselves to the subspace of polynomials of degree less or equal to . We have for and ,
[TABLE]
Hence is an orthogonal basis that diagonalizes on . By Assumption 1.(b), we deduce that is a bijection from to itself for sufficiently large. In particular, its restricted inverse is well defined.
Consider . We can expand the risk achieved at parameter as
[TABLE]
Let us define and choose . We consider the expectation over of the RF risk:
[TABLE]
It is easy to check that . Hence
[TABLE]
Recall the decomposition of in terms of spherical harmonics (and note we assumed is a degree polynomial)
[TABLE]
and the equations (74) and (75), we get
[TABLE]
As a result, we deduce that
[TABLE]
Hence, by Assumption 1.(b), and from the assumption that , we deduce that the risk converges in to zero, and therefore in probability.
8 Proof of Theorem 2.(a): NT model lower bound
8.1 Preliminaries
We begin with some notations and simple remarks.
**Lemma 5.**
Assume is an activation function with for some constants and . Then
1. .
2. Let . Then there exists such that, for ,
[TABLE]
3. Let . Then there exists a coupling of and such that
[TABLE]
Proof.
Claim 1 is obvious.
For claim 2, note that the probability distribution of when is given by
[TABLE]
A simple calculation shows that as , and hence . Therefore
[TABLE]
where the last inequality holds provided .
Finally, for point 3, without loss of generality we will take , so that . By the same argument given above (and since both and have densities bounded uniformly in ), for any we can choose bounded continuous so that for any ,
[TABLE]
It is therefore sufficient to prove the claim for . Letting , independent of , we construct the coupling via
[TABLE]
where we set . We thus have almost surely, and the claim follows by weak convergence. ∎
We denote the Hermite decomposition of by
[TABLE]
We state separately the assumptions of Theorem 2.(a) for future reference.
**Assumption 4 (Integrability condition).**
The activation function is weakly differentiable with weak derivative . There exist constants , , with and such that, for all , .
**Assumption 5 (Level- non-trivial Hermite components).**
Recall that denote the -th coefficient of the Hermite expansion of (with the standard Gaussian measure).
Then there exists such that and
[TABLE]
It is also useful to notice that the Hermite coefficients of can be computed from the ones of using the relation .
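The relation between the Hermite coefficients of an activation and those of its derivative can be checked numerically. The sketch below assumes the probabilists' Hermite polynomials, the coefficient definition μ_k(f) = E[f(G) He_k(G)] for a standard Gaussian G, and the identity μ_k(σ') = μ_{k+1}(σ), which follows from Gaussian integration by parts and is taken here to be the relation referred to above. The ReLU activation, the quadrature grid and the function names are illustrative choices.

```python
import math

def hermite(k, x):
    """Probabilists' Hermite polynomial He_k(x), via He_{j+1} = x He_j - j He_{j-1}."""
    h_prev, h = 1.0, x
    if k == 0:
        return h_prev
    for j in range(1, k):
        h_prev, h = h, x * h - j * h_prev
    return h

def hermite_coefficient(f, k, half_width=10.0, n_grid=20000):
    """Midpoint-rule approximation of mu_k(f) = E[f(G) He_k(G)], G ~ N(0, 1)."""
    step = 2.0 * half_width / n_grid
    total = 0.0
    for i in range(n_grid):
        x = -half_width + (i + 0.5) * step
        total += f(x) * hermite(k, x) * math.exp(-0.5 * x * x)
    return total * step / math.sqrt(2.0 * math.pi)

relu = lambda x: max(x, 0.0)                    # hypothetical activation
relu_prime = lambda x: 1.0 if x > 0 else 0.0    # its weak derivative

for k in range(4):
    lhs = hermite_coefficient(relu_prime, k)    # mu_k(sigma')
    rhs = hermite_coefficient(relu, k + 1)      # mu_{k+1}(sigma)
    print(k, round(lhs, 6), round(rhs, 6))
```

The grid is chosen so that the kink of the ReLU at the origin falls on a cell boundary, which keeps the midpoint rule accurate despite the lack of smoothness.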
8.2 Proof of Theorem 2.(a): Outline
The proof for the NT model follows the same scheme as for the RF case. However, several steps are technically more challenging. We will follow the same notations introduced in Section 6.1. In particular will denote, respectively, expectation with respect to , , .
We define the random vector , where, for each , , and analogously , , as follows
[TABLE]
We define the random matrix , where, for each , , is given by
[TABLE]
Proceeding as for the RF model, we obtain
[TABLE]
We claim that we have
[TABLE]
This is achieved in the following two propositions.
**Proposition 4 (Expected norm of ).**
Let be an activation function satisfying Assumption 4. Define
[TABLE]
where expectation is with respect to . Then there exists a constant (depending only on the constants in Assumption 4) such that, for any and ,
[TABLE]
**Proposition 5 (Lower bound on the kernel matrix).**
Let for some , and independently. Let be an activation that satisfies Assumption 4 and Assumption 5. Let be the kernel matrix with block defined by Eq. (86). Then there exists a constant that depends on the activation function , such that
[TABLE]
with high probability as .
These two propositions will be proven in the next sections. Proposition 4 shows that
[TABLE]
Note , and . By Markov's inequality, we have Eq. (87). Equation (88) follows simply by Proposition 5. This proves the theorem.
8.3 Proof of Proposition 4
We denote the Gegenbauer decomposition of by
[TABLE]
where
[TABLE]
By Lemma 5, applied to function (instead of ), under Assumption 4, we have (for a constant independent of ). We therefore have (recalling the normalization of the Gegenbauer polynomials in Eq. (32))
[TABLE]
We define the NT kernel by
[TABLE]
Then
[TABLE]
where in the last step we used Eq. (33). By the recurrence relationship for Gegenbauer polynomials (35), we have
[TABLE]
where
[TABLE]
We use the convention that . This gives
[TABLE]
Hence we get
[TABLE]
where
[TABLE]
The last inequality follows by Eqs. (89) and (91).
We define
[TABLE]
Using the fact that the kernel preserves the decomposition (29), we have
[TABLE]
Note by Eq. (90), we have (as always, expectations are with respect to independently)
[TABLE]
where the fourth equality used the fact that .
Hence we have
[TABLE]
where we used the fact that is non-decreasing in given by Lemma 1. This concludes the proof.
8.4 Proof of Proposition 5
8.4.1 Auxiliary lemmas
In the proof of this proposition, we will need the following lemmas.
**Lemma 6.**
Let be a function such that and . Let be the coefficients of its expansion in terms of the -th order Gegenbauer polynomials
[TABLE]
Then we can write
[TABLE]
with the new coefficients given by
[TABLE]
Proof.
We recall the following two formulas for (see Section 5.2):
[TABLE]
Furthermore, we have , and therefore . We insert these expressions in the expansion of the function
[TABLE]
Matching the coefficients of the expansion yields
[TABLE]
∎
Similarly, we can write the decomposition of to be
[TABLE]
where the coefficients are given by the same relation as in the above lemma
[TABLE]
**Lemma 7.**
Let be a matrix-valued function defined by
[TABLE]
Then there exist functions such that
[TABLE]
Proof.
Case 1: .
We first consider the case . We will denote for convenience. Given any three functions , we define
[TABLE]
Let us rotate and such that and . We can rewrite
[TABLE]
where
[TABLE]
Similarly, we can write
[TABLE]
where
[TABLE]
We check in both cases that:
[TABLE]
We conclude that and are equal if and only if
[TABLE]
We can therefore choose for
[TABLE]
Case 2: .
Similarly, for some fixed and , we define
[TABLE]
We can show that the matrices and are equal if and only if
[TABLE]
We can therefore fix and . ∎
**Lemma 8.**
Let be an activation function such that for some constants , with . Let the Hermite and Gegenbauer decompositions of be
[TABLE]
Then we have for any fixed ,
[TABLE]
Proof.
Recall the correspondence (43) between Gegenbauer and Hermite polynomials. Note for any monomial , by Lemma 5., we have
[TABLE]
This gives for any fixed , we have
[TABLE]
This proves the lemma. ∎
**Lemma 9.**
For any fixed , let be the -th Gegenbauer polynomial. We expand
[TABLE]
Then we have
[TABLE]
Proof.
Using the correspondence (43) between Gegenbauer and Hermite polynomials we have
[TABLE]
This gives
[TABLE]
This proves the lemma. ∎
**Lemma 10.**
Let for a fixed integer . Let independently. Denote a matrix with
[TABLE]
Then as , we have
[TABLE]
Proof.
Let us consider , and its first coordinate. We have which has density on , cf. Eq. (78):
[TABLE]
where the last inequality holds for all large enough, since as . Hence, we have:
[TABLE]
Taking , we get
[TABLE]
Using the following bound:
[TABLE]
which concludes the proof. ∎
8.4.2 Proof of Proposition 5
Step 1. Construction of the activation function .
By Assumption 4 and Lemma 5 (applied to instead of ), we have and we consider its expansion in terms of Gegenbauer polynomials (as always, expectation is taken with respect to with ):
[TABLE]
Let be two indices that satisfy the conditions of Assumption 5. Using the Gegenbauer coefficients of , we define by
[TABLE]
for some that we will fix later (with ).
Step 2. The functions and .
Let and be the matrix-valued functions associated respectively to and
[TABLE]
From Lemma 7, there exist functions and , such that
[TABLE]
We define . Then we can write
[TABLE]
where for .
Step 3. Construction of the kernel matrices.
Let with -th block (for ) given by
[TABLE]
Note that we have . By Eq. (101) and (98), it is easy to see that . Then we have . In the following, we would like to lower bound matrix .
We decompose as
[TABLE]
where is a block-diagonal matrix, with
[TABLE]
and is formed by blocks for , defined by
[TABLE]
In the rest of the proof, we will prove that and for small enough with high probability.
Step 4. Prove that .
Denoting , we get, from Eq. (92),
[TABLE]
Using the notations of Lemma 6, we get
[TABLE]
We get similar expressions for with replaced by . Because we defined and by only modifying the -th and -th coefficients, we get
[TABLE]
Recalling that only depend on and (Lemma 6), we get
[TABLE]
By Assumption 4 and the convergence in Lemma 8, for any fixed ,
[TABLE]
Using the expression of we get
[TABLE]
From Lemma 9, we recall that the coefficients of the -th Gegenbauer polynomial satisfy
[TABLE]
Furthermore, we have shown in Lemma 10 that . We deduce that
[TABLE]
Plugging the estimates (109), (110) and (112) into Eqs. (107) and (108), we obtain that
[TABLE]
From Eq. (106), using the fact that and Cramer’s rule for matrix inversion, it is easy to see that
[TABLE]
We deduce from (113), (105) and (114) that
[TABLE]
As a result, combining Eq. (115) with Eq. (102) and (99), we get
[TABLE]
By the expression of given by (104), we conclude that
[TABLE]
Since , we deduce that .
Step 5. Proving that .
By Lemma 7, we can express by
[TABLE]
with , independent of , and given by Eq. (93), namely
[TABLE]
(Notice that and are independent of by construction, cf. Eqs. (97), (98) and (100), (101).) By the definition of given in Eq. (103), we deduce that:
[TABLE]
We claim that, under the assumptions of Proposition 5, and denoting (where first appears in the definition of in Eq. (96), and till now are still not determined)
[TABLE]
where and , . Before proving this claim, let us show that it allows us to finish the proof of Proposition 5. Since , there exists a unit-norm vector , such that , and . Now we choose (first appears in the definition of in Eq. (96)): we set with some small enough. This yields , . Defining , we have
[TABLE]
and therefore, with high probability,
[TABLE]
We are left with the task of proving that the limits in Eqs. (118), (119) exist, with the desired properties. Using Eqs. (107) and (108), we get:
[TABLE]
Using Eq. (110), we get that the limits (118), (119) exist. Further, letting , we have
[TABLE]
while, for
[TABLE]
while, for
[TABLE]
It is easy to check , and to compute the gradients, using the identity , we get
[TABLE]
Under Assumption 5, we have and completing the proof.
9 Proof of Theorem 2.(b): NT model upper bound
The proof for the NT model follows the same scheme as for the RF case. However, several steps are technically more challenging. We follow the notation introduced in Section 6.1. In particular will denote, respectively, expectation with respect to , , .
Let us assume that are polynomials of degree at most , i.e. .
Denote and . We introduce the operator , such that for any ,
[TABLE]
It is easy to check that the adjoint operator verifies for any ,
[TABLE]
We define the operator as . For , we can write
[TABLE]
where
[TABLE]
Furthermore, we define as . For , we can write
[TABLE]
where
[TABLE]
and can be computed using the Gegenbauer recursion formula Eq. (35),
[TABLE]
with
[TABLE]
In particular, it is easy to check that
[TABLE]
We consider the subspace of corresponding to , the image of by operator . One can check that is an orthogonal basis of this subspace. Furthermore
[TABLE]
Hence this basis diagonalizes . By Eq. (44), we have
[TABLE]
By Assumption 2.(b), we have for any when is sufficiently large. Hence, the restricted inverse is well defined for sufficiently large.
Consider . We can expand the risk at parameter as
[TABLE]
Let us define and choose . We consider the expectation over of the NT risk:
[TABLE]
It is easy to check that . By Lemma 7, we have with
[TABLE]
From Assumption 2.(a) and Lemma 5.(b) applied to and , we get and . We deduce that the operator norm verifies .
Hence, there exists a constant such that
[TABLE]
Using the decomposition of in terms of harmonic polynomials (note we assumed is a degree polynomial) and Eq. (127), we have
[TABLE]
By Eq. (128), for any fixed , we have . Hence we get
[TABLE]
Hence, from the assumption that , we deduce that converges in to [math], and therefore in probability.
10 Proof of Theorem 4: risk for KR
10.1 Proof of Theorem 4
Step 1. Rewrite the , , , matrices.
The test error of empirical kernel ridge regression gives
[TABLE]
where , and with
[TABLE]
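For intuition, the quantity analyzed here can be estimated by Monte Carlo. The sketch below makes assumptions that are not in the text: covariates are drawn uniformly on the sphere of radius sqrt(d), the kernel is an inner-product kernel h(<x, x'>/d) with h supplied by the user, labels are noiseless, and the population risk is replaced by an average over fresh test points. The helper names `sample_sphere`, `krr_fit_predict` and `krr_test_error` are hypothetical.

```python
import numpy as np

def sample_sphere(n, d, rng):
    """Draw n points uniformly on the sphere of radius sqrt(d) in R^d."""
    g = rng.standard_normal((n, d))
    return np.sqrt(d) * g / np.linalg.norm(g, axis=1, keepdims=True)

def krr_fit_predict(X, y, X_test, h, lam):
    """Kernel ridge regression with kernel h(<x, x'>/d):
    prediction = h(X_test, X) (H + lam I)^{-1} y."""
    n, d = X.shape
    H = h(X @ X.T / d)                        # empirical kernel matrix
    coef = np.linalg.solve(H + lam * np.eye(n), y)
    return h(X_test @ X.T / d) @ coef

def krr_test_error(f, h, n, d, lam, n_test=2000, seed=0):
    """Monte Carlo estimate of the squared test error for a noiseless target f."""
    rng = np.random.default_rng(seed)
    X = sample_sphere(n, d, rng)
    X_test = sample_sphere(n_test, d, rng)
    pred = krr_fit_predict(X, f(X), X_test, h, lam)
    return float(np.mean((f(X_test) - pred) ** 2))

# Usage: a linear target, with n well above d.
err_linear = krr_test_error(lambda X: X[:, 0], np.exp, n=400, d=10, lam=1e-3)
```

Repeating the call with a higher-degree target and the same sample size is a simple way to probe the sample size-limited regime numerically; the outcome of such an experiment is not asserted here.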
Let . Define
[TABLE]
Let the spherical harmonics decomposition of be
[TABLE]
and the Gegenbauer decomposition of be
[TABLE]
We decompose the vectors and matrices , , , and in terms of spherical harmonics
[TABLE]
By Proposition 3 and Eq. (56), the kernel and can be rewritten as
[TABLE]
where
[TABLE]
and
[TABLE]
Step 2. Decompose the risk
Recalling , we decompose the risk as follows
[TABLE]
where
[TABLE]
Further, we denote , , , and ,
[TABLE]
Step 3. Term
Note we have
[TABLE]
where
[TABLE]
By Lemma 13, we have
[TABLE]
hence
[TABLE]
By Lemma 11, we have (with )
[TABLE]
Moreover, we have
[TABLE]
As a result, we have
[TABLE]
By Eq. (129) again, we have
[TABLE]
By Lemma 12, we have
[TABLE]
Moreover
[TABLE]
This gives
[TABLE]
Using the Cauchy-Schwarz inequality for , we get
[TABLE]
As a result, combining Eqs. (130), (132) and (131), we have
[TABLE]
Step 4. Term .
Note we have
[TABLE]
where
[TABLE]
By Lemma 14, we have
[TABLE]
so that
[TABLE]
Using the Cauchy-Schwarz inequality for , and by the expression of with , we get with high probability
[TABLE]
For term , we have
[TABLE]
Note we have , and with high probability, and
[TABLE]
As a result, we have
[TABLE]
where the last equality used the fact that and Assumption 3. Combining Eqs. (134), (135) and (136), we get
[TABLE]
Step 5. Terms and .
By Lemma 13 again, we have
[TABLE]
By Lemma 11, we have
[TABLE]
This gives
[TABLE]
Let us consider term:
[TABLE]
Notice that by Lemma 11, Lemma 13 and the definition of , for any integer :
[TABLE]
Hence,
[TABLE]
which gives
[TABLE]
We decompose using ,
[TABLE]
where
[TABLE]
First notice that
[TABLE]
Then by Lemma 13, we get
[TABLE]
Similarly, we get
[TABLE]
By Markov’s inequality, we deduce that
[TABLE]
Step 6. Finish the proof.
Combining Eqs. (137), (133), (138), (139) and (140), we have
[TABLE]
which concludes the proof.
10.2 Auxiliary results
Lemma 11.
Let be the collection of spherical harmonics on . Let . Denote
[TABLE]
Denote , and
[TABLE]
Then as long as as , we have
[TABLE]
with and .
Proof of Lemma 11.
Let . We can rewrite as
[TABLE]
where
[TABLE]
We use the matrix Bernstein inequality. Denote . Then we have , and
[TABLE]
where we use formula (34) and the normalization . Denote . Then we have
[TABLE]
where we used and . As a result, we have for any ,
[TABLE]
Integrating the tail bound proves the lemma. ∎
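The kind of concentration that the matrix Bernstein argument delivers can be checked numerically in a small case. The sketch below is illustrative only and assumes that the object of interest is the deviation of the empirical Gram matrix of spherical harmonics from the identity. It restricts to degrees 0 and 1, for which the normalized harmonics on the sphere of radius sqrt(d) are the constant function 1 and the coordinates x_1, ..., x_d; the function name `gram_deviation` is hypothetical.

```python
import numpy as np

def gram_deviation(n, d, seed=0):
    """Operator norm of Y^T Y / n - I, where the columns of Y are the
    degree-0 and degree-1 harmonics evaluated at n uniform points
    on the sphere of radius sqrt(d)."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n, d))
    x = np.sqrt(d) * g / np.linalg.norm(g, axis=1, keepdims=True)
    Y = np.hstack([np.ones((n, 1)), x])       # n x (d + 1) design matrix
    delta = Y.T @ Y / n - np.eye(d + 1)
    return float(np.linalg.norm(delta, 2))    # largest singular value

# Usage: the deviation shrinks as n grows relative to the number of columns.
small_n = gram_deviation(200, 10)
large_n = gram_deviation(20000, 10)
```

Higher-degree harmonics would require an explicit orthonormal basis, which this sketch deliberately avoids.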
Lemma 12.
Let be the collection of spherical harmonics on . Let . Denote
[TABLE]
Then for and , we have
[TABLE]
For , we have
[TABLE]
Proof of Lemma 12.
We have
[TABLE]
This proves the lemma. ∎
Lemma 13.
Let be a sequence of functions satisfying Assumption 3. Let . We have
[TABLE]
Proof of Lemma 13.
Denote
[TABLE]
Denote , and
[TABLE]
and
[TABLE]
Then we have
[TABLE]
where , and
[TABLE]
For , we have with high probability (note )
[TABLE]
To bound , let where and are orthogonal matrices, and . By Lemma 11 and the fact that , we have . Then we have
[TABLE]
where and .
For a symmetric matrix and a symmetric matrix , we have
[TABLE]
where
[TABLE]
Taking
[TABLE]
with . This gives
[TABLE]
with and .
Now we look at . We have
[TABLE]
where
[TABLE]
and
[TABLE]
and . Define
[TABLE]
We have
[TABLE]
Note we have
[TABLE]
and by Assumption 3 and we have . Along with the fact that , we have
[TABLE]
As a result, we have
[TABLE]
with . Finally, we have
[TABLE]
with . This proves the lemma. ∎
Lemma 14.
Let be a sequence of functions satisfying Assumption 3. Let . We have
[TABLE]
Proof of Lemma 14.
By Proposition 3, we have with . Denote the singular value decomposition , with , be two orthogonal matrices, and , with (Lemma 11). Then we have
[TABLE]
where for and for , and
[TABLE]
and
[TABLE]
By Assumption 3, we have . This proves the lemma. ∎
Acknowledgements
This work was partially supported by grants NSF DMS-1613091, CCF-1714305, IIS-1741162, and ONR N00014-18-1-2729, NSF DMS-1418362, NSF DMS-1407813.
Appendix A Numerical results with ridge regression
The reader might wonder whether the numerical results presented in Section 1.3 might change significantly if we changed the method to estimate the coefficients (for the model RF) or . Our main results –Theorem 1 and Theorem 2.(a)– predict that the result should not change qualitatively: these models are limited because they cannot approximate the target function (unless this is a low degree polynomial), regardless of the choice of the representative or .
In order to verify this prediction numerically, we repeated the experiments of Section 1.3 using ridge regression. We form a matrix containing the covariates (with for RF, and for NT), whereby for RF, and for NT. Letting , we estimate the coefficients via
[TABLE]
The results are reported in Figures 6, 7, 8, and are consistent with the ones of Section 1.3. Regularization does not help: it only reduces the peak at , as expected from [HMRT19], but not the large behavior.
(Note that for RF we do not report results for , in Fig. 6. As in Fig. 1, the resulting risk is slightly below the baseline : this effect vanishes for .)
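The ridge regression procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the figures: the activation is taken to be ReLU, the objective is assumed to be (1/n) * ||y - Z a||^2 + lam * ||a||^2 (the exact scaling of the penalty may differ from the paper's), and the function names `rf_features`, `nt_features` and `ridge_fit` are hypothetical.

```python
import numpy as np

def rf_features(X, W):
    """RF covariates: Z[i, j] = relu(<w_j, x_i>), shape (n, N)."""
    return np.maximum(X @ W.T, 0.0)

def nt_features(X, W):
    """NT covariates: block j of row i is relu'(<w_j, x_i>) * x_i, shape (n, N * d)."""
    act_grad = (X @ W.T > 0).astype(float)             # (n, N)
    Z = act_grad[:, :, None] * X[:, None, :]           # (n, N, d)
    return Z.reshape(X.shape[0], -1)

def ridge_fit(Z, y, lam):
    """Minimize (1/n) * ||y - Z a||^2 + lam * ||a||^2 (assumed scaling).
    Uses the primal normal equations when P <= n and the dual form otherwise."""
    n, P = Z.shape
    if P <= n:
        return np.linalg.solve(Z.T @ Z + n * lam * np.eye(P), Z.T @ y)
    return Z.T @ np.linalg.solve(Z @ Z.T + n * lam * np.eye(n), y)

# Usage: fit both linearized models on synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
W = rng.standard_normal((5, 4))
y = rng.standard_normal(30)
a_rf = ridge_fit(rf_features(X, W), y, lam=0.1)
a_nt = ridge_fit(nt_features(X, W), y, lam=0.1)
```

The dual branch matters for the NT model, where the number of covariates grows with both the number of neurons and the dimension and can easily exceed n.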
References
- [AB09] Martin Anthony and Peter L. Bartlett, Neural network learning: Theoretical foundations, Cambridge University Press, 2009.
- [ADH+19] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang, Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks, arXiv:1901.08584 (2019).
- [AM15] Ahmed El Alaoui and Michael W. Mahoney, Fast randomized kernel ridge regression with statistical guarantees, Advances in Neural Information Processing Systems, 2015, pp. 775–783.
- [AZLS18] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song, A convergence theory for deep learning via over-parameterization, arXiv:1811.03962 (2018).
- [Bac13] Francis Bach, Sharp analysis of low-rank kernel matrix approximations, Conference on Learning Theory, 2013, pp. 185–209.
- [Bac17a] Francis Bach, Breaking the curse of dimensionality with convex neural networks, The Journal of Machine Learning Research 18 (2017), no. 1, 629–681.
- [Bac17b] Francis Bach, On the equivalence between kernel quadrature rules and random feature expansions, The Journal of Machine Learning Research 18 (2017), no. 1, 714–751.
- [Bar93] Andrew R. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Transactions on Information Theory 39 (1993), no. 3, 930–945.
