Linearized two-layers neural networks in high dimension

Behrooz Ghorbani; Song Mei; Theodor Misiakiewicz; Andrea Montanari

arXiv:1904.12191·math.ST·February 18, 2020

TL;DR

This paper analyzes the approximation capabilities of linearized two-layer neural networks in high-dimensional settings, revealing how they fit polynomial functions and relate to kernel methods under different regimes.

Contribution

It provides a rigorous characterization of the polynomial approximation limits of random feature and neural tangent kernel models in high dimensions.

Findings

1. RF fits degree-ℓ polynomials in the approximation-limited regime

2. NT fits degree-(ℓ+1) polynomials in the approximation-limited regime

3. Kernel methods are limited to degree-ℓ polynomials in the sample size-limited regime

Equations

{\mathcal{F}}_{{\sf NN}}\equiv\Big\{f({\bm{x}})=\sum_{i=1}^{N}a_{i}\,\sigma(\langle{\bm{w}}_{i},{\bm{x}}\rangle)\;:\;\;a_{i}\in{\mathbb{R}},\;{\bm{w}}_{i}\in{\mathbb{R}}^{d}\;\;\forall i\leq N\Big\}\,.

f_{{\sf NN}}({\bm{x}})\approx f_{{\sf NN},0}({\bm{x}})+\sum_{i=1}^{N}(a_{i}-a_{0,i})\,\sigma(\langle{\bm{w}}_{0,i},{\bm{x}}\rangle)+\sum_{i=1}^{N}a_{0,i}\,\langle{\bm{w}}_{i}-{\bm{w}}_{0,i},{\bm{x}}\rangle\,\sigma^{\prime}(\langle{\bm{w}}_{0,i},{\bm{x}}\rangle)\,,

{\mathcal{F}}_{{\sf RF}}({\bm{W}})

{\mathcal{F}}_{{\sf NT}}({\bm{W}})

C_{1}(d)\,N^{-r/(d-1)}\leq\sup_{f\in W_{2}^{r}}\inf_{\hat{f}\in{\mathcal{F}}_{{\sf NN}}}\mathbb{E}\{[f({\bm{x}})-\hat{f}({\bm{x}})]^{2}\}\leq C_{2}(d)\,N^{-r/(d-1)}\,.

R_{{\sf M}}(f_{\star},{\bm{W}})=\inf_{f\in{\mathcal{F}}_{{\sf M}}({\bm{W}})}\mathbb{E}\big[(f_{\star}({\bm{x}})-f({\bm{x}}))^{2}\big]\,,\;\;\;\;{\sf M}\in\{{\sf RF},{\sf NT}\}\,.

\Big[d^{\ell}\cdot\min_{k\leq\ell}\lambda_{d,k}(\sigma_{d})^{2}\Big]\Big/\|\sigma_{d}(\langle{\bm{e}},\cdot\rangle)\|_{L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))}^{2}=\Omega_{d}(1),

\Big|R_{{\sf RF}}(f_{d},{\bm{W}})-R_{{\sf RF}}({\mathsf{P}}_{\leq\ell}f_{d},{\bm{W}})-\|{\mathsf{P}}_{>\ell}f_{d}\|_{L^{2}}^{2}\Big|

0\leq R_{{\sf RF}}({\mathsf{P}}_{\leq\ell}f_{d},{\bm{W}})\leq\varepsilon\|{\mathsf{P}}_{\leq\ell}f_{d}\|_{L^{2}}^{2}\,.

R_{{\sf RF}}(f_{d},{\bm{W}})\approx R_{{\sf RF}}({\mathsf{P}}_{\leq\ell}f_{d},{\bm{W}})+\|{\mathsf{P}}_{>\ell}f_{d}\|_{L^{2}}^{2}\,.

R_{{\sf RF}}(f_{d},{\bm{W}})=\|{\mathsf{P}}_{>\ell}f_{d}\|_{L^{2}}^{2}+\|f_{d}\|_{L^{2}}^{2}\cdot o_{d,\mathbb{P}}(1)\,.

\frac{\mu_{k_{1}}(x^{2}\sigma^{\prime})}{\mu_{k_{1}}(\sigma^{\prime})}\neq\frac{\mu_{k_{2}}(x^{2}\sigma^{\prime})}{\mu_{k_{2}}(\sigma^{\prime})}\,.

\Big|R_{{\sf NT}}(f_{d},{\bm{W}})-R_{{\sf NT}}({\mathsf{P}}_{\leq\ell+1}f_{d},{\bm{W}})-\|{\mathsf{P}}_{>\ell+1}f_{d}\|_{L^{2}}^{2}\Big|

0\leq R_{{\sf NT}}({\mathsf{P}}_{\leq\ell+1}f_{d},{\bm{W}})\leq\varepsilon\|{\mathsf{P}}_{\leq\ell+1}f_{d}\|_{L^{2}}^{2}\,.

\mu_{k}(\sigma^{\prime})=\frac{(-1)^{(k-1)/2}}{\sqrt{2\pi}}\,(k-2)!!\,\mathbf{1}_{k\ {\rm odd}}\,.

R_{{\sf RF}}(\sigma;{\bm{W}})=\|\sigma_{>\ell}\|_{L^{2}({\mathbb{R}},\gamma)}^{2}+o_{d,\mathbb{P}}(1)\,,\qquad R_{{\sf NT}}(\sigma;{\bm{W}})=\|\sigma_{>\ell+1}\|_{L^{2}({\mathbb{R}},\gamma)}^{2}+o_{d,\mathbb{P}}(1)\,.

H^{{\sf RF}}_{d}({\bm{x}}_{1},{\bm{x}}_{2}):=h^{{\sf RF}}_{d}\big(\langle{\bm{x}}_{1},{\bm{x}}_{2}\rangle/d\big)=\mathbb{E}\big\{\sigma(\langle{\bm{w}},{\bm{x}}_{1}\rangle)\,\sigma(\langle{\bm{w}},{\bm{x}}_{2}\rangle)\big\}\,.

H^{{\sf NT}}_{d}({\bm{x}}_{1},{\bm{x}}_{2}):=h^{{\sf NT}}_{d}\big(\langle{\bm{x}}_{1},{\bm{x}}_{2}\rangle/d\big)=(\langle{\bm{x}}_{1},{\bm{x}}_{2}\rangle/d)\,\mathbb{E}\big\{\sigma^{\prime}(\langle{\bm{w}},{\bm{x}}_{1}\rangle)\,\sigma^{\prime}(\langle{\bm{w}},{\bm{x}}_{2}\rangle)\big\}\,.

H_{d}({\bm{x}}_{1},{\bm{x}}_{2})=h_{d}(\langle{\bm{x}}_{1},{\bm{x}}_{2}\rangle/d)\,,

\hat{f}_{\lambda}=\arg\min_{f}\Big\{\sum_{i=1}^{n}\ell(y_{i},f({\bm{x}}_{i}))+\lambda\|f\|_{H}^{2}\Big\}\,,

\hat{f}_{\lambda}({\bm{x}})=\sum_{i=1}^{n}\hat{a}_{i}\,h_{d}(\langle{\bm{x}},{\bm{x}}_{i}\rangle/d)\,.

R_{H}(f_{\star},{\bm{X}})\equiv\min_{{\bm{a}}}\mathbb{E}_{\bm{x}}\Big\{\Big(f_{\star}({\bm{x}})-\sum_{i=1}^{n}a_{i}h_{d}(\langle{\bm{x}}_{i},{\bm{x}}\rangle/d)\Big)^{2}\Big\}\,.

R_{H}(f_{d},{\bm{X}})-R_{H}({\mathsf{P}}_{\leq\ell}f_{d},{\bm{X}})-\|{\mathsf{P}}_{>\ell}f_{d}\|_{L^{2}}^{2}\leq\varepsilon\|f_{d}\|_{L^{2}}\|{\mathsf{P}}_{>\ell}f_{d}\|_{L^{2}}\,.

\hat{{\bm{a}}}=({\bm{H}}+\lambda{\mathbf{I}}_{n})^{-1}{\bm{y}}\,,

H_{ij}=h_{d}(\langle{\bm{x}}_{i},{\bm{x}}_{j}\rangle/d)\,,

\hat{f}_{\lambda}({\bm{x}})={\bm{y}}^{\mathsf{T}}({\bm{H}}+\lambda{\mathbf{I}}_{n})^{-1}{\bm{h}}({\bm{x}})\,,

{\bm{h}}({\bm{x}})=[h_{d}(\langle{\bm{x}},{\bm{x}}_{1}\rangle/d),\ldots,h_{d}(\langle{\bm{x}},{\bm{x}}_{n}\rangle/d)]^{\mathsf{T}}\,.

R_{{\sf KR}}(f_{d},{\bm{X}},\lambda)\equiv

\xi_{d,k}(h_{d})=\int_{[-\sqrt{d},\sqrt{d}]}h_{d}\big(x/\sqrt{d}\big)\,Q_{k}^{(d)}(\sqrt{d}x)\,\tau^{1}_{d-1}({\rm d}x)\,,

\frac{d^{\ell}\min_{k\leq\ell}\xi_{d,k}(h_{d})}{\sum_{k\geq\ell+1}\xi_{d,k}(h_{d})B(d,k)}\geq c_{\ell}\,.

Behrooz Ghorbani (Department of Electrical Engineering, Stanford University), Song Mei (Institute for Computational and Mathematical Engineering, Stanford University), Theodor Misiakiewicz (Department of Statistics, Stanford University), Andrea Montanari (Department of Electrical Engineering and Department of Statistics, Stanford University)

Abstract

We consider the problem of learning an unknown function $f_{\star}$ on the $d$ -dimensional sphere with respect to the square loss, given i.i.d. samples $\{(y_{i},{\bm{x}}_{i})\}_{i\leq n}$ where ${\bm{x}}_{i}$ is a feature vector uniformly distributed on the sphere and $y_{i}=f_{\star}({\bm{x}}_{i})+\varepsilon_{i}$ . We study two popular classes of models that can be regarded as linearizations of two-layers neural networks around a random initialization: the random features model of Rahimi-Recht (RF); the neural tangent kernel model of Jacot-Gabriel-Hongler (NT). Both these approaches can also be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and enjoy universal approximation properties when the number of neurons $N$ diverges, for a fixed dimension $d$ .

We consider two specific regimes: the approximation-limited regime, in which $n=\infty$ while $d$ and $N$ are large but finite; and the sample size-limited regime in which $N=\infty$ while $d$ and $n$ are large but finite. In the first regime, we prove that if $d^{\ell+\delta}\leq N\leq d^{\ell+1-\delta}$ for small $\delta>0$, then RF effectively fits a degree-$\ell$ polynomial in the raw features, and NT fits a degree-$(\ell+1)$ polynomial. In the second regime, both RF and NT reduce to kernel methods with rotationally invariant kernels. We prove that, if the number of samples is $d^{\ell+\delta}\leq n\leq d^{\ell+1-\delta}$, then kernel methods can fit at most a degree-$\ell$ polynomial in the raw features. This lower bound is achieved by kernel ridge regression. Optimal prediction error is achieved for vanishing ridge regularization.

1 Introduction and main results
1.1 Background
1.2 A parenthesis
1.3 A numerical experiment
1.4 Summary of main results
2 Approximation error of linearized neural networks
2.1 Approximation error of random features models
2.2 Approximation error of neural tangent models
2.3 Separation between NN and RF, NT
3 Generalization error of kernel methods
3.1 Lower bound for general kernel methods
3.2 Upper bound for kernel ridge regression
3.3 Separation between kernel methods and neural networks
3.4 Near-optimality of interpolators
3.5 A conjecture for generalization error of random features model
4 Further related work
5 Technical background
5.1 Functional spaces over the sphere
5.2 Gegenbauer polynomials
5.3 Hermite polynomials
5.4 Notations
6 Proof of Theorem 1.(a): RF model lower bound
6.1 Proof of Theorem 1.(a): Outline
6.2 Proof of Proposition 1
6.3 Proof of Proposition 2
6.4 Proof of Proposition 3
7 Proof of Theorem 1.(b): RF model upper bound
8 Proof of Theorem 2.(a): NT model lower bound
8.1 Preliminaries
8.2 Proof of Theorem 2.(a): Outline
8.3 Proof of Proposition 4
8.4 Proof of Proposition 5
8.4.1 Auxiliary lemmas
8.4.2 Proof of Proposition 5
9 Proof of Theorem 2.(b): NT model upper bound
10 Proof of Theorem 4: risk for KR
10.1 Proof of Theorem 4
10.2 Auxiliary results
A Numerical results with ridge regression

1 Introduction and main results

In the canonical statistical learning problem, we are given independent and identically distributed (i.i.d.) pairs $(y_{i},{\bm{x}}_{i})$ , $1\leq i\leq n$ , where ${\bm{x}}_{i}\in{\mathbb{R}}^{d}$ is a feature vector and $y_{i}\in{\mathbb{R}}$ is a label or response variable. We would like to construct a function $f$ which allows us to predict future responses. Throughout this paper, we will measure the quality of a predictor $f$ via its square prediction error (risk): $R(f)\equiv\mathbb{E}\{(y-f({\bm{x}}))^{2}\}$ .

1.1 Background

For a number of important applications, state-of-the-art performances are obtained by representing the function $f$ as a multi-layers neural network. The simplest model in this class is given by two-layers networks (NN):

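$${\mathcal{F}}_{{\sf NN}}\equiv\Big\{f({\bm{x}})=\sum_{i=1}^{N}a_{i}\,\sigma(\langle{\bm{w}}_{i},{\bm{x}}\rangle)\;:\;\;a_{i}\in{\mathbb{R}},\;{\bm{w}}_{i}\in{\mathbb{R}}^{d}\;\;\forall i\leq N\Big\}\,.$$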

Here $N$ is the number of neurons and $\sigma:{\mathbb{R}}\to{\mathbb{R}}$ is an activation function.

Two-layers neural networks were extensively studied in the nineties, with a focus on two goals: $(i)$ establishing approximation guarantees over classical function spaces; $(ii)$ controlling the generalization error via Rademacher complexity arguments. We refer to [Pin99, AB09] for surveys of these results.

Computational aspects were notably under-represented within these early theoretical contributions. By contrast, it is now increasingly clear that computational and statistical aspects cannot be separated in the analysis of neural networks (see, e.g. [SHN+18, MMN18, CB18]). Indeed, the optimization algorithm does not simply compute the unique minimizer of a regularized empirical risk: it instead selects one among many possible near-minimizers, whose generalization properties can vary significantly. Therefore, the specific optimization algorithm is an integral part of the definition of the regularization method.

A concrete scenario in which this interplay can be understood precisely is the so-called ‘neural tangent kernel’ regime. First explicitly described in [JGH18], this regime has attracted a considerable amount of work. The basic idea is that, for highly overparametrized networks, the network weights barely change from their random initialization. We can therefore replace the nonlinear function class ${\mathcal{F}}_{{\sf NN}}$ by its first order Taylor expansion around this initialization.

Denoting by $(a_{0,i},{\bm{w}}_{0,i})_{i\leq N}$ the weights at initialization, a first order Taylor expansion yields

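$$f_{{\sf NN}}({\bm{x}})\approx f_{{\sf NN},0}({\bm{x}})+\sum_{i=1}^{N}(a_{i}-a_{0,i})\,\sigma(\langle{\bm{w}}_{0,i},{\bm{x}}\rangle)+\sum_{i=1}^{N}a_{0,i}\,\langle{\bm{w}}_{i}-{\bm{w}}_{0,i},{\bm{x}}\rangle\,\sigma^{\prime}(\langle{\bm{w}}_{0,i},{\bm{x}}\rangle)\,,$$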

where $f_{{\sf NN},0}$ is the neural network at initialization. In other words, $f_{{\sf NN}}-f_{{\sf NN},0}$ is a function in the direct sum ${\mathcal{F}}_{{\sf NT}}({\bm{W}})\oplus{\mathcal{F}}_{{\sf RF}}({\bm{W}})$ , where we defined

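$${\mathcal{F}}_{{\sf RF}}({\bm{W}})\equiv\Big\{f({\bm{x}})=\sum_{i=1}^{N}a_{i}\,\sigma(\langle{\bm{w}}_{i},{\bm{x}}\rangle)\;:\;\;a_{i}\in{\mathbb{R}}\;\;\forall i\leq N\Big\}\,,$$

$${\mathcal{F}}_{{\sf NT}}({\bm{W}})\equiv\Big\{f({\bm{x}})=\sum_{i=1}^{N}\langle{\bm{a}}_{i},{\bm{x}}\rangle\,\sigma^{\prime}(\langle{\bm{w}}_{i},{\bm{x}}\rangle)\;:\;\;{\bm{a}}_{i}\in{\mathbb{R}}^{d}\;\;\forall i\leq N\Big\}\,.$$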

Here ${\bm{W}}\in{\mathbb{R}}^{N\times d}$ is a matrix whose $i$ -th row is the vector ${\bm{w}}_{i}$ , and $\sigma^{\prime}$ is the derivative of the activation function with respect to its argument (if $\langle{\bm{w}}_{i},{\bm{x}}\rangle$ has a density, $\sigma$ only needs to be weakly differentiable).
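The first order expansion can be checked numerically. The following is a minimal sketch, not code from the paper: the activation (tanh), the sizes, the perturbation directions and all function names are illustrative assumptions. It compares $f_{{\sf NN}}$ at perturbed weights with the value at initialization plus the RF term (second-layer change) and the NT term (first-layer change); the gap should shrink quadratically in the perturbation size.

```python
import math
import random

def sigma(u):
    # Illustrative smooth activation (an assumption; any differentiable sigma works).
    return math.tanh(u)

def sigma_prime(u):
    return 1.0 - math.tanh(u) ** 2

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def f_nn(x, a, W):
    # Two-layers network: sum_i a_i * sigma(<w_i, x>).
    return sum(ai * sigma(dot(wi, x)) for ai, wi in zip(a, W))

def linearized(x, a, W, a0, W0):
    # First order Taylor expansion around (a0, W0).
    # RF term: only the second-layer weights move.
    rf = sum((ai - a0i) * sigma(dot(w0i, x)) for ai, a0i, w0i in zip(a, a0, W0))
    # NT term: only the first-layer weights move.
    nt = sum(
        a0i * dot([wij - w0ij for wij, w0ij in zip(wi, w0i)], x) * sigma_prime(dot(w0i, x))
        for a0i, wi, w0i in zip(a0, W, W0)
    )
    return f_nn(x, a0, W0) + rf + nt

def linearization_error(eps, d=5, N=8, seed=0):
    # Perturb a random initialization by eps in a random direction and
    # return |f_NN - linearization| at a random input x.
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(d)]
    a0 = [rng.gauss(0, 1) for _ in range(N)]
    W0 = [[rng.gauss(0, 1) / math.sqrt(d) for _ in range(d)] for _ in range(N)]
    da = [rng.gauss(0, 1) for _ in range(N)]
    dW = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
    a = [a0i + eps * dai for a0i, dai in zip(a0, da)]
    W = [[w + eps * dw for w, dw in zip(w0i, dwi)] for w0i, dwi in zip(W0, dW)]
    return abs(f_nn(x, a, W) - linearized(x, a, W, a0, W0))

if __name__ == "__main__":
    for eps in (1e-1, 1e-2, 1e-3):
        print(eps, linearization_error(eps))
```

The two sums in `linearized` are elements of ${\mathcal{F}}_{{\sf RF}}({\bm{W}})$ and ${\mathcal{F}}_{{\sf NT}}({\bm{W}})$ respectively, with ${\bm{W}}$ the weights at initialization.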

We will refer to ${\mathcal{F}}_{{\sf RF}}({\bm{W}})$ as the ‘random features’ (RF) model: it amounts to fixing the first layer, and only optimizing the coefficients in the second layer. Equivalently, ${\mathcal{F}}_{{\sf RF}}({\bm{W}})$ corresponds to the first order Taylor expansion of $f_{{\sf NN}}$ with respect to the second layer weights $(a_{i})_{i\leq N}$ . This model can be traced back to the work of Neal [Nea96], and was successfully developed by Rahimi and Recht [RR08] as a randomized approximation to kernel methods.

The second function class ${\mathcal{F}}_{{\sf NT}}({\bm{W}})$ corresponds to the first order Taylor expansion of $f_{{\sf NN}}$ with respect to the first layer weights $({\bm{w}}_{i})_{i\leq N}$ [JGH18]. We will refer to ${\mathcal{F}}_{{\sf NT}}({\bm{W}})$ as the neural tangent class. (Footnote 1: Often the term ‘neural tangent’ is reserved for the direct sum ${\mathcal{F}}_{{\sf NT}}({\bm{W}})\oplus{\mathcal{F}}_{{\sf RF}}({\bm{W}})$. We find it more convenient to give distinct names to each of the two terms, especially since ${\mathcal{F}}_{{\sf RF}}({\bm{W}})$ has much smaller dimension than ${\mathcal{F}}_{{\sf NT}}({\bm{W}})$ for large $d$.)

A sequence of recent papers proves that, in a certain overparametrized regime, gradient descent (GD) applied to the nonlinear neural network class ${\mathcal{F}}_{{\sf NN}}$ effectively converges to a model in ${\mathcal{F}}_{{\sf NT}}({\bm{W}})\oplus{\mathcal{F}}_{{\sf RF}}({\bm{W}})$. Namely, if the number of neurons $N$ is larger than a threshold $N_{0}(n,d)$, and training is initialized with $f_{0}({\bm{x}})=N^{-1/2}\sum_{i=1}^{N}a_{0,i}\,\sigma(\langle{\bm{w}}_{0,i},{\bm{x}}\rangle)$ where $\{(a_{0,i},{\bm{w}}_{0,i})\}_{i\leq N}\sim_{iid}{\sf N}(0,1)\otimes{\sf N}(0,{\mathbf{I}}_{d}/d)$, then gradient descent converges exponentially fast to weights $\{(a_{i},{\bm{w}}_{i})\}_{i\leq N}$ such that $f-f_{0}$ is well approximated by a function in ${\mathcal{F}}_{{\sf NT}}({\bm{W}})\oplus{\mathcal{F}}_{{\sf RF}}({\bm{W}})$. The specific value of the threshold $N_{0}(n,d)$ for the onset of this NT regime has been steadily pushed down over the last year [DZPS18, DLL+18, AZLS18, ZCZG18, ADH+19].

Does the NT regime explain the power of multi-layers neural networks, when trained by gradient descent methods? From an empirical point of view, the evidence is not univocal [LXS+19, GSJW19, COB19]. From a theoretical point of view, while the expressivity of neural networks is superior to that of NT models, this hypothesis is not easy to dismiss for at least two reasons. First, neural networks learned by gradient descent algorithms form a significantly smaller class than general networks. Second, the answer depends on the data distribution, the target function $f_{\star}$ and the sample size.

In order to clarify this question, we explore the behavior of RF and NT models in the high-dimensional setting. More precisely, we consider two specific asymptotic regimes:

$(i)$ The infinite sample size case in which $n=\infty$, and $N,d$ diverge while being polynomially related. In this case the prediction error reduces to the approximation error $\inf_{f\in{\mathcal{F}}_{{\sf M}}}\mathbb{E}\{[f_{\star}({\bm{x}})-f({\bm{x}})]^{2}\}$, for either model ${\sf M}\in\{{\sf NT},{\sf RF}\}$.

$(ii)$ The infinite width regime in which $N=\infty$ and $n,d$ diverge while being polynomially related. In this case (and under a suitable bound on the $\ell_{2}$ norm of the coefficients) both classes ${\mathcal{F}}_{\sf RF}$, ${\mathcal{F}}_{\sf NT}$ reduce to certain reproducing kernel Hilbert spaces (RKHS).

In both cases we obtain sharp results, up to errors vanishing as $d\to\infty$. Crucially, our results hold pointwise, i.e. they provide a characterization of approximation and generalization error that holds for a given function $f_{\star}$. This allows us to derive precise separation results between NN and NT models.

1.2 A parenthesis

The approximation properties of neural networks have been studied for over three decades [DHM89, Cyb89, Hor91, Bar93, MM94, GJP95, Mha96, Pet98, Mai99, Pin99]. It is useful to discuss the relation between the questions outlined above and existing literature.

A number of results are available on the approximation of functions in certain smoothness classes by two-layers neural networks. In particular [Bar93] controls smoothness by the average frequency content in the Fourier transform (the ‘Barron norm’), while [Mha96, Pet98, Mai99] use classical Sobolev norms. For instance [Mai99] proves that $N$ -neurons NN approximate functions in the Sobolev ball $W^{r}_{2}$ with worst case error

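$$C_{1}(d)\,N^{-r/(d-1)}\leq\sup_{f\in W_{2}^{r}}\inf_{\hat{f}\in{\mathcal{F}}_{{\sf NN}}}\mathbb{E}\{[f({\bm{x}})-\hat{f}({\bm{x}})]^{2}\}\leq C_{2}(d)\,N^{-r/(d-1)}\,,\qquad(1)$$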

for some unspecified functions $C_{1},C_{2}$ . (Similar results are found in [Pet98].) These results cannot be used for our purposes.

First of all, we are interested in the NT class which is potentially much less powerful than NN.

Second, bounds of the type (1) make it hard to prove separation results between NN and NT. In order to prove such a separation, we would have to prove that neural networks trained by gradient descent have good approximation properties, uniformly over Sobolev balls. This objective is currently out of reach. Our pointwise approximation results make it much easier to prove separation statements.

Third, earlier work neglects polynomial dependencies in $d$ . Bounds of the type (1) have weak implications when both $d$ and $N$ are large, say $d=100$ , $N=10^{6}$ . We will instead prove sharp asymptotic results that are valid in this regime. As illustrated in the next section, our analysis captures the actual behavior in a quantitative manner, already when $d\geq 100$ .

Quantitative results in the high-dimensional regime have been proved only recently. In particular, Bach [Bac17b] established quantitative upper and lower bounds for the approximation error in the RF model. However, these results do not have direct implications on the NT model which is our main interest here. Further, lower bounds in [Bac17b] are, as before, worst case over a certain RKHS. (See also [Bac13, AM15, RR17] for related work.)

Similar considerations apply to the generalization error of kernel methods. While this is a classical topic [CST+00, CDV07, RR17, LR18], earlier work proves minimax upper and lower bounds. Establishing pointwise lower bounds is instead important in order to understand precisely the separation between neural networks and their linearized counterparts. We refer to Section 4 for further discussion of related work.

1.3 A numerical experiment

In order to illustrate the approximation behavior of RF and NT models, we present a simple simulation study. We consider feature vectors normalized so that $\|{\bm{x}}_{i}\|_{2}^{2}=d$ , and otherwise uniformly random, and responses $y_{i}=f_{\star}({\bm{x}}_{i})$ , for a certain function $f_{\star}$ . Indeed, this will be the setting throughout the paper: ${\bm{x}}_{i}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ (where $\mathbb{S}^{d-1}(r)$ denotes the sphere with radius $r$ in $d$ dimensions) and $f_{\star}:\mathbb{S}^{d-1}(\sqrt{d})\to{\mathbb{R}}$ . We draw random weights $({\bm{w}}_{i})_{i\leq N}\sim_{iid}{\sf Unif}(\mathbb{S}^{d-1}(1))$ . We use $n$ samples to fit a model in ${\mathcal{F}}_{{\sf RF}}({\bm{W}})$ or ${\mathcal{F}}_{{\sf NT}}({\bm{W}})$ . We learn the model parameters using least squares. If the model is overparametrized, we select the minimum $\ell_{2}$ -norm solution. (We refer to Appendix A for simulations using ridge regression instead.) We estimate the risk (test error) using $n_{\mbox{\tiny\rm test}}=1500$ fresh samples, and normalize it by the risk of the trivial model $R_{0}=\mathbb{E}\{f_{\star}({\bm{x}})^{2}\}$ .

Figures 1, 2, 3 report the results of such a simulation using RF –for Figure 1– and NT –for Figures 2 and 3. We use shifted ReLU activations $\sigma(u)=\max(u-u_{0},0)$ , $u_{0}=0.5$ . The choice of $u_{0}=0.5$ is not essential: (Lebesgue-)almost every $u_{0}\neq 0$ has similar behavior. In contrast, the case $u_{0}=0$ is degenerate because $\max(u,0)$ is equal to a linear function plus an even function.
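The degeneracy at $u_{0}=0$ can be spelled out in one line (a worked identity added for illustration, not taken from the paper):

```latex
\max(u,0)=\tfrac{1}{2}\,u+\tfrac{1}{2}\,|u|\,.
```

The first term is linear and the second is even, so the odd part of the unshifted ReLU is exactly linear and carries no odd component of degree three or higher. For $u_{0}\neq 0$ the odd part of $\max(u-u_{0},0)$ is no longer linear, which removes this special structure.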

The target functions $f_{\star}$ in these examples are quite simple. Figures 1 and 2 use a quadratic function $f_{\star,2}({\bm{x}})=\sum_{i\leq\lfloor d/2\rfloor}x_{i}^{2}-\sum_{i>\lfloor d/2\rfloor}x_{i}^{2}$ . In Figure 3, the target function is a third-order polynomial $f_{\star,3}({\bm{x}})=\sum_{i=1}^{d}(x_{i}^{3}-3x_{i})$ .
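The simulation setup described above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' code: the function names, the default sizes and the seed handling are hypothetical, and `numpy.linalg.lstsq` is used because it returns the minimum $\ell_{2}$-norm solution when the system is underdetermined.

```python
import numpy as np

def sample_sphere(n, d, radius, rng):
    # Uniform points on the sphere of given radius: normalize Gaussian vectors.
    g = rng.standard_normal((n, d))
    return radius * g / np.linalg.norm(g, axis=1, keepdims=True)

def shifted_relu(u, u0=0.5):
    return np.maximum(u - u0, 0.0)

def shifted_relu_prime(u, u0=0.5):
    # Weak derivative of the shifted ReLU.
    return (u > u0).astype(float)

def f_star_2(X):
    # Quadratic target: sum_{i <= d/2} x_i^2 - sum_{i > d/2} x_i^2.
    h = X.shape[1] // 2
    return (X[:, :h] ** 2).sum(axis=1) - (X[:, h:] ** 2).sum(axis=1)

def f_star_3(X):
    # Cubic target: sum_i (x_i^3 - 3 x_i).
    return (X ** 3 - 3.0 * X).sum(axis=1)

def features(X, W, model):
    U = X @ W.T                                   # (n, N) pre-activations <w_i, x>
    if model == "RF":
        return shifted_relu(U)                    # p = N features
    if model == "NT":
        S = shifted_relu_prime(U)                 # (n, N)
        # Feature block i is sigma'(<w_i, x>) * x, giving p = N * d features.
        return (S[:, :, None] * X[:, None, :]).reshape(X.shape[0], -1)
    raise ValueError("model must be 'RF' or 'NT'")

def normalized_risk(f_star, model, d, N, n, n_test=1500, seed=0):
    rng = np.random.default_rng(seed)
    W = sample_sphere(N, d, 1.0, rng)             # weights on the unit sphere
    X = sample_sphere(n, d, np.sqrt(d), rng)      # features with ||x||^2 = d
    X_test = sample_sphere(n_test, d, np.sqrt(d), rng)
    # Least squares fit; minimum-norm solution if overparametrized.
    coef, *_ = np.linalg.lstsq(features(X, W, model), f_star(X), rcond=None)
    y_test = f_star(X_test)
    err = np.mean((y_test - features(X_test, W, model) @ coef) ** 2)
    return err / np.mean(y_test ** 2)             # divide by an estimate of R_0

if __name__ == "__main__":
    print(normalized_risk(f_star_2, "RF", d=20, N=100, n=400))
    print(normalized_risk(f_star_2, "NT", d=20, N=100, n=400))
```

The sizes in the usage lines are small so that the sketch runs quickly; they are not the values used for the figures.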

The results are somewhat disappointing: in two cases (first and third figures) RF and NT models do not beat the trivial predictor. In one case (the second one), the NT model surpasses the trivial baseline, and its risk appears to decrease to [math] as the number of samples $n$ increases. We also note that the risk shows a cusp when $n\approx p$, with $p$ the number of parameters ($p=N$ for RF, and $p=Nd$ for NT). This phenomenon is related to overparametrization, and will not be discussed further in this paper (see [BHMM18, BHX19, HMRT19, MM19] for relevant work). We will instead focus on the population behavior $n\to\infty$.

In other words, the RF model does not appear to be able to learn a simple quadratic function, and the NT model does not appear to be able to learn a third order polynomial. Our main theorems (presented in the next sections) capture in a precise manner this behavior. In particular,

• We will prove that for $N=O_{d}(d^{2-\delta})$, RF does not outperform the trivial predictor on any function that has vanishing projection on linear functions. Similarly, NT does not outperform the trivial predictor on any function that has vanishing projection on linear and quadratic functions.

• In contrast, there exist neural networks in ${\mathcal{F}}_{{\sf NN}}$ with $N=O_{d}(d)$ neurons, and a small approximation error both for $f_{\star,2}$ and $f_{\star,3}$ (see, e.g., [Bac17b], or [MMN18, Proposition 1]).

These two points illustrate the gap in approximation power between NT (or RF) and NN.

We demonstrate the second point empirically in Fig. 4 by choosing weight vectors ${\bm{w}}_{i}=s_{i}{\bm{e}}_{r(i)}$, where $r(i)\sim{\sf Unif}([d])$ are i.i.d. uniformly random coordinate indices, and the scaling factor is $s_{i}\sim{\sf N}(0,1)$. Fixing these random first-layer weights, we fit the second-layer weights $a_{i}$ by least squares. The risk achieved is an upper bound on the minimum risk in the ${\sf NN}$ model, namely $R_{{\sf NN}}(f_{\star})\equiv\inf_{f\in{\mathcal{F}}_{{\sf NN}}}\mathbb{E}\{(f_{\star}({\bm{x}})-f({\bm{x}}))^{2}\}$, and is significantly smaller than the baseline $R_{0}$. (The risk reported in Fig. 4 can also be interpreted as a ‘random features’ risk. However, the specific distribution of the vectors ${\bm{w}}_{i}$ is tailored to the function $f_{\star}$, and hence not achievable within the RF model.)

1.4 Summary of main results

Approximation error of RF models.

If $d^{1+\delta}<N\leq d^{2-\delta}$ for some $\delta>0$ , then the approximation error of RF is asymptotically equivalent to the approximation error of fitting a linear function in the raw covariates ${\bm{x}}$ (i.e. least squares with the model $f({\bm{x}})=b_{0}+\langle{\bm{\beta}},{\bm{x}}\rangle$ , $b_{0}\in{\mathbb{R}}$ , ${\bm{\beta}}\in{\mathbb{R}}^{d}$ ). More generally, if $d^{\ell+\delta}\leq N\leq d^{\ell+1-\delta}$ , then RF is equivalent to fitting a linear function over all monomials of degree at most $\ell$ in ${\bm{x}}$ .

The equivalence between RF regression and polynomial regression holds pointwise, i.e. for each given target function $f_{\star}$.

Approximation error of NT models.

If $d^{1+\delta}\leq N\leq d^{2-\delta}$ , then the approximation error of NT is asymptotically equivalent to the approximation error of fitting a linear function over monomials of degree at most two in ${\bm{x}}$ (i.e. least squares with the model $f({\bm{x}})=b_{0}+\langle{\bm{\beta}},{\bm{x}}\rangle+\langle{\bm{x}},{\bm{B}}{\bm{x}}\rangle$ , $b_{0}\in{\mathbb{R}}$ , ${\bm{\beta}}\in{\mathbb{R}}^{d}$ , ${\bm{B}}\in{\mathbb{R}}^{d\times d}$ ). More generally, if $d^{\ell+\delta}\leq N\leq d^{\ell+1-\delta}$ , then NT is equivalent to fitting a linear function over all monomials of degree at most $\ell+1$ in ${\bm{x}}$ .

Again, this result holds pointwise over the choice of $f_{\star}$ .
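The two statements above amount to a simple rule relating the number of neurons to the fitted polynomial degree. The helper below is a hypothetical illustration (its name, the default $\delta=0.1$ and the handling of the gaps between windows are assumptions); the results are asymptotic in $d$, so for finite $d$ the output is only indicative.

```python
import math

def fitted_degree(N, d, model, delta=0.1):
    """Polynomial degree fitted by RF or NT with N neurons in dimension d.

    Returns None if N lies outside every window
    d^(l + delta) <= N <= d^(l + 1 - delta), where the statements do not apply.
    """
    if model not in ("RF", "NT"):
        raise ValueError("model must be 'RF' or 'NT'")
    if d < 2 or N < 1:
        raise ValueError("need d >= 2 and N >= 1")
    exponent = math.log(N) / math.log(d)      # N = d^exponent
    level = math.floor(exponent)              # candidate value of l
    if not (level + delta <= exponent <= level + 1 - delta):
        return None
    # RF fits degree l, NT fits degree l + 1.
    return level if model == "RF" else level + 1

if __name__ == "__main__":
    print(fitted_degree(50_000, 100, "RF"))   # N is about d^2.35
    print(fitted_degree(50_000, 100, "NT"))
```

With $d=100$ and $N=50{,}000\approx d^{2.35}$, the rule gives degree 2 for RF and degree 3 for NT.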

Generalization error of kernel methods.

We study the generalization error of kernel methods under the same data distribution described above, for any rotationally invariant kernel on the sphere $\mathbb{S}^{d-1}(\sqrt{d})$ . We prove two results:

1. If the sample size is $n\leq d^{\ell+1-\delta}$, then the generalization error of any kernel method is lower bounded by the approximation error of linear regression over monomials of degree at most $\ell$ in ${\bm{x}}$.

2. If the sample size satisfies $d^{\ell+\delta}\leq n\leq d^{\ell+1-\delta}$, then the generalization error of Kernel Ridge Regression (KRR) is given by the approximation error of linear regression over monomials of degree at most $\ell$ in ${\bm{x}}$.

It is worth emphasizing two aspects of this last result. The first one is its generality. The NT kernel associated to an infinitely wide multi-layers fully connected neural network is always rotationally invariant (assuming an i.i.d. Gaussian initialization of weights, which is common in practice). Therefore, in the NT regime, multi-layers neural networks cannot outperform the trivial predictor on a target function $f_{\star}({\bm{x}})$ that has vanishing projection onto degree-$\ell$ polynomials, unless the sample size satisfies $n\geq d^{\ell+1-\delta}$. (For instance, they cannot outperform the trivial predictor for $f_{\star}({\bm{x}})=x_{1}^{3}-3x_{1}$ unless $n\geq d^{3-\delta}$.)

The second aspect can be summarized as follows.

Optimality of near interpolators.

For $d^{\ell+\delta}\leq n\leq d^{\ell+1-\delta}$ , the ideal behavior of KRR is achieved for all regularization values $\lambda\leq\lambda_{*}$ , with $\lambda_{*}$ depending on $N,d$ and the activation function. In particular, it is achieved by ‘near interpolators’ (corresponding to $\lambda\approx 0$ ) i.e. functions $\hat{f}$ that have negligible training error.
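For concreteness, kernel ridge regression with a rotationally invariant kernel and a near-interpolating regularization can be sketched as below. This is a minimal illustration, not the authors' code: the kernel profile $h(t)=e^{t}$, the sizes, the linear target and the function names are all assumptions, and the coefficients use the standard closed form for square loss, $\hat{{\bm{a}}}=({\bm{H}}+\lambda{\mathbf{I}}_{n})^{-1}{\bm{y}}$.

```python
import numpy as np

def sample_sphere(n, d, rng):
    # Uniform points on the sphere of radius sqrt(d).
    g = rng.standard_normal((n, d))
    return np.sqrt(d) * g / np.linalg.norm(g, axis=1, keepdims=True)

def kernel(X1, X2, h=np.exp):
    # Rotationally invariant kernel H(x1, x2) = h(<x1, x2> / d).
    # The profile h(t) = exp(t) is an illustrative assumption.
    d = X1.shape[1]
    return h(X1 @ X2.T / d)

def krr_fit(X, y, lam):
    # Closed form for square loss: a_hat = (H + lam * I)^{-1} y.
    H = kernel(X, X)
    return np.linalg.solve(H + lam * np.eye(len(y)), y)

def krr_predict(X_train, a_hat, X_new):
    # f_hat(x) = sum_i a_hat_i * h(<x, x_i> / d).
    return kernel(X_new, X_train) @ a_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 5, 15
    X = sample_sphere(n, d, rng)
    y = X[:, 0]                                   # a linear target, for illustration
    a_hat = krr_fit(X, y, lam=1e-10)
    # Largest training residual: negligible for lam close to zero.
    print(np.max(np.abs(krr_predict(X, a_hat, X) - y)))
```

Taking `lam` close to zero gives a ‘near interpolator’ in the sense above: the fitted function reproduces the training responses up to a negligible error.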

2 Approximation error of linearized neural networks

In this section, we state formally our results about the approximation error of ${\sf RF}$ and ${\sf NT}$ models. We define the minimum population error for any of the models ${\sf M}\in\{{\sf RF},{\sf NT}\}$ by

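$$R_{{\sf M}}(f_{\star},{\bm{W}})=\inf_{f\in{\mathcal{F}}_{{\sf M}}({\bm{W}})}\mathbb{E}\big[(f_{\star}({\bm{x}})-f({\bm{x}}))^{2}\big]\,,\;\;\;\;{\sf M}\in\{{\sf RF},{\sf NT}\}\,.$$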

Notice that this is a random variable because of the random features encoded in the matrix ${\bm{W}}\in{\mathbb{R}}^{N\times d}$ . Also, it depends implicitly on $d,N$ , but we will make this dependence explicit only when necessary.

For $\ell\in{\mathbb{N}}$ , we denote by ${\mathsf{P}}_{\leq\ell}:L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))\to L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))$ the orthogonal projector onto the subspace of polynomials of degree at most $\ell$ . (We also let ${\mathsf{P}}_{>\ell}={\mathbf{I}}-{\mathsf{P}}_{\leq\ell}$ .) In other words, ${\mathsf{P}}_{\leq\ell}f$ is the function obtained by linear regression of $f$ onto monomials of degree at most $\ell$ . Throughout this paper ‘with high probability’ means ‘with probability converging to one as $d,N\to\infty$ ’. The notations $s_{d}=\omega_{d}(t_{d})$ , $s_{d}=o_{d}(t_{d})$ , $s_{d}=O_{d}(t_{d})$ , $s_{d}=\Omega_{d}(t_{d})$ mean, respectively, $\lim_{d\to\infty}|s_{d}/t_{d}|=\infty$ , $\lim_{d\to\infty}|s_{d}/t_{d}|=0$ , $\lim\sup_{d\to\infty}|s_{d}/t_{d}|<\infty$ , $\lim\inf_{d\to\infty}|s_{d}/t_{d}|>0$ . Given random variables $X_{d}$ , and deterministic quantities $t_{d}$ , we write $X_{d}=o_{d,\mathbb{P}}(t_{d})$ (and so on) if the above holds in probability.
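The range of ${\mathsf{P}}_{\leq\ell}$ has dimension of order $d^{\ell}$ for fixed $\ell$, which matches the thresholds $d^{\ell}$ appearing in the statements of this paper. The sketch below uses the standard dimension count for spherical harmonics (a textbook fact, not code from the paper; the function names are hypothetical).

```python
from math import comb

def harmonic_dim(d, k):
    # Dimension of the space of degree-k spherical harmonics on the sphere in R^d (d >= 2).
    if k == 0:
        return 1
    if k == 1:
        return d
    return comb(d + k - 1, k) - comb(d + k - 3, k - 2)

def poly_dim(d, ell):
    # Dimension of polynomials of degree <= ell restricted to the sphere,
    # i.e. the dimension of the range of the projector P_{<= ell}.
    return sum(harmonic_dim(d, k) for k in range(ell + 1))

if __name__ == "__main__":
    d = 100
    for ell in range(4):
        print(ell, poly_dim(d, ell), d ** ell)
```

For $d=100$ this gives dimensions 1, 101 and 5150 for $\ell=0,1,2$, to be compared with $d^{\ell}=1,100,10^{4}$.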

2.1 Approximation error of random features models

Assumption 1 (Assumptions for the RF model at level $\ell\in{\mathbb{N}}$ ).

Let $\{\sigma_{d}\}_{d\geq 1}$ be a sequence of functions $\sigma_{d}:{\mathbb{R}}\to{\mathbb{R}}$ .

(a) $\sigma_{d}\in L^{2}([-\sqrt{d},\sqrt{d}],\tau^{1}_{d-1})$, where $\tau^{1}_{d-1}$ is the distribution of $\langle{\bm{x}},{\bm{e}}\rangle$ for ${\bm{x}}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, and ${\bm{e}}=(1,0,\ldots,0)^{\mathsf{T}}\in\mathbb{R}^{d}$.

(b) We have

$$\Big[d^{\ell}\cdot\min_{k\leq\ell}\lambda_{d,k}(\sigma_{d})^{2}\Big]\Big/\|\sigma_{d}(\langle{\bm{e}},\cdot\rangle)\|_{L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))}^{2}=\Omega_{d}(1)\,,$$

where $\lambda_{d,k}(\sigma_{d})=\langle\sigma_{d}(\langle{\bm{e}},\cdot\rangle),Q_{k}(\sqrt{d}\langle{\bm{e}},\cdot\rangle)\rangle_{L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))}$, and $Q_{k}$ is the $k$-th Gegenbauer polynomial (see Section 5).

Theorem 1 (Risk of the RF model).

Let $\{f_{d}\in L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))\}_{d\geq 1}$ be a sequence of functions. Let ${\bm{W}}=({\bm{w}}_{i})_{i\in[N]}$ with $({\bm{w}}_{i})_{i\in[N]}\sim{\sf Unif}(\mathbb{S}^{d-1})$ independently. Then the following hold.

(a)

Assume $N\leq d^{\ell+1-\delta_{d}}$ for a fixed integer $\ell$ and any sequence $\delta_{d}$ such that $\delta_{d}^{2}\log d\to\infty$ (in particular, $N\leq d^{\ell+1-\delta}$ is sufficient for any fixed $\delta>0$ ). Let $\{\sigma_{d}\}_{d\geq 1}$ satisfy Assumption 1.(a). Then, for any $\varepsilon>0$ , the following holds with high probability:

[TABLE]

(b)

Assume $N=\omega_{d}(d^{\ell})$ for some integer $\ell$ , and $\{\sigma_{d}\}_{d\geq 1}$ satisfy Assumption 1.(b) at level $\ell$ . Then for any $\varepsilon>0$ , the following holds with high probability:

[TABLE]

See Section 6 for the proof of the lower bound, and Section 7 for the proof of the upper bound.

In words, Eq. (3) amounts to saying that when $N=O_{d}(d^{\ell+1-\delta_{d}})$ , the risk of the random feature model can be approximately decomposed into two parts, each non-negative, and each with a simple interpretation:

[TABLE]

The second contribution, $\|{\mathsf{P}}_{>\ell}f_{d}\|_{L^{2}}^{2}$ is simply the risk achieved by linear regression with respect to polynomials of degree at most $\ell$ . The first contribution $R_{{\sf RF}}({\mathsf{P}}_{\leq\ell}f_{d},{\bm{W}})$ is the risk of the RF model when applied to the low-degree component of $f_{d}$ . Equation (4) implies that when $N=\omega_{d}(d^{\ell})$ , the first contribution $R_{{\sf RF}}({\mathsf{P}}_{\leq\ell}f_{d},{\bm{W}})$ vanishes asymptotically.

If both Assumptions 1. $(a)$ and 1. $(b)$ hold and $\omega_{d}(d^{\ell})\leq N\leq O_{d}(d^{\ell+1-\delta})$ for some integer $\ell$ , we thus obtain

[TABLE]

In particular, this shows that RF fits a linear function over polynomials of maximum degree $\ell$ .

Remark 2.1.

Note that Theorem 1. $(a)$ holds under very weak conditions on the activation function, which may depend on the dimension $d$ . The condition $\sigma_{d}(\langle{\bm{e}}_{1},\cdot\rangle)\in L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))$ can also be rewritten as $\sigma_{d}\in L^{2}({\mathbb{R}},\tau^{1}_{d-1})$ , where $\tau^{1}_{d-1}$ is the one-dimensional projection of the uniform measure over $\mathbb{S}^{d-1}(\sqrt{d})$ . In particular:

$(i)$

$\tau^{1}_{d-1}$ is supported on $[-\sqrt{d},\sqrt{d}]$ . It is therefore sufficient that $\sup_{|u|\leq\sqrt{d}}|\sigma_{d}(u)|=C_{1}(d)<\infty$ .

$(ii)$

By an explicit calculation, $\tau^{1}_{d-1}$ has density $\tau^{1}_{d-1}({\rm d}u)=C_{2}(d)(1-u^{2}/d)^{(d-3)/2}{\rm d}u$ . Since this density is bounded, it is sufficient that $\sigma_{d}$ is square integrable with respect to the Lebesgue measure on $[-\sqrt{d},\sqrt{d}]$ .
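The density in item $(ii)$ can be checked numerically. The sketch below assumes the standard normalization $C_{2}(d)=\Gamma(d/2)/(\sqrt{\pi d}\,\Gamma((d-1)/2))$ , which is not stated in the text; the function names and the choice $d=10$ are illustrative only.

```python
import math

def tau_density(u, d):
    """Density of <x, e> at u, for x uniform on the sphere of radius sqrt(d) in R^d."""
    if u * u >= d:
        return 0.0
    # Assumed constant: C_2(d) = Gamma(d/2) / (sqrt(pi*d) * Gamma((d-1)/2)), in log space.
    log_c2 = math.lgamma(d / 2) - math.lgamma((d - 1) / 2) - 0.5 * math.log(math.pi * d)
    return math.exp(log_c2 + 0.5 * (d - 3) * math.log1p(-u * u / d))

def midpoint_integral(f, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

d = 10  # illustrative dimension
r = math.sqrt(d)
total_mass = midpoint_integral(lambda u: tau_density(u, d), -r, r)
second_moment = midpoint_integral(lambda u: u * u * tau_density(u, d), -r, r)
print(total_mass, second_moment)  # both should be close to 1
```

The second moment equals one because each coordinate of ${\bm{x}}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ has variance $\|{\bm{x}}\|_{2}^{2}/d=1$ .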

Remark 2.2.

If the activation $\sigma$ is independent of $d$ , Assumption 1. $(b)$ is satisfied as long as $\mu_{k}(\sigma)\neq 0$ for $k=0,\ldots,\ell$ , where $\mu_{k}(\sigma)$ is the $k$ -th Hermite coefficient of $\sigma$ (see Section 5 for definitions).

Remark 2.3.

The conclusion of Theorem 1. $(a)$ can be established by a somewhat simpler proof (a first version of this manuscript, posted on arXiv, assumed such conditions) if the activation function $\sigma$ is independent of $d$ and satisfies the following regularity conditions: $(i)$ $\sigma(u)^{2}\leq c_{0}\exp(c_{1}u^{2}/2)$ for some $c_{1}<1$ ; $(ii)$ $\sigma$ is not a polynomial of degree smaller than $2\ell+3$ . Under these conditions, the conclusion holds for $N=o_{d}(d^{\ell+1})$ .

Note that Assumption 1. $(b)$ requires in particular that $\sigma$ is not a polynomial of degree strictly smaller than $\ell$ . This is easily seen to be a necessary condition, since any linear combination of polynomials of degree $k<\ell$ is a polynomial of degree at most $k$ . For the same reason, this condition also arises in the approximation theory of neural networks [Pin99].

2.2 Approximation error of neural tangent models

For the NT model, the proof, while following the same scheme as for RF, is more challenging. We restrict our setting to a fixed activation function $\sigma$ (independent of dimensions) which is weakly differentiable, with weak derivative $\sigma^{\prime}$ that does not grow too fast (in particular, exponential growth is fine). We further require the Hermite decomposition of $\sigma^{\prime}$ to satisfy a mild ‘genericity’ condition. Recall that the $k$ -th Hermite coefficient of a function $h$ can be defined as $\mu_{k}(h)\equiv\mathbb{E}_{G\sim{\sf N}(0,1)}\{h(G){\rm He}_{k}(G)\}$ , where ${\rm He}_{k}(x)$ is the $k$ -th Hermite polynomial (see Section 5 for further background).

Assumption 2 (Assumptions for the NT model at level $\ell\in{\mathbb{N}}$ .).

Let $\sigma$ be an activation function $\sigma:{\mathbb{R}}\to{\mathbb{R}}$ .

(a)

The function $\sigma$ is weakly differentiable, with weak derivative $\sigma^{\prime}$ such that $\sigma^{\prime}(u)^{2}\leq c_{0}\exp(c_{1}u^{2}/2)$ for some constants $c_{0},c_{1}$ , with $c_{1}<1$ .

(b)

The Hermite coefficients $\{\mu_{k}(\sigma^{\prime})\}_{k\geq 0}$ are such that there exist $k_{1},k_{2}\geq 2\ell+7$ such that $\mu_{k_{1}}(\sigma^{\prime}),\mu_{k_{2}}(\sigma^{\prime})\neq 0$ and

[TABLE]

(c)

The Hermite coefficients of $\sigma$ satisfy $\mu_{k}(\sigma)\neq 0$ for any $k\leq\ell+1$ .

Theorem 2 (Risk of the NT model).

Let $\{f_{d}\in L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))\}_{d\geq 1}$ be a sequence of functions. Let ${\bm{W}}=({\bm{w}}_{i})_{i\in[N]}$ with $({\bm{w}}_{i})_{i\in[N]}\sim{\sf Unif}(\mathbb{S}^{d-1})$ independently. We have the following results.

(a)

Assume $N=o_{d}(d^{\ell+1})$ for a fixed integer $\ell$ , and let $\sigma$ satisfy Assumptions 2.(a) and 2.(b) at level $\ell$ . Then, for any $\varepsilon>0$ , the following holds with high probability:

[TABLE]

(b)

Assume $N=\omega_{d}(d^{\ell})$ for some integer $\ell$ , and let $\sigma$ satisfy Assumptions 2.(a) and 2. $(c)$ at level $\ell$ . Then for any $\varepsilon>0$ , the following holds with high probability:

[TABLE]

See Section 8 for the proof of the lower bound, and Section 9 for the proof of the upper bound.

Remark 2.4.

It is easy to check that Assumptions 2. $(a)$ and 2. $(b)$ hold for all $\ell$ , for all commonly used activations.

For instance the ReLU activation $\sigma(u)=\max(u,0)$ and its weak derivative $\sigma^{\prime}(x)={\bm{1}}_{x\geq 0}$ have subexponential growth. Further its Hermite coefficients are $\mu_{0}(\sigma^{\prime})=1/2$ and

[TABLE]

which satisfy the required condition of Theorem 2. $(a)$ for each $\ell$ . (In checking the condition, it might be useful to notice the relation $\mu_{k}(x^{2}\sigma^{\prime})=\mu_{k+2}(\sigma^{\prime})+(2k+1)\mu_{k}(\sigma^{\prime})+k(k-1)\mu_{k-2}(\sigma^{\prime})$ .)
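The relation quoted in parentheses follows from the polynomial identity $x^{2}{\rm He}_{k}(x)={\rm He}_{k+2}(x)+(2k+1){\rm He}_{k}(x)+k(k-1){\rm He}_{k-2}(x)$ , by taking the expectation against $\sigma^{\prime}(G)$ . A minimal sketch that checks this identity exactly on integer coefficient lists (the helper names are hypothetical, and the Hermite polynomials are built from the standard recurrence ${\rm He}_{n+1}(x)=x{\rm He}_{n}(x)-n{\rm He}_{n-1}(x)$ ):

```python
def hermite_he(k):
    """Coefficients (lowest degree first) of the probabilists' Hermite polynomial He_k."""
    prev, cur = [1], [0, 1]  # He_0 and He_1
    if k == 0:
        return prev
    for n in range(1, k):
        nxt = [0] + cur  # multiply He_n by x
        for i, c in enumerate(prev):
            nxt[i] -= n * c  # subtract n * He_{n-1}
        prev, cur = cur, nxt
    return cur

def add_scaled(target, coeffs, factor):
    """In place: target += factor * coeffs (coeffs must not be longer than target)."""
    for i, c in enumerate(coeffs):
        target[i] += factor * c

def identity_holds(k):
    """Check x^2 He_k = He_{k+2} + (2k+1) He_k + k(k-1) He_{k-2} coefficient by coefficient."""
    lhs = [0, 0] + hermite_he(k)  # multiply He_k by x^2
    rhs = list(hermite_he(k + 2))
    add_scaled(rhs, hermite_he(k), 2 * k + 1)
    if k >= 2:  # the last term has coefficient k(k-1) = 0 for k < 2
        add_scaled(rhs, hermite_he(k - 2), k * (k - 1))
    return lhs == rhs

all_ok = all(identity_holds(k) for k in range(12))
print(all_ok)
```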

Assumption 2.(c) does not hold for ReLU activation $\sigma(u)=\max(u,0)$ , since $\mu_{k}(\sigma)=0$ for odd $k\geq 3$ . However it holds for shifted ReLU $\sigma(u)=\max(u-u_{0},0)$ , for a generic value of the shift $u_{0}$ .
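Which low-order Hermite coefficients of ReLU vanish can be seen from a quick numerical computation of $\mu_{k}(\sigma)=\mathbb{E}\{\max(G,0){\rm He}_{k}(G)\}$ . This is a sketch only: the truncation of the integral at $12$ and the midpoint rule are illustrative choices, and the function names are hypothetical.

```python
import math

def hermite_he_value(k, u):
    """Evaluate He_k(u) via the recurrence He_{n+1}(u) = u He_n(u) - n He_{n-1}(u)."""
    prev, cur = 1.0, u
    if k == 0:
        return prev
    for n in range(1, k):
        prev, cur = cur, u * cur - n * prev
    return cur

def relu_hermite_coefficient(k, upper=12.0, steps=20_000):
    """mu_k(sigma) for sigma = ReLU, by the midpoint rule on [0, upper] (ReLU is 0 for u < 0)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        total += u * hermite_he_value(k, u) * math.exp(-0.5 * u * u)
    return h * total / math.sqrt(2.0 * math.pi)

mu = [relu_hermite_coefficient(k) for k in range(7)]
for k, value in enumerate(mu):
    print(k, round(value, 6))
```

The coefficients with $k=3$ and $k=5$ come out numerically zero, while those with $k\in\{0,1,2,4,6\}$ do not. This agrees with the decomposition $\max(u,0)=u/2+|u|/2$ , whose even part $|u|/2$ contributes only to even $k$ and whose odd part $u/2$ contributes only to $k=1$ .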

Theorems 1 and 2 can be illustrated by a cartoon, which we show as Figure 5. In words, the approximation error plotted as a function of $\log(\#\text{parameters})/\log d$ follows a staircase: it drops close to integer values of this ratio, with each drop corresponding to the projection onto homogeneous polynomials of that degree. We can extract three useful statistical insights from these findings:

1. There is no difference between plain RF and the more recent NT approach in terms of approximation error, once we compare them at fixed number of parameters $p$ . All that changes is the relation between number of parameters and number of neurons: $p=N$ for RF, and $p=Nd$ for NT. The recent work [GMMM19] actually shows some advantage for the RF model, although in a special case. It is worth mentioning that the same equivalence holds when we consider the dependence on the sample size $n$ , at $N=\infty$ , see Section 3.

We notice however an important computational advantage for NT, at constant number of parameters. Indeed, the complexity at prediction time is $O(Nd)=O(p)$ for NT, while it is $O(Nd)=O(pd)$ for RF.

2. RF or NT models behave similarly to expansions in an orthogonal monomial basis. Also in that case, if only $o_{d}(d^{\ell+1})$ basis elements are included, for a ‘typical’ function $f_{\star}$ , the approximation error is $\|{\mathsf{P}}_{>\ell}f_{\star}\|^{2}$ . (Here by ‘typical’ function we mean the following: choose a function $f_{0,\star}\in L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))$ , draw a Haar distributed orthogonal matrix ${\bm{S}}\in{\mathbb{R}}^{d\times d}$ , and set $f_{\star}({\bm{x}})=f_{0,\star}({\bm{S}}{\bm{x}})$ .)

3. Our results also suggest interesting directions to improve random feature expansions. First, if $f_{\star}$ is known to depend primarily on a small subset of $d_{1}\ll d$ directions in ${\mathbb{R}}^{d}$ , there will be a significant advantage in choosing the random features along that $d_{1}$ -dimensional subspace. Second, if the data points ${\bm{x}}_{i}$ lie close to such a subspace $V\subseteq{\mathbb{R}}^{d}$ , $\dim(V)=d_{1}$ , one might hope that (even if the ${\bm{w}}_{i}$ are sampled isotropically in ${\mathbb{R}}^{d}$ ) random feature methods will be sensitive to $d_{1}$ rather than $d$ . We plan to report on these topics in a future publication [GMMM20].
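The staircase picture and the parameter counts $p=N$ (RF) and $p=Nd$ (NT) can be turned into a rough rule of thumb for the polynomial degree that is fitted. The sketch below is a heuristic reading of Theorems 1 and 2, not a statement of them: the theorems are asymptotic and exclude a window around integer values of $\log p/\log d$ , and the function names are hypothetical.

```python
def largest_power_below(p, d):
    """Largest integer l with d**l <= p, using exact integer arithmetic (requires p >= 1, d >= 2)."""
    level, power = 0, d
    while power <= p:
        level += 1
        power *= d
    return level

def rf_fitted_degree(n_neurons, d):
    """Heuristic degree fitted by RF: p = N parameters."""
    return largest_power_below(n_neurons, d)

def nt_fitted_degree(n_neurons, d):
    """Heuristic degree fitted by NT: p = N * d parameters."""
    return largest_power_below(n_neurons * d, d)

d, n_neurons = 100, 10**5  # illustrative values, log N / log d = 2.5
print(rf_fitted_degree(n_neurons, d), nt_fitted_degree(n_neurons, d))
```

With $d=100$ and $N=10^{5}$ , so that $d^{2}<N<d^{3}$ , the rule returns degree $2$ for RF and degree $3$ for NT, matching the statement that NT fits one more degree than RF at equal number of neurons.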

2.3 Separation between NN and RF, NT

Theorems 1 and 2 imply a separation of approximation power between two-layers neural networks and their linearization. As a simple example, consider the target function $f_{\star}({\bm{x}})=\sigma(\langle{\bm{w}}_{\star},{\bm{x}}\rangle)$ , for $\|{\bm{w}}_{\star}\|_{2}=1$ . This can be represented exactly by a neural network with $N=1$ , i.e. by a single neuron. On the other hand, the above results imply that any RF or NT model is bound to have a non-vanishing population error, if $d^{\ell+\delta}\leq N\leq d^{\ell+1-\delta}$ . Provided $\sigma$ satisfies the Assumptions 1, 2, we get

[TABLE]

Here $\sigma_{>k}(x)$ is the projection of $\sigma$ orthogonal to the subspace of polynomials of maximum degree $k$ , in $L^{2}({\mathbb{R}},\gamma)$ , where $\gamma({\rm d}x)=e^{-x^{2}/2}{\rm d}x/\sqrt{2\pi}$ is the standard Gaussian measure.

Crucially, as proven in [MBM16], running gradient descent over the space of neural networks consisting of a single neuron allows one to learn the target function $f_{\star}({\bm{x}})=\sigma(\langle{\bm{w}}_{\star},{\bm{x}}\rangle)$ efficiently. In other words, we do not have simply a separation between the function classes ${\mathcal{F}}_{{\sf NN}}$ and ${\mathcal{F}}_{{\sf RF}}$ or ${\mathcal{F}}_{{\sf NT}}$ , but a separation between linearized neural networks, and neural networks trained by gradient descent.

Essentially the same example was independently considered by Yehudai and Shamir in concurrent work [YS19]. These authors prove that there exist finite constants $c_{0},c_{1}>0$ such that, if $N\leq\exp\{c_{1}d\}$ and the coefficients $a_{i},{\bm{a}}_{i}$ have magnitude at most $\exp\{c_{1}d\}$ , then there exists a vector ${\bm{w}}_{\star}$ such that, setting $f_{\star}({\bm{x}})=\sigma(\langle{\bm{w}}_{\star},{\bm{x}}\rangle)$ , we have $R_{{\sf RF}}(f_{\star};{\bm{W}}),R_{{\sf NT}}(f_{\star};{\bm{W}})\geq c_{0}$ . An important difference with respect to our separation result is that Eq. (10) holds, once again, pointwise, i.e. for any fixed ${\bm{w}}_{\star}$ , while in [YS19] ${\bm{w}}_{\star}$ is chosen by an adversary who has knowledge of the vectors $({\bm{w}}_{i})_{i\leq N}$ . Let us emphasize that there are other important differences between our setting and the one of [YS19], and neither of the two analyses implies the other.

The same blueprint can be followed to prove further separation results. For instance, consider $f_{\star}({\bm{x}})=\varphi({\bm{Q}}^{{\mathsf{T}}}{\bm{x}})$ , for ${\bm{Q}}\in{\mathbb{R}}^{d\times r}$ an orthogonal matrix and $\varphi:{\mathbb{R}}^{r}\to{\mathbb{R}}$ a bounded smooth function, which is not a polynomial. If $r$ is kept constant as $d^{\ell+\delta}\leq N\leq d^{\ell+1-\delta}$ , Theorems 1 and 2 can be used to show that $R_{{\sf RF}}(f_{\star};{\bm{W}}),R_{{\sf NT}}(f_{\star};{\bm{W}})$ are bounded away from zero and to compute their limits. On the other hand, classical results [Mai99] can be used to show that such $f_{\star}({\bm{x}})$ can be approximated arbitrarily well by neural networks with $O_{d}(1)$ neurons (with first layer weights ${\bm{w}}_{i}$ in the span of columns of ${\bm{Q}}$ ). Unfortunately, we are not aware of general results implying that such neural networks can be learnt by gradient descent, although we expect this to be the case for certain choices of $\varphi$ . Whenever such a result is available, it implies a separation between RF, NT, and practical neural networks.

3 Generalization error of kernel methods

We consider next the limit of very wide networks. Namely, we let $N\to\infty$ before $n,d\to\infty$ . It is known since the work of Rahimi and Recht [RR08] that ridge regression over the function class ${\mathcal{F}}_{{\sf RF}}({\bm{W}})$ converges in this limit to kernel ridge regression (KRR) with respect to the kernel (here expectation is with respect to ${\bm{w}}\sim{\sf Unif}(\mathbb{S}^{d-1}(1))$ )

[TABLE]

Analogously, ridge regression in ${\mathcal{F}}_{{\sf NT}}({\bm{W}})$ can be shown to converge to KRR with respect to the kernel

[TABLE]

We will denote the corresponding RKHS by ${\mathcal{H}}_{{\sf RF}}$ and ${\mathcal{H}}_{{\sf NT}}$ . Quantitative estimates on the relation between ${\mathcal{F}}_{{\sf RF}}({\bm{W}})$ and ${\mathcal{H}}_{{\sf RF}}$ are obtained in [Bac17b], which shows that the unit ball of ${\mathcal{H}}_{{\sf RF}}$ is well approximated by the unit ball of ${\mathcal{F}}_{{\sf RF}}({\bm{W}})$ (endowed with the $\ell_{2}$ norm of the coefficients $(a_{i})_{i\leq N}$ ), for $N$ large enough.
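The convergence of random features to a kernel as $N\to\infty$ can be illustrated by a Monte Carlo estimate. The sketch assumes the standard random features form $H^{{\sf RF}}_{d}({\bm{x}}_{1},{\bm{x}}_{2})=\mathbb{E}_{{\bm{w}}}\{\sigma(\langle{\bm{w}},{\bm{x}}_{1}\rangle)\sigma(\langle{\bm{w}},{\bm{x}}_{2}\rangle)\}$ with ${\bm{w}}\sim{\sf Unif}(\mathbb{S}^{d-1}(1))$ , and uses ReLU as an illustrative activation; all function names, the dimension, and the seed are hypothetical choices.

```python
import math
import random

def uniform_on_sphere(d, radius, rng):
    """Sample a point uniformly on the sphere of the given radius by normalizing a Gaussian."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [radius * v / norm for v in g]

def relu(u):
    return max(u, 0.0)

def rf_kernel_estimate(x1, x2, n_features, rng, sigma=relu):
    """Average of sigma(<w, x1>) * sigma(<w, x2>) over n_features draws of w on the unit sphere."""
    d = len(x1)
    total = 0.0
    for _ in range(n_features):
        w = uniform_on_sphere(d, 1.0, rng)
        s1 = sum(a * b for a, b in zip(w, x1))
        s2 = sum(a * b for a, b in zip(w, x2))
        total += sigma(s1) * sigma(s2)
    return total / n_features

rng = random.Random(0)  # fixed seed for reproducibility
d = 20
x = uniform_on_sphere(d, math.sqrt(d), rng)
diag_estimate = rf_kernel_estimate(x, x, 20_000, rng)
print(diag_estimate)  # should be near 1/2 for ReLU
```

For ReLU and $\|{\bm{x}}\|_{2}=\sqrt{d}$ , the diagonal value is exactly $\mathbb{E}\{\max(\sqrt{d}\,w_{1},0)^{2}\}=1/2$ by symmetry, which gives a reference point for the estimate.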

Notice that both kernels $H^{{\sf RF}}_{d}$ , $H^{{\sf NT}}_{d}$ are rotationally invariant, namely $H_{d}({\bm{S}}{\bm{x}}_{1},{\bm{S}}{\bm{x}}_{2})=H_{d}({\bm{x}}_{1},{\bm{x}}_{2})$ for $H_{d}\in\{H^{{\sf RF}}_{d},H^{{\sf NT}}_{d}\}$ and any $d\times d$ orthogonal matrix ${\bm{S}}$ . Any rotationally invariant kernel on the sphere $\mathbb{S}^{d-1}(\sqrt{d})$ takes the form

[TABLE]

for some function $h_{d}:[-1,1]\to{\mathbb{R}}$ . (The scaling factor $d$ is introduced here to make contact with the normalization used in previous sections, and is not necessary: indeed, $h_{d}$ can depend itself on $d$ .)

Our results apply to general rotationally invariant kernels under very weak conditions on the function $h_{d}$ . In particular, they apply to multilayer neural networks in the neural tangent regime. Namely, consider an $L$ -layer network with weight matrices ${\bm{W}}_{1}\in{\mathbb{R}}^{N_{1}\times d}$ , ${\bm{W}}_{2}\in{\mathbb{R}}^{N_{2}\times N_{1}}$ , … ${\bm{W}}_{L-1}\in{\mathbb{R}}^{N_{L-1}\times N_{L-2}}$ , ${\bm{a}}\in{\mathbb{R}}^{N_{L-1}}$ . As long as all the weights are initialized as independent centered Gaussians, with variance dependent only on the layer, the resulting NT kernel is rotationally invariant. The recent papers [DZPS18, DLL+18, AZLS18, ZCZG18, ADH+19] provide conditions under which the NT approximation is accurate for SGD-trained multilayer neural networks.

Section 3.1 presents a lower bound on the prediction error of general kernel methods, and Section 3.2 derives an upper bound for kernel ridge regression.

Throughout this section, we consider the same data model as in the previous sections: we observe pairs $(y_{i},{\bm{x}}_{i})_{i\in[n]}$ , with $({\bm{x}}_{i})_{i\in[n]}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ , and $y_{i}=f_{\star}({\bm{x}}_{i})+\varepsilon_{i}$ , $f_{\star}\in L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))$ and $\varepsilon_{i}\sim{\sf N}(0,\tau^{2})$ independently.

3.1 Lower bound for general kernel methods

Consider any regression method of the form

[TABLE]

where $\|f\|_{H}$ is the reproducing kernel Hilbert space (RKHS) norm with respect to the kernel $H$ of the form (13). By the representer theorem [BTA11] there exist coefficients $\hat{a}_{1},\dots,\hat{a}_{n}$ such that

[TABLE]

We are therefore led to define the following data-dependent prediction risk function for kernel methods

[TABLE]

The next theorem provides a decomposition of this generalization error that is analogous to the one given in Theorem 1. $(a)$ . Notice however that the controlling factor is not the number of neurons $N$ , but instead the sample size $n$ .

Theorem 3.

Assume $n\leq d^{\ell+1-\delta_{d}}$ for a fixed integer $\ell$ and any sequence $\delta_{d}$ such that $\delta_{d}^{2}\log d\to\infty$ (in particular, $n\leq d^{\ell+1-\delta}$ is sufficient for any fixed $\delta>0$ ). Let $\{f_{d}\in L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))\}_{d\geq 1}$ be a sequence of functions, $\{{\bm{x}}_{i}\}_{i\in[n]}\sim_{iid}{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ with $y_{i}=f_{d}({\bm{x}}_{i})$ . Assume $h_{d}(\langle{\bm{e}}_{1},\,\cdot\,\rangle/\sqrt{d})\in L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))$ . Then for any $\varepsilon>0$ , with high probability as $d\to\infty$ , we have

[TABLE]

Proof.

This follows immediately from Theorem 1. $(a)$ . Indeed, setting $\sigma_{d}(u)=h_{d}(u/\sqrt{d})$ and ${\bm{w}}_{i}={\bm{x}}_{i}/\sqrt{d}$ , we obtain $R_{H}(f_{d},{\bm{X}})=R_{{\sf RF}}(f_{d},{\bm{W}})$ , whence the claim follows by applying Eq. (3). ∎
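The substitution used in the proof can be spelled out in one line. This is a sketch of the intermediate step, assuming the kernel has the inner-product form $H_{d}({\bm{x}}_{1},{\bm{x}}_{2})=h_{d}(\langle{\bm{x}}_{1},{\bm{x}}_{2}\rangle/d)$ used throughout this section:

```latex
\sigma_{d}(\langle{\bm{w}}_{i},{\bm{x}}\rangle)
  = h_{d}\!\left(\frac{\langle{\bm{x}}_{i}/\sqrt{d},\,{\bm{x}}\rangle}{\sqrt{d}}\right)
  = h_{d}\!\left(\frac{\langle{\bm{x}}_{i},{\bm{x}}\rangle}{d}\right)
  = H_{d}({\bm{x}}_{i},{\bm{x}}),
\qquad
\|{\bm{w}}_{i}\|_{2}=\frac{\|{\bm{x}}_{i}\|_{2}}{\sqrt{d}}=1 .
```

Hence any predictor of the form $\sum_{i\leq n}a_{i}H_{d}({\bm{x}}_{i},{\bm{x}})$ is an RF predictor with $N=n$ neurons whose weights ${\bm{w}}_{i}$ are i.i.d. uniform on the unit sphere, which is the setting of Theorem 1. $(a)$ .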

3.2 Upper bound for kernel ridge regression

Kernel ridge regression is one specific way of selecting the coefficients $\hat{\bm{a}}$ in Eq. (15), namely by using $\ell(\hat{y},y)=(\hat{y}-y)^{2}$ in Eq. (14). Solving for the coefficients yields

[TABLE]

where the kernel matrix ${\bm{H}}=(H_{ij})_{ij\in[n]}$ is given by

[TABLE]

and ${\bm{y}}=(y_{1},\ldots,y_{n})^{\mathsf{T}}$ . The prediction function at location ${\bm{x}}$ is given by

[TABLE]

where

[TABLE]
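The fitting and prediction steps above can be sketched in a few lines. The sketch assumes the standard closed form $\hat{\bm{a}}=({\bm{H}}+\lambda{\mathbf{I}}_{n})^{-1}{\bm{y}}$ and the predictor $\hat{f}_{\lambda}({\bm{x}})=\sum_{i}\hat{a}_{i}H_{d}({\bm{x}}_{i},{\bm{x}})$ ; the kernel $h(t)=e^{t}$ , the toy target, the sizes $d=5$ , $n=30$ , and all function names are illustrative assumptions, and the scaling of $\lambda$ may differ from the one used in the paper.

```python
import math
import random

def uniform_on_sphere(d, radius, rng):
    """Sample uniformly on the sphere of the given radius by normalizing a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [radius * v / norm for v in g]

def kernel(x1, x2):
    """Hypothetical inner-product kernel h(<x1, x2>/d) with h(t) = exp(t)."""
    return math.exp(sum(a * b for a, b in zip(x1, x2)) / len(x1))

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def krr_fit(xs, ys, lam):
    """Coefficients a = (H + lam * I)^{-1} y."""
    n = len(xs)
    gram = [[kernel(xs[i], xs[j]) + (lam if i == j else 0.0) for j in range(n)]
            for i in range(n)]
    return solve(gram, ys)

def krr_predict(xs, coeffs, x):
    """Prediction sum_i a_i H(x_i, x)."""
    return sum(a * kernel(xi, x) for a, xi in zip(coeffs, xs))

def train_mse(xs, ys, lam):
    coeffs = krr_fit(xs, ys, lam)
    return sum((krr_predict(xs, coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
d, n = 5, 30
xs = [uniform_on_sphere(d, math.sqrt(d), rng) for _ in range(n)]
ys = [x[0] + 0.5 * x[0] * x[1] + 0.1 * rng.gauss(0.0, 1.0) for x in xs]  # toy target plus noise
mse_small_lambda = train_mse(xs, ys, 1e-6)
mse_large_lambda = train_mse(xs, ys, 1.0)
print(mse_small_lambda, mse_large_lambda)
```

The training residual equals $\lambda({\bm{H}}+\lambda{\mathbf{I}}_{n})^{-1}{\bm{y}}$ , so its norm shrinks as $\lambda$ decreases and the predictor approaches an interpolator as $\lambda\to 0$ .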

The test error of empirical kernel ridge regression is defined as

[TABLE]

We assume that $\{h_{d}\}_{d\geq 1}$ are positive-definite kernels, and we consider the associated eigenvalues:

[TABLE]

where we recall that $Q_{k}^{(d)}$ is the $k$ -th Gegenbauer polynomial.

Assumption 3 (Assumption for KRR at level $\ell\in\mathbb{N}$ ).

Let $\{h_{d}\}_{d\geq 1}$ be a sequence of functions $h_{d}:{\mathbb{R}}\to{\mathbb{R}}$ , such that $H_{d}({\bm{x}}_{1},{\bm{x}}_{2})=h_{d}(\langle{\bm{x}}_{1},{\bm{x}}_{2}\rangle/d)$ is a positive semidefinite kernel.

(a)

$h_{d}(\cdot/\sqrt{d})\in L^{2}([-\sqrt{d},\sqrt{d}],\tau^{1}_{d-1})$ , where $\tau^{1}_{d-1}$ is the distribution of $\langle{\bm{x}},{\bm{e}}\rangle$ for ${\bm{x}}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ , and ${\bm{e}}=(1,0,\ldots,0)^{\mathsf{T}}\in\mathbb{R}^{d}$ .

(b)

There exists a constant $c_{\ell}>0$ such that

[TABLE]

Theorem 4.

Assume $\omega_{d}(d^{\ell}\log d)\leq n\leq O_{d}(d^{\ell+1-\delta})$ for some integer $\ell$ and $\delta>0$ . Let $\{f_{d}\in L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))\}_{d\geq 1}$ be a sequence of functions. Let $\{h_{d}\}_{d\geq 1}$ be a sequence of kernels satisfying Assumption 3 at level $\ell$ . Further define

[TABLE]

If $h_{d}$ has zero mean (i.e. $\int h_{d}(\sqrt{d}\langle{\bm{e}}_{1},{\bm{x}}\rangle)\tau_{d}({\rm d}{\bm{x}})=0$ ) further assume that $f_{d}$ is centered (i.e. $\int f_{d}({\bm{x}})\tau_{d}({\rm d}{\bm{x}})=0$ ).

Let ${\bm{X}}=({\bm{x}}_{i})_{i\in[n]}$ with $({\bm{x}}_{i})_{i\in[n]}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ independently, and $y_{i}=f_{d}({\bm{x}}_{i})+\varepsilon_{i}$ with $\varepsilon_{i}\sim_{iid}{\sf N}(0,\tau^{2})$ . Then for any $\varepsilon>0$ , and any regularization parameter $\lambda\in(0,\lambda_{*})$ , with high probability we have

[TABLE]

See Section 10 for the proof of this theorem.

Remark 3.1.

Assume $h_{d}\to h$ as $d\to\infty$ , uniformly over $[-\delta,\delta]$ , together with its derivatives, and further assume $|h_{d}(x)|\leq c_{0}\exp(c_{1}x^{2}/2)$ for some $c_{0}>0$ , $c_{1}<1$ . We expect this to be the case for many kernels of interest, and in particular it can be shown to be the case for $h_{d}^{{\sf RF}}$ and $h_{d}^{{\sf NT}}$ under mild conditions on the activation $\sigma$ . Using Rodrigues’ formula described in Section 5.2, by an application of integration by parts followed by dominated convergence, we get

[TABLE]

where $h^{(k)}$ is the $k$ -th derivative of $h$ . Notice further that $\xi_{d,k}(h_{d})\geq 0$ for all $k$ since $h_{d}$ is positive semidefinite by definition. Therefore, as long as $h^{(k)}(0)>0$ for all $k\leq\ell$ , Assumption 3 is satisfied, and $\lambda_{*}(d,\ell)$ is bounded away from zero.

Remark 3.2.

For $h_{d}=h_{d}^{{\sf RF}}$ and if the activation $\sigma\in L^{2}({\mathbb{R}},\gamma)$ is independent of $d$ , we have $\xi_{d,k}(h_{d})=\mu_{k}(\sigma)^{2}d^{-k}+o_{d}(d^{-k-1})$ , and therefore Assumption 3 is satisfied as soon as $\mu_{k}(\sigma)\neq 0$ for all $k\leq\ell$ .

Notice that the setting of Theorem 4 is the same as in classical nonparametric regression. However, classical theory typically establishes minimax consistency rates of the form $\mathbb{E}\{[\hat{f}({\bm{x}})-f_{\star}({\bm{x}})]^{2}\}\leq C(d)\,n^{-2\beta/(2\beta+d)}$ [Tsy08, GKKW06]. In order to guarantee a fixed (small) error, these bounds require $n\geq\exp\{c\,d\}$ . Modern machine learning applications typically have $d\geq 100$ and $n$ between $10^{4}$ and $10^{8}$ , and it is therefore unrealistic to consider $n$ exponential in $d$ . This regime motivates a new type of question: assuming $n\asymp d^{\alpha}$ , what is the minimum prediction error that can be achieved? This question is addressed by Theorem 4.
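The exponential sample size requirement can be made concrete by solving $n^{-2\beta/(2\beta+d)}=\epsilon$ for $n$ , which gives $n=\epsilon^{-(2\beta+d)/(2\beta)}$ . A minimal sketch, ignoring the constant $C(d)$ and using illustrative values of $d$ , $\beta$ and $\epsilon$ (the function name is hypothetical):

```python
import math

def log10_samples_needed(d, beta, eps):
    """log10 of the n solving n**(-2*beta / (2*beta + d)) = eps; the constant C(d) is ignored."""
    return (2 * beta + d) / (2 * beta) * math.log10(1.0 / eps)

exponent = log10_samples_needed(d=100, beta=2, eps=0.1)
print(exponent)  # n would need to be about 10**exponent
```

With $d=100$ , $\beta=2$ and $\epsilon=0.1$ this gives $n\approx 10^{26}$ , far beyond the range $10^{4}$ to $10^{8}$ mentioned above.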

3.3 Separation between kernel methods and neural networks

Repeating the same argument of Section 2.3, we see that Theorems 3 and 4 imply a separation between kernel methods, with rotationally invariant kernels, and gradient-descent trained neural networks.

Namely, consider again the target function $f_{\star}({\bm{x}})=\sigma(\langle{\bm{w}}_{\star},{\bm{x}}\rangle)$ , for $\|{\bm{w}}_{\star}\|_{2}=1$ . As proven in [MBM16], $f_{\star}$ can be learnt efficiently by minimizing the following empirical risk via gradient descent:

[TABLE]

Namely, if $n\geq C\,d\log d$ samples are used (and under some technical conditions on $\sigma$ ), gradient descent reaches prediction error of order $(d\log d)/n$ .

In contrast, Theorems 3 and 4 imply that, for any integer $\ell$ , and any $d^{\ell+\delta}\leq n\leq d^{\ell+1-\delta}$ , any kernel method has test error bounded away from zero. Namely

[TABLE]

This test error is achieved by kernel ridge regression.

3.4 Near-optimality of interpolators

Let us emphasize some important statistical aspects of Theorem 4. KRR is proved to achieve near optimal prediction error (matching the lower bound of Theorem 3) pointwise, i.e. per given function $f_{d}$ . What is the nature of the predictor $\hat{f}_{\lambda}$ ? Theorems 3 and 4 imply that, in $\ell_{2}$ sense, $\hat{f}_{\lambda}$ must be close to a low-degree approximation of $f_{d}$ , namely ${\mathsf{P}}_{\leq\ell}f_{d}$ .

Optimal test error is achieved for any $\lambda<\lambda_{*}$ . In particular, by taking $\lambda\to 0$ , we obtain an interpolator, i.e. a predictor that interpolates the data $(y_{i},{\bm{x}}_{i})$ . This remark is made quantitative in the following bound on the empirical risk

[TABLE]

Theorem 5.

Assume $\omega_{d}(d^{\ell}\log d)\leq n\leq O_{d}(d^{\ell+1-\delta})$ for some integer $\ell$ and $\delta>0$ . Under the same assumptions of Theorem 4, if $\lambda<\lambda_{*}$ , then

[TABLE]

where $\kappa_{h}=\sum_{k\geq\ell+1}\xi_{d,k}(h_{d})B(d,k)$ .

Proof of Theorem 5.

Recall that the empirical risk of KRR is given by Eq. (24), where $\hat{\bm{f}}_{\lambda}=(\hat{f}_{\lambda}({\bm{x}}_{1}),\ldots,\hat{f}_{\lambda}({\bm{x}}_{n}))$ can be rewritten as

[TABLE]

Therefore,

[TABLE]

From the proof of Theorem 4, we have the following lower bound on the eigenvalues: ${\bm{H}}+\lambda{\mathbf{I}}_{n}\succeq(\kappa_{h}+\lambda+o_{d,\mathbb{P}}(1)){\mathbf{I}}_{n}$ . We deduce that with high probability

[TABLE]

where we simply used the law of large numbers $\|{\bm{y}}\|_{2}^{2}/n\to\|f_{d}\|_{L^{2}}^{2}+\tau^{2}$ . ∎

3.5 A conjecture for generalization error of random features model

Consider random features regression with finite sample size and a finite number of neurons. We fit data $\{(y_{i},{\bm{x}}_{i})\}_{i\leq n}$ using ridge regression in the random features ( ${\sf RF}$ ) model, with (where ${\bm{w}}_{i}\sim_{iid}{\sf Unif}(\mathbb{S}^{d-1}(1))$ )

[TABLE]

Under the same data model of the previous sections, we are interested in the test prediction error

[TABLE]

Theorem 1 characterized the test error $R_{\sf RF}(f_{d},{\bm{X}},{\bm{W}},\lambda)$ in the population limit $n=\infty$ , whereas Theorems 3 and 4 characterize the same quantity in the case when $N=\infty$ .

What happens when both $n$ and $N$ are finite? In the proportional regime $N\propto d$ and $n\propto d$ , the precise asymptotics of $R_{\sf RF}(f_{d},{\bm{X}},{\bm{W}},\lambda)$ was calculated in [MM19].

What happens beyond the proportional asymptotics? We conjecture that the limiting factor is given by the smallest of $n$ and $N$ . Namely, if $d^{\ell+\delta}\leq\min(n,N)\leq d^{\ell+1-\delta}$ for some positive $\delta$ , then the prediction error is the same as the one of fitting a degree- $\ell$ polynomial, i.e. $R_{\sf RF}(f_{d},{\bm{X}},{\bm{W}},\lambda)=\|{\mathsf{P}}_{>\ell}f_{d}\|_{L^{2}}^{2}+\|f_{d}\|_{L^{2}}^{2}\cdot o_{d,\mathbb{P}}(1)$ . We leave this conjecture to future work.

4 Further related work

Donoho and Johnstone [DJ89] study an approximation problem analogous to the one we considered in Section 2, although in $d=2$ dimensions. Their problem essentially reduces to determining rates of approximation on the unit circle, with the technical difference that the ${\bm{w}}_{i}$ ’s are equi-spaced along the circle instead of being random. As for other references mentioned in Section 1.2, the lower bounds of [DJ89] are worst case over differentiable functions.

The limitations of kernel methods in high dimension are studied by El Karoui in [EK10b] (see also [EK10a]), which analyzes kernel random matrices of the form ${\bm{H}}=(h(\langle{\bm{x}}_{i},{\bm{x}}_{j}\rangle/d))_{i,j\leq n}$ . The analysis of [EK10b] is limited to the proportional asymptotics $n\propto d$ , and establishes that in this regime ${\bm{H}}$ is well approximated by the Gram matrix of raw feature vectors plus a diagonal term: ${\bm{H}}\approx(h(1)-h^{\prime}(0)){\mathbf{I}}_{n}+h^{\prime}(0){\bm{G}}$ , where ${\bm{G}}=(\langle{\bm{x}}_{i},{\bm{x}}_{j}\rangle/d)_{i,j\leq n}$ . This result is related to our Theorems 3 and 4, which deal with kernel methods. However our results analyze general polynomial scalings $n=O_{d}(d^{\ell+1-\delta})$ , while [EK10b] assumes $n=\Theta_{d}(d)$ . Also [EK10b] analyzes the spectrum of ${\bm{H}}$ but not the prediction error of kernel methods. Finally, a large part of our technical work is devoted to RF and NT models, cf. Theorems 1 and 2, which are not touched upon by [EK10b].

Recent work of Vempala and Wilmes [VW18] analyzes what amounts to an RF model. These authors prove that RF can learn a degree- $\ell$ polynomial from $n=d^{O(\ell)}$ samples using $N=d^{O(\ell)}$ neurons, and that at least $d^{\Omega(\ell)}$ queries are needed within the statistical query model. While related, our setting is not directly comparable to theirs. Notice further that we obtain a sharper tradeoff, since we obtain the precise exponents of $d$ .

After the present paper appeared as a preprint, several authors presented important contributions to the same line of work. In particular, Liang, Rakhlin, and Zhai [LRZ19] study kernel ridge regression in $d$ dimensions using $n=O_{d}(d^{\gamma})$ samples. Assuming the target function has bounded RKHS norm, they derive upper and lower bounds on the rate of convergence of the generalization error. This result is related to our Theorem 3. The most important difference is that we do not assume that the target function has bounded RKHS norm. Instead we obtain the precise asymptotics of the generalization error in a regime in which it is non-vanishing. As illustrated in Section 1.3, this asymptotic analysis indeed captures the actual behavior in practically reasonable settings.

From a technical viewpoint, several of our calculations make use of harmonic analysis over the $d$ -dimensional sphere, as is natural given that the ${\bm{x}}_{i}$ ’s are uniform over the sphere. Spherical harmonics expansions appear in related contexts, e.g. in [DJ89, Bac17a, VW18].

Let us finally mention that an alternative approach to the analysis of two-layers neural networks in the wide limit was developed in [MMN18, RVE18, SS18, CB18, MMM19] using mean field theory. Unlike in the neural tangent approach, the evolution of network weights is described beyond the linear regime in this theory.

5 Technical background

In this section we introduce some notation and technical background which will be useful for the proofs in the next sections. In particular, we will use decompositions in (hyper-)spherical harmonics on the sphere $\mathbb{S}^{d-1}(\sqrt{d})$ and in orthogonal polynomials on the real line. All of the properties listed below are classical; we will however prove a few facts that are slightly less standard. We refer the reader to [EF14, Sze39, Chi11] for further information on these topics. As mentioned above, expansions in spherical harmonics were used in the past in the statistics literature, for instance in [DJ89, Bac17a].

5.1 Functional spaces over the sphere

For $d\geq 1$ , we let $\mathbb{S}^{d-1}(r)=\{{\bm{x}}\in\mathbb{R}^{d}:\|{\bm{x}}\|_{2}=r\}$ denote the sphere with radius $r$ in ${\mathbb{R}}^{d}$ . We will mostly work with the sphere of radius $\sqrt{d}$ , $\mathbb{S}^{d-1}(\sqrt{d})$ and will denote by $\tau_{d-1}$ the uniform probability measure on $\mathbb{S}^{d-1}(\sqrt{d})$ . All functions in the following are assumed to be elements of $L^{2}(\mathbb{S}^{d-1}(\sqrt{d}),\tau_{d-1})$ , with scalar product and norm denoted as $\langle\,\cdot\,,\,\cdot\,\rangle_{L^{2}}$ and $\|\,\cdot\,\|_{L^{2}}$ :

[TABLE]

For $\ell\in{\mathbb{Z}}_{\geq 0}$ , let $\tilde{V}_{d,\ell}$ be the space of homogeneous harmonic polynomials of degree $\ell$ on ${\mathbb{R}}^{d}$ (i.e. homogeneous polynomials $q({\bm{x}})$ satisfying $\Delta q({\bm{x}})=0$ ), and denote by $V_{d,\ell}$ the linear space of functions obtained by restricting the polynomials in $\tilde{V}_{d,\ell}$ to $\mathbb{S}^{d-1}(\sqrt{d})$ . With these definitions, we have the following orthogonal decomposition

[TABLE]

The dimension of each subspace is given by

[TABLE]
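The growth of $B(d,\ell)$ with $d$ is what drives the thresholds $d^{\ell}$ appearing in the theorems. The sketch below uses a standard closed form for the dimension of the space of degree- $\ell$ spherical harmonics, $B(d,\ell)=\binom{d+\ell-1}{\ell}-\binom{d+\ell-3}{\ell-2}$ , which is an assumption here in the sense that it may be written differently in the display above; the function name is hypothetical.

```python
import math

def harmonic_dimension(d, ell):
    """B(d, ell) = C(d+ell-1, ell) - C(d+ell-3, ell-2), for d >= 2 and ell >= 0."""
    total = math.comb(d + ell - 1, ell)
    if ell >= 2:  # the subtracted binomial is zero for ell < 2
        total -= math.comb(d + ell - 3, ell - 2)
    return total

# Sanity checks against familiar low-dimensional cases.
print(harmonic_dimension(3, 4))   # 2*ell + 1 harmonics on the 2-sphere
print(harmonic_dimension(2, 5))   # cos and sin of each frequency on the circle

# For fixed ell and large d, B(d, ell) is close to d**ell / ell!.
d, ell = 1000, 3
ratio = harmonic_dimension(d, ell) * math.factorial(ell) / d**ell
print(ratio)
```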

For each $\ell\in{\mathbb{Z}}_{\geq 0}$ , the spherical harmonics $\{Y_{\ell,j}^{(d)}\}_{1\leq j\leq B(d,\ell)}$ form an orthonormal basis of $V_{d,\ell}$ :

[TABLE]

Note that our convention is different from the more standard one, that defines the spherical harmonics as functions on $\mathbb{S}^{d-1}(1)$ . It is immediate to pass from one convention to the other by a simple scaling. We will drop the superscript $d$ and write $Y_{\ell,j}=Y_{\ell,j}^{(d)}$ whenever clear from the context.

We denote by ${\mathsf{P}}_{k}$ the orthogonal projections to $V_{d,k}$ in $L^{2}(\mathbb{S}^{d-1}(\sqrt{d}),\tau_{d-1})$ . This can be written in terms of spherical harmonics as

[TABLE]

We also define ${\mathsf{P}}_{\leq\ell}\equiv\sum_{k=0}^{\ell}{\mathsf{P}}_{k}$ , ${\mathsf{P}}_{>\ell}\equiv{\mathbf{I}}-{\mathsf{P}}_{\leq\ell}=\sum_{k=\ell+1}^{\infty}{\mathsf{P}}_{k}$ , and ${\mathsf{P}}_{<\ell}\equiv{\mathsf{P}}_{\leq\ell-1}$ , ${\mathsf{P}}_{\geq\ell}\equiv{\mathsf{P}}_{>\ell-1}$ .

5.2 Gegenbauer polynomials

The $\ell$ -th Gegenbauer polynomial $Q_{\ell}^{(d)}$ is a polynomial of degree $\ell$ . Consistently with our convention for spherical harmonics, we view $Q_{\ell}^{(d)}$ as a function $Q_{\ell}^{(d)}:[-d,d]\to{\mathbb{R}}$ . The set $\{Q_{\ell}^{(d)}\}_{\ell\geq 0}$ forms an orthogonal basis on $L^{2}([-d,d],\tilde{\tau}^{1}_{d-1})$ , where $\tilde{\tau}^{1}_{d-1}$ is the distribution of $\sqrt{d}\langle{\bm{x}},{\bm{e}}_{1}\rangle$ when ${\bm{x}}\sim\tau_{d-1}$ , satisfying the normalization condition:

[TABLE]

In particular, these polynomials are normalized so that $Q_{\ell}^{(d)}(d)=1$ . As above, we will omit the superscript $d$ when clear from the context.

Gegenbauer polynomials are directly related to spherical harmonics as follows. Fix ${\bm{v}}\in\mathbb{S}^{d-1}(\sqrt{d})$ and consider the subspace of $V_{\ell}$ formed by all functions that are invariant under rotations in ${\mathbb{R}}^{d}$ that keep ${\bm{v}}$ unchanged. It is not hard to see that this subspace has dimension one, and coincides with the span of the function $Q_{\ell}^{(d)}(\langle{\bm{v}},\,\cdot\,\rangle)$ .

We will use the following properties of Gegenbauer polynomials

1. For ${\bm{x}},{\bm{y}}\in\mathbb{S}^{d-1}(\sqrt{d})$

[TABLE]

2. For ${\bm{x}},{\bm{y}}\in\mathbb{S}^{d-1}(\sqrt{d})$

[TABLE]

3. Recurrence formula

[TABLE]

4. Rodrigues’ formula

[TABLE]

Note in particular that property 2 implies that, up to a constant, $Q_{k}^{(d)}(\langle{\bm{x}},{\bm{y}}\rangle)$ is a representation of the projector onto the subspace of degree-$k$ spherical harmonics

[TABLE]
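The normalization and orthogonality of the Gegenbauer polynomials can be checked exactly in small cases. The Python sketch below is illustrative only and rests on stated assumptions: it uses the standard three-term recurrence for Gegenbauer polynomials normalized to equal one at the right endpoint, the standard closed form for $B(d,k)$, and an odd dimension $d$, so that the density of $u=\langle{\bm{x}},{\bm{e}}_{1}\rangle/\sqrt{d}$, proportional to $(1-u^{2})^{(d-3)/2}$ on $[-1,1]$, is a polynomial and all integrals are exact rationals. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def harmonic_dim(d, k):
    """B(d, k): dimension of the degree-k spherical harmonics on the sphere in R^d."""
    return comb(d + k - 1, k) - (comb(d + k - 3, k - 2) if k >= 2 else 0)

def gegenbauer_coeffs(d, kmax):
    """Coefficient lists (lowest degree first) of P_k(u) = Q_k^{(d)}(d * u), k <= kmax."""
    polys = [[Fraction(1)], [Fraction(0), Fraction(1)]]
    for k in range(1, kmax):
        up = [Fraction(0)] + polys[k]              # u * P_k
        prev = polys[k - 1] + [Fraction(0)] * 2    # P_{k-1}, padded to the same length
        # (k + d - 2) P_{k+1} = (2k + d - 2) u P_k - k P_{k-1}
        polys.append([((2 * k + d - 2) * a - k * b) / (k + d - 2)
                      for a, b in zip(up, prev)])
    return polys[: kmax + 1]

def moment(n, m):
    """Exact integral of u^n (1 - u^2)^m over [-1, 1]."""
    if n % 2:
        return Fraction(0)
    return sum(Fraction((-1) ** j * comb(m, j) * 2, n + 2 * j + 1)
               for j in range(m + 1))

def inner(p, q, d):
    """E[p(u) q(u)] for u the normalized first coordinate of a uniform point (d odd)."""
    m = (d - 3) // 2
    total = sum(a * b * moment(i + j, m)
                for i, a in enumerate(p) for j, b in enumerate(q))
    return total / moment(0, m)

if __name__ == "__main__":
    d = 5
    polys = gegenbauer_coeffs(d, 3)
    for k, p in enumerate(polys):
        print(k, inner(p, p, d), Fraction(1, harmonic_dim(d, k)))
```

Under these assumptions, distinct polynomials have zero inner product and the squared norm of the $k$-th polynomial equals $1/B(d,k)$, in agreement with property 1 evaluated at ${\bm{x}}={\bm{y}}$.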

For a function $\sigma\in L^{2}([-\sqrt{d},\sqrt{d}],\tau^{1}_{d-1})$ (where $\tau^{1}_{d-1}$ is the distribution of $\langle{\bm{x}}_{1},{\bm{x}}_{2}\rangle/\sqrt{d}$ when ${\bm{x}}_{1},{\bm{x}}_{2}\sim_{iid}{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ ), we define its spherical harmonics coefficients $\lambda_{d,k}(\sigma)$ by

[TABLE]

Then the following identity holds in the $L^{2}([-\sqrt{d},\sqrt{d}],\tau^{1}_{d-1})$ sense:

[TABLE]

To any rotationally invariant kernel $H_{d}({\bm{x}}_{1},{\bm{x}}_{2})=h_{d}(\langle{\bm{x}}_{1},{\bm{x}}_{2}\rangle/d)$ , with $h_{d}(\sqrt{d}\,\cdot\,)\in L^{2}([-\sqrt{d},\sqrt{d}],\tau^{1}_{d-1})$ , we can associate a self-adjoint operator $\mathscrsfs{H}_{d}:L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))\to L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))$ via

[TABLE]

By rotational invariance, the space $V_{k}$ of degree- $k$ spherical harmonics (restrictions of homogeneous harmonic polynomials of degree $k$ ) is an eigenspace of $\mathscrsfs{H}_{d}$ , and we will denote the corresponding eigenvalue by $\xi_{d,k}(h_{d})$ . In other words $\mathscrsfs{H}_{d}f=\sum_{k=0}^{\infty}\xi_{d,k}(h_{d}){\mathsf{P}}_{k}f$ . The eigenvalues can be computed via

[TABLE]

5.3 Hermite polynomials

The Hermite polynomials $\{{\rm He}_{k}\}_{k\geq 0}$ form an orthogonal basis of $L^{2}({\mathbb{R}},\gamma)$ , where $\gamma({\rm d}x)=e^{-x^{2}/2}{\rm d}x/\sqrt{2\pi}$ is the standard Gaussian measure, and ${\rm He}_{k}$ has degree $k$ . We will follow the classical normalization (here and below, expectation is with respect to $G\sim{\sf N}(0,1)$ ):

[TABLE]

As a consequence, for any function $g\in L^{2}({\mathbb{R}},\gamma)$ , we have the decomposition

[TABLE]

The Hermite polynomials can be obtained as high-dimensional limits of the Gegenbauer polynomials introduced in the previous section. Indeed, the Gegenbauer polynomials (up to a $\sqrt{d}$ scaling in domain) are constructed by Gram-Schmidt orthogonalization of the monomials $\{x^{k}\}_{k\geq 0}$ with respect to the measure $\tilde{\tau}^{1}_{d-1}$ , while the Hermite polynomials are obtained by Gram-Schmidt orthogonalization with respect to $\gamma$ . Since $\tilde{\tau}^{1}_{d-1}\Rightarrow\gamma$ (here $\Rightarrow$ denotes weak convergence), it is immediate to show that, for any fixed integer $k$ ,

[TABLE]

Here and below, for $P$ a polynomial, ${\rm Coeff}\{P(x)\}$ is the vector of the coefficients of $P$ . As a consequence, for any fixed integer $k$ , we have

[TABLE]

where $\mu_{k}(\sigma)$ and $\lambda_{d,k}(\sigma)$ are given in Eq. (42) and (38).
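The convergence of Gegenbauer to Hermite polynomials can be observed numerically. The Python sketch below is a hedged illustration: it assumes that the limit in question is the coefficient-wise convergence of $B(d,k)^{1/2}\,Q_{k}^{(d)}(\sqrt{d}\,x)$ to ${\rm He}_{k}(x)/\sqrt{k!}$, and it builds both families from their standard three-term recurrences. The function names and the choice $d=10^{6}$ are arbitrary.

```python
from fractions import Fraction
from math import comb, factorial, sqrt

def harmonic_dim(d, k):
    """B(d, k): dimension of the degree-k spherical harmonics on the sphere in R^d."""
    return comb(d + k - 1, k) - (comb(d + k - 3, k - 2) if k >= 2 else 0)

def gegenbauer_in_x(d, k):
    """Float coefficients (lowest degree first) of sqrt(B(d,k)) * Q_k^{(d)}(sqrt(d) * x)."""
    # P_j(u) = Q_j^{(d)}(d * u) obeys (j + d - 2) P_{j+1} = (2j + d - 2) u P_j - j P_{j-1}.
    polys = [[Fraction(1)], [Fraction(0), Fraction(1)]]
    for j in range(1, k):
        up = [Fraction(0)] + polys[j]
        prev = polys[j - 1] + [Fraction(0)] * 2
        polys.append([((2 * j + d - 2) * a - j * b) / (j + d - 2)
                      for a, b in zip(up, prev)])
    scale = sqrt(harmonic_dim(d, k))
    # Substituting u = x / sqrt(d) rescales the degree-n coefficient by d^(-n/2).
    return [float(c) * scale / d ** (n / 2) for n, c in enumerate(polys[k])]

def hermite_normalized(k):
    """Coefficients (lowest degree first) of He_k(x) / sqrt(k!)."""
    polys = [[1], [0, 1]]
    for j in range(1, k):
        # He_{j+1} = x He_j - j He_{j-1}
        polys.append([a - j * b
                      for a, b in zip([0] + polys[j], polys[j - 1] + [0, 0])])
    return [c / sqrt(factorial(k)) for c in polys[k]]

if __name__ == "__main__":
    for k in range(5):
        print(k, gegenbauer_in_x(10**6, k), hermite_normalized(k))
```

For $k=2$ the computation can be done by hand: the rescaled Gegenbauer polynomial equals $(x^{2}-1)\sqrt{(d+2)/(2(d-1))}$, which tends to ${\rm He}_{2}(x)/\sqrt{2}$ as $d\to\infty$.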

5.4 Notations

Throughout the proofs, $O_{d}(\,\cdot\,)$ (resp. $o_{d}(\,\cdot\,)$ ) denotes the standard big-O (resp. little-o) notation, where the subscript $d$ emphasizes the asymptotic variable. We denote $O_{d,\mathbb{P}}(\,\cdot\,)$ (resp. $o_{d,\mathbb{P}}(\,\cdot\,)$ ) the big-O (resp. little-o) in probability notation: $h_{1}(d)=O_{d,\mathbb{P}}(h_{2}(d))$ if for any $\varepsilon>0$ , there exists $C_{\varepsilon}>0$ and $d_{\varepsilon}\in\mathbb{Z}_{>0}$ , such that

[TABLE]

and respectively: $h_{1}(d)=o_{d,\mathbb{P}}(h_{2}(d))$ , if $h_{1}(d)/h_{2}(d)$ converges to $0$ in probability.

We will occasionally hide logarithmic factors using the $\tilde{O}_{d}(\,\cdot\,)$ notation (resp. $\tilde{o}_{d}(\,\cdot\,)$ ): $h_{1}(d)=\tilde{O}_{d}(h_{2}(d))$ if there exists a constant $C$ such that $h_{1}(d)\leq C(\log d)^{C}h_{2}(d)$ . Similarly, we will denote $\tilde{O}_{d,\mathbb{P}}(\,\cdot\,)$ (resp. $\tilde{o}_{d,\mathbb{P}}(\,\cdot\,)$ ) when considering the big-O in probability notation up to a logarithmic factor.

6 Proof of Theorem 1.(a): RF model lower bound

6.1 Proof of Theorem 1.(a): Outline

Recall that $({\bm{w}}_{i})_{i\in[N]}\sim{\sf Unif}(\mathbb{S}^{d-1})$ independently. We define ${\bm{\theta}}_{i}=\sqrt{d}\cdot{\bm{w}}_{i}$ for $i\in[N]$ , so that $({\bm{\theta}}_{i})_{i\in[N]}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ independently. Let ${\bm{W}}=({\bm{w}}_{1},\ldots,{\bm{w}}_{N})$ , and ${\bm{\Theta}}=({\bm{\theta}}_{1},\ldots,{\bm{\theta}}_{N})$ . We denote $\mathbb{E}_{\bm{\theta}}$ to be the expectation operator with respect to ${\bm{\theta}}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ , $\mathbb{E}_{\bm{x}}$ to be the expectation operator with respect to ${\bm{x}}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ , and $\mathbb{E}_{\bm{w}}$ to be the expectation operator with respect to ${\bm{w}}\sim{\sf Unif}(\mathbb{S}^{d-1}(1))$ .

Define the random vectors ${\bm{V}}=(V_{1},\ldots,V_{N})^{\mathsf{T}}$ , ${\bm{V}}_{\leq\ell}=(V_{1,\leq\ell},\ldots,V_{N,\leq\ell})^{\mathsf{T}}$ , ${\bm{V}}_{>\ell}=(V_{1,>\ell},\ldots,V_{N,>\ell})^{\mathsf{T}}$ , with

[TABLE]

Define the random matrix ${\bm{U}}=(U_{ij})_{i,j\in[N]}$ , with

[TABLE]

In what follows, we write $R_{{\sf RF}}(f_{d})=R_{{\sf RF}}(f_{d},{\bm{W}})=R_{{\sf RF}}(f_{d},{\bm{\Theta}}/\sqrt{d})$ for the random features risk, omitting the dependence on the weights ${\bm{W}}={\bm{\Theta}}/\sqrt{d}$ . By the definition and a simple calculation, we have

[TABLE]

By orthogonality, we have

[TABLE]

which gives

[TABLE]

where the last inequality used the fact that

[TABLE]

so that

[TABLE]

We claim that we have

[TABLE]

This follows from Propositions 1 and 2 stated below.

We will denote below by $\lambda_{k}(\sigma_{d})$ , $k\geq 0$ , the coefficients of $\sigma_{d}$ in the basis of Gegenbauer polynomials. Explicitly, since $\sigma_{d}(\langle{\bm{e}},\cdot\rangle)\in L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))$ , we can expand $\sigma_{d}$ as

[TABLE]

where

[TABLE]

Proposition 1 (Expected norm of ${\bm{V}}$ ).

Let $\{\sigma_{d}\}_{d\geq 1}$ be a sequence of activation functions with $\sigma_{d}(\langle{\bm{e}},\cdot\rangle)\in L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))$ . Define ${\mathcal{E}}_{\geq\ell}$ by

[TABLE]

Then

[TABLE]

Proposition 2 (Lower bound on the kernel matrix).

Assume $N\leq d^{\ell+1}/e^{A_{d}\sqrt{\log d}}$ for a fixed integer $\ell$ and any $A_{d}\to\infty$ (in particular, $N\leq d^{\ell+1-\delta}$ is sufficient for any fixed $\delta>0$ ). Let $({\bm{\theta}}_{i})_{i\in[N]}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ independently, and $\{\sigma_{d}\}_{d\geq 1}$ be a sequence of activation functions with $\sigma_{d}(\langle{\bm{e}},\cdot\rangle)\in L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))$ . Let ${\bm{U}}\in\mathbb{R}^{N\times N}$ be the kernel matrix defined by Eq. (48). Then for any $\varepsilon\in(0,1)$ ,

[TABLE]

with high probability as $d\to\infty$ .

The proof of Proposition 2 relies on the following tight bound on the operator norm of the Gegenbauer polynomials of the Gram matrix:

Proposition 3 (Bound on the Gram matrix).

Let $N\leq d^{k}/e^{A_{d}\sqrt{\log d}}$ for a fixed integer $k$ and any $A_{d}\to\infty$ . Let $({\bm{\theta}}_{i})_{i\in[N]}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ independently, and $Q_{k}^{(d)}$ be the $k$ -th Gegenbauer polynomial with domain $[-d,d]$ . Consider the random matrix ${\bm{W}}=({\bm{W}}_{ij})_{i,j\in[N]}\in\mathbb{R}^{N\times N}$ , with ${\bm{W}}_{ij}=Q_{k}^{(d)}(\langle{\bm{\theta}}_{i},{\bm{\theta}}_{j}\rangle)$ . Then we have

[TABLE]

The proofs of these three propositions are provided in the next sections. Proposition 1 implies

[TABLE]

From Proposition 2, we have with high probability

[TABLE]

Then by Markov's inequality, we have with high probability

[TABLE]

Equation (50) follows by noting that $B(d,k)$ is non-decreasing in $k$ (see Lemma 1 below) and $B(d,\ell+1)=\Theta_{d}(d^{\ell+1})$ , and recalling $N=o_{d}(d^{\ell+1})$ . Combining with Eq. (49), the theorem holds.

Lemma 1.

The number $B(d,k)$ of independent degree- $k$ spherical harmonics on $\mathbb{S}^{d-1}$ is non-decreasing in $k$ for any fixed $d\geq 2$ .

Proof of Lemma 1.

By [EF14, Section 4.1], we have

[TABLE]

and

[TABLE]

where $K(d-1,j)={d-2+j\choose j}$ is non-negative for $d\geq 2$ . This immediately shows that $B(d,k)$ is non-decreasing in $k$ . ∎
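The monotonicity argument can be sanity-checked numerically. The Python sketch below is an illustration under an assumption: it takes the two identities from [EF14] to be $B(d,k)=K(d,k)-K(d,k-2)$ and $K(d,k)=\sum_{j=0}^{k}K(d-1,j)$, with $K(d,k)={d-1+k\choose k}$ the dimension of the space of homogeneous polynomials of degree $k$ in $d$ variables. The function names mirror the notation but are otherwise hypothetical.

```python
from math import comb

def K(d, k):
    """Dimension of homogeneous polynomials of degree k in d variables (0 if k < 0)."""
    return comb(d - 1 + k, k) if k >= 0 else 0

def B(d, k):
    """Dimension of the degree-k spherical harmonics on the sphere in R^d."""
    return K(d, k) - K(d, k - 2)

if __name__ == "__main__":
    # Rows: d = 2, ..., 5; columns: k = 0, ..., 5. Each row is non-decreasing.
    for d in range(2, 6):
        print(d, [B(d, k) for k in range(6)])
    # For fixed k, B(d, k) * k! / d^k approaches 1 as d grows.
    print(B(10**4, 3) * 6 / 10**12)
```

Combining the two assumed identities gives $B(d,k)=K(d-1,k)+K(d-1,k-1)$, a sum of non-negative terms that can only grow with $k$, which is the content of the lemma.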

6.2 Proof of Proposition 1

The quantity ${\mathcal{E}}_{\geq\ell}$ can be rewritten as

[TABLE]

First we calculate $\mathbb{E}_{\bm{x}}[{\mathsf{P}}_{k}f_{\star}({\bm{x}})\sigma_{d}(\langle{\bm{\theta}},{\bm{x}}\rangle/\sqrt{d})]$ . Note that the spherical harmonics expansion of ${\mathsf{P}}_{k}f_{\star}$ gives

[TABLE]

and the Gegenbauer expansion of $\sigma_{d}$ gives

[TABLE]

By the fact that

[TABLE]

we have

[TABLE]

We deduce that

[TABLE]

This proves the proposition.

6.3 Proof of Proposition 2

Recall the expansion of $\sigma_{d}$ in terms of Gegenbauer polynomials, see Eqs. (51) and (52). From the properties of Gegenbauer polynomials, we have

[TABLE]

We can therefore decompose ${\bm{U}}$ :

[TABLE]

where ${\bm{W}}_{k}=(W_{k,ij})_{i,j\in[N]}$ with $W_{k,ij}=Q_{k}(\langle{\bm{\theta}}_{i},{\bm{\theta}}_{j}\rangle)$ .

Define

[TABLE]

Note that

[TABLE]

where $\hat{\sigma}$ is given by

[TABLE]

As a result, we have $\hat{\bm{U}}\succeq 0$ , and hence

[TABLE]

In the following, we give a lower bound for $\bar{\bm{U}}$ . Note we have

[TABLE]

By Proposition 3, we have

[TABLE]

Further we have

[TABLE]

For $d$ sufficiently large, there exists $C>0$ such that for any $p\geq m\equiv 2\ell+3$ :

[TABLE]

Hence, there exists constant $C^{\prime}$ , such that for large $d$ , we have

[TABLE]

Recalling that $B(d,2\ell+3)=\Theta_{d}(d^{2\ell+3})$ , and $N=o_{d}(d^{\ell+1})$ , we deduce

[TABLE]

Combining Eq. (54) and (55) we get

[TABLE]

Plugging Eq. (56) into Eq. (53), we get with high probability

[TABLE]

Hence the proposition follows.

6.4 Proof of Proposition 3

Step 1. Bounding operator norm by moments.

We define ${\bm{\Delta}}={\bm{W}}-{\mathbf{I}}_{N}$ . Then we have

[TABLE]

For any sequence of integers $p=p(d)$ , we have

[TABLE]
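The reasoning behind a moment bound of this type is standard and can be spelled out as follows. This is a sketch, assuming only that ${\bm{\Delta}}$ is symmetric (which holds since ${\bm{W}}_{ij}$ depends on $\langle{\bm{\theta}}_{i},{\bm{\theta}}_{j}\rangle$ alone) with real eigenvalues $\lambda_{1},\ldots,\lambda_{N}$ ; the symbols $\lambda_{i}$ are introduced here for illustration.

```latex
% Sketch: the operator norm is controlled by an even trace moment, then Jensen.
\[
\|{\bm{\Delta}}\|_{\rm op}^{2p}
  = \max_{i\le N}\lambda_{i}^{2p}
  \le \sum_{i=1}^{N}\lambda_{i}^{2p}
  = {\rm Tr}({\bm{\Delta}}^{2p}),
\qquad\text{hence}\qquad
\mathbb{E}\big[\|{\bm{\Delta}}\|_{\rm op}\big]
  \le \mathbb{E}\big[{\rm Tr}({\bm{\Delta}}^{2p})^{1/(2p)}\big]
  \le \mathbb{E}\big[{\rm Tr}({\bm{\Delta}}^{2p})\big]^{1/(2p)} .
\]
```

The last step is Jensen's inequality for the concave map $t\mapsto t^{1/(2p)}$ on $t\geq 0$ .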

To prove the proposition, it suffices to show that for any sequence $A_{d}\to\infty$ , we have

[TABLE]

In the following, we calculate $\mathbb{E}[{\rm Tr}({\bm{\Delta}}^{2p})]$ . We have

[TABLE]

To calculate this quantity, we will apply repeatedly the following identity, which is an immediate consequence of Eq. (33). For any $i_{1},i_{2},i_{3}$ distinct, we have

[TABLE]

Throughout the proof, we will denote by $C,C^{\prime},C^{\prime\prime}$ constants that may depend on $k$ but not on $p,d,N$ . The value of these constants is allowed to change from line to line.

Step 2. The induced graph and equivalence of index sequences.

For any index sequence ${\bm{i}}=(i_{1},i_{2},\ldots,i_{2p})\in[N]^{2p}$ , we define an undirected multigraph $G_{\bm{i}}=(V_{\bm{i}},E_{\bm{i}})$ associated to the index sequence ${\bm{i}}$ . The vertex set $V_{\bm{i}}$ is the set of distinct elements in $i_{1},\ldots,i_{2p}$ . The edge set $E_{{\bm{i}}}$ is formed as follows: for any $j\in[2p]$ we add an edge between $i_{j}$ and $i_{j+1}$ (with convention $2p+1\equiv 1$ ). Notice that this could be a self-edge, or a repeated edge: $G_{\bm{i}}=(V_{\bm{i}},E_{\bm{i}})$ will be, in general, a multigraph. We denote by $v({\bm{i}})=|V_{\bm{i}}|$ the number of vertices of $G_{\bm{i}}$ , and by $e({\bm{i}})=|E_{\bm{i}}|$ the number of edges (counting multiplicities). In particular, $e({\bm{i}})=k$ for ${\bm{i}}\in[N]^{k}$ . We define

[TABLE]

For any two index sequences ${\bm{i}}_{1},{\bm{i}}_{2}$ , we say they are equivalent, written ${\bm{i}}_{1}\asymp{\bm{i}}_{2}$ , if the two graphs $G_{{\bm{i}}_{1}}$ and $G_{{\bm{i}}_{2}}$ are isomorphic, i.e. there exists an edge-preserving bijection of their vertices (ignoring vertex labels). We denote the equivalence class of ${\bm{i}}$ by

[TABLE]

We define the quotient set ${\mathcal{Q}}(p)$ by

[TABLE]

For any integer $k\geq 2$ and ${\bm{i}}=(i_{1},\ldots,i_{k})\in[N]^{k}$ , we define

[TABLE]

Lemma 2.

The following properties hold for all sufficiently large $N$ and $d$ :

$(a)$

For any equivalent index sequences ${\bm{i}}=(i_{1},\ldots,i_{2p})\asymp{\bm{j}}=(j_{1},\ldots,j_{2p})$ , we have $M_{{\bm{i}}}=M_{{\bm{j}}}$ .

$(b)$

For any index sequence ${\bm{i}}\in[N]^{2p}\setminus{\mathcal{T}}_{\star}(p)$ , we have $M_{{\bm{i}}}=0$ .

$(c)$

For any index sequence ${\bm{i}}\in{\mathcal{T}}_{\star}(p)$ , the degree of any vertex in $G_{\bm{i}}$ must be even.

$(d)$

The number of equivalence classes satisfies $|{\mathcal{Q}}(p)|\leq(2p)^{2p}$ .

$(e)$

Recall that $v({\bm{i}})=|V_{\bm{i}}|$ denotes the number of distinct elements in ${\bm{i}}$ . Then, for any ${\bm{i}}\in[N]^{2p}$ , the number of elements in the corresponding equivalence class satisfies $|{\mathcal{C}}({\bm{i}})|\leq v({\bm{i}})^{v({\bm{i}})}\cdot N^{v({\bm{i}})}\leq p^{p}N^{v({\bm{i}})}$ .

Proof.

Properties $(a)$ , $(b)$ and $(c)$ are straightforward. Note that $v({\bm{i}})\leq 2p$ for any ${\bm{i}}\in[N]^{2p}$ . For property $(d)$ , notice that to each distinct equivalence class we can associate, in an injective manner, a string of length $2p$ over an alphabet of size $2p$ (simply follow the elements in ${\bm{i}}$ in order, and replace the labels by some canonical ones, e.g. $\{1,2,3,\dots\}$ in order of appearance). Therefore the number of classes is bounded as

[TABLE]

For property $(e)$ , we need to bound the number of elements in ${\mathcal{C}}({\bm{i}})$ for representative ${\bm{i}}$ with degree $v({\bm{i}})$ . Define a mapping $\psi:{\mathcal{C}}({\bm{i}})\to[N]^{v({\bm{i}})}$ as follows. For ${\bm{i}}\in[N]^{2p}$ , $\psi({\bm{i}})$ is a vector of the distinct elements in ${\bm{i}}$ , listed in increasing order. For any ${\bm{k}}\in[N]^{v({\bm{i}})}$ , the pre-image $\psi^{-1}({\bm{k}})$ contains at most $v({\bm{i}})!\leq v({\bm{i}})^{v({\bm{i}})}$ elements. As a result, we have

[TABLE]

This proves property $(e)$ . ∎
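The counting argument behind property $(d)$ can be illustrated by brute force for small parameters. The Python sketch below is illustrative only: the function names are hypothetical, and it counts relabelled patterns of all sequences in $[N]^{2p}$ rather than graph isomorphism classes. Since two sequences with the same pattern have isomorphic graphs, the number of patterns upper-bounds the number of equivalence classes.

```python
from itertools import product

def canonical(seq):
    """Relabel entries by order of first appearance: (7, 3, 7, 5) -> (1, 2, 1, 3)."""
    labels = {}
    return tuple(labels.setdefault(v, len(labels) + 1) for v in seq)

def count_patterns(N, length):
    """Number of distinct canonical patterns among all sequences in [N]^length."""
    return len({canonical(s) for s in product(range(N), repeat=length)})

if __name__ == "__main__":
    p = 2
    patterns = count_patterns(N=4, length=2 * p)
    # Every pattern is a string of length 2p over an alphabet of size 2p.
    print(patterns, (2 * p) ** (2 * p))
```

With $p=2$ and $N=4$ the count is the number of set partitions of four positions, far below the bound $(2p)^{2p}=256$ ; the bound in the lemma is deliberately crude.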

In view of property $(a)$ in the last lemma, given an equivalence class ${\mathcal{C}}={\mathcal{C}}({\bm{i}})$ , we will write $M_{{\mathcal{C}}}=M_{{\bm{i}}}$ for the corresponding value common to the equivalence class ${\mathcal{C}}$ .

Step 3. The skeletonization process.

For a multigraph $G$ , we say that one of its vertices is redundant if it has degree 2. For any index sequence ${\bm{i}}\in{\mathcal{T}}_{\star}(p)\subset[N]^{2p}$ (i.e. such that $G_{\bm{i}}$ does not have self-edges), we denote by $r({\bm{i}})\in\mathbb{N}_{+}$ the redundancy of ${\bm{i}}$ , and by ${\rm sk}({\bm{i}})$ the skeleton of ${\bm{i}}$ , both defined by the following skeletonization process. Let ${\bm{i}}_{0}={\bm{i}}\in[N]^{2p}$ . For any integer $s\geq 0$ , if $G_{{\bm{i}}_{s}}$ has no redundant vertices then stop and set ${\rm sk}({\bm{i}})={\bm{i}}_{s}$ . Otherwise, select a redundant vertex ${\bm{i}}_{s}(\ell)$ arbitrarily (the $\ell$ -th element of ${\bm{i}}_{s}$ ). If ${\bm{i}}_{s}(\ell-1)\neq{\bm{i}}_{s}(\ell+1)$ , then remove ${\bm{i}}_{s}(\ell)$ from the graph (and from the sequence), together with its adjacent edges, connect ${\bm{i}}_{s}(\ell-1)$ and ${\bm{i}}_{s}(\ell+1)$ with an edge, and denote by ${\bm{i}}_{s+1}$ the resulting index sequence, i.e., ${\bm{i}}_{s+1}=({\bm{i}}_{s}(1),\ldots,{\bm{i}}_{s}(\ell-1),{\bm{i}}_{s}(\ell+1),\ldots,{\bm{i}}_{s}({\rm end}))$ . If ${\bm{i}}_{s}(\ell-1)={\bm{i}}_{s}(\ell+1)$ , then remove ${\bm{i}}_{s}(\ell)$ from the graph (and from the sequence), together with its adjacent edges, keep a single copy of the repeated neighbor, and denote by ${\bm{i}}_{s+1}$ the resulting index sequence, i.e., ${\bm{i}}_{s+1}=({\bm{i}}_{s}(1),\ldots,{\bm{i}}_{s}(\ell-1),{\bm{i}}_{s}(\ell+2),\ldots,{\bm{i}}_{s}({\rm end}))$ . (Here $\ell+1$ , and $\ell-1$ have to be interpreted modulo $|{\bm{i}}_{s}|$ , the length of ${\bm{i}}_{s}$ .) The redundancy of ${\bm{i}}$ , denoted by $r({\bm{i}})$ , is the number of vertices removed during the skeletonization process.

It is easy to see that the outcome of this process is independent of the order in which we select vertices.

Example 1.

For illustration, we give two examples of skeletonization processes:

•

Let ${\bm{i}}=(1,2,1,3,4,3)$ , and set ${\bm{i}}_{0}={\bm{i}}$ . First notice that $\{2,4\}$ are redundant vertices and we can remove them in arbitrary order to get ${\bm{i}}_{2}=(1,3)$ . Then notice that $3$ is redundant whence we get ${\bm{i}}_{3}=(1)$ . Hence we have $r({\bm{i}})=3$ , and ${\rm sk}({\bm{i}})=(1)$ .

•

Consider the skeletonization process of ${\bm{j}}=(1,2,3,2,4,3)$ . Take ${\bm{j}}_{0}={\bm{j}}$ . First notice that $\{1,4\}$ are redundant vertices and can be removed in arbitrary order to get ${\bm{j}}_{2}=(2,3,2,3)$ . We see that there is no further redundant vertex in $G_{{\bm{j}}_{2}}$ , so that $r({\bm{j}})=2$ , and ${\rm sk}({\bm{j}})={\bm{j}}_{2}=(2,3,2,3)$ .
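The skeletonization process and the two examples can be traced mechanically. The Python sketch below is an illustration, not part of the proof. The function name `skeletonize` is hypothetical, the sequence is treated cyclically, and a vertex is taken to be redundant exactly when it occurs once in the sequence, which coincides with having degree 2 when the graph has no self-edges. The process is stopped once a single element remains.

```python
from collections import Counter

def skeletonize(seq):
    """Return (skeleton, redundancy) of a cyclic index sequence without self-edges."""
    seq = list(seq)
    removed = 0
    while len(seq) > 1:
        counts = Counter(seq)
        # A vertex has degree 2 exactly when it occurs once in the cyclic sequence.
        pos = next((t for t, v in enumerate(seq) if counts[v] == 1), None)
        if pos is None:
            break  # no redundant vertex is left
        n = len(seq)
        left, right = seq[(pos - 1) % n], seq[(pos + 1) % n]
        if left != right or n == 2:
            # Neighbours differ (or only one other vertex is left): drop the
            # vertex, which joins its two neighbours by a single new edge.
            del seq[pos]
        else:
            # Neighbours coincide: drop the vertex and one copy of the repeated
            # neighbour, so that no self-edge is created.
            drop = {pos, (pos + 1) % n}
            seq = [v for t, v in enumerate(seq) if t not in drop]
        removed += 1
    return tuple(seq), removed

if __name__ == "__main__":
    print(skeletonize((1, 2, 1, 3, 4, 3)))  # single-element skeleton, redundancy 3
    print(skeletonize((1, 2, 3, 2, 4, 3)))  # skeleton (2, 3, 2, 3), redundancy 2
```

The label of the surviving vertex in a single-element skeleton depends on which redundant vertex is removed first, so only the equivalence class of the skeleton and the value of the redundancy are meaningful outputs.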

Lemma 3.

For the above skeletonization process, the following properties hold:

$(a)$

If ${\bm{i}}\asymp{\bm{j}}\in[N]^{p}$ , then ${\rm sk}({\bm{i}})\asymp{\rm sk}({\bm{j}})$ . That is, the skeletons of equivalent index sequences are equivalent.

$(b)$

For any ${\bm{i}}=(i_{1},\ldots,i_{k})\in[N]^{k}$ , define

[TABLE]

Then we have

[TABLE]

$(c)$

For any ${\bm{i}}\in{\mathcal{T}}_{\star}(p)\subset[N]^{2p}$ , its skeleton is either formed by a single element, or an index sequence whose graph has the property that every vertex has degree greater or equal to $4$ .

Proof.

Property $(a)$ holds by the definition of equivalence, which is graph isomorphism. Property $(b)$ uses the fact that, if $i\neq j_{1}$ and $i\neq j_{2}$ , we have

[TABLE]

so that deleting a redundant vertex will contribute a $1/B(d,k)$ factor.

To show property $(c)$ , note that any intermediate index sequence ${\bm{i}}_{s}$ in the skeletonization process is such that $G_{{\bm{i}}_{s}}$ only has even degree vertices, is connected, and has no self-edges (by induction). Hence, $G_{{\rm sk}({\bm{i}})}$ only has even degree vertices, is connected, and has no self-edges. Note that $G_{{\rm sk}({\bm{i}})}$ cannot have degree-2 vertices, and has at least one vertex (because the last vertex is not removed). Therefore, as long as ${\rm sk}({\bm{i}})$ contains at least two vertices, $G_{{\rm sk}({\bm{i}})}$ can only contain vertices with degree greater or equal to $4$ . ∎

Given an index sequence ${\bm{i}}\in{\mathcal{T}}_{\star}(p)\subset[N]^{2p}$ , we say ${\bm{i}}$ is of type 1, if ${\rm sk}({\bm{i}})$ contains only one index. We say ${\bm{i}}$ is of type 2 if ${\rm sk}({\bm{i}})$ has more than one index (so that by Lemma 3, $G_{{\rm sk}({\bm{i}})}$ can only contain vertices with degree greater or equal to $4$ ). Denote the class of type 1 index sequence (respectively type 2 index sequence) by ${\mathcal{T}}_{1}(p)$ (respectively ${\mathcal{T}}_{2}(p)$ ). We also denote by $\widetilde{\mathcal{T}}_{a}(p)$ , $a\in\{1,2\}$ the set of equivalence classes of sequences in ${\mathcal{T}}_{a}(p)$ . This definition makes sense since the equivalence class of the skeleton of a sequence only depends on the equivalence class of the sequence itself.

Step 4. Type 1 index sequences.

Recall that $v({\bm{i}})$ is the number of vertices in $G_{\bm{i}}$ , and $e({\bm{i}})$ is the number of edges in $G_{\bm{i}}$ (which coincides with the length of ${\bm{i}}$ ). We consider ${\bm{i}}\in{\mathcal{T}}_{1}(p)$ . For such ${\bm{i}}$ , every edge of $G_{\bm{i}}$ must be at most a double edge. Indeed, if $(u_{1},u_{2})$ had multiplicity larger than $2$ in $G_{{\bm{i}}}$ , neither $u_{1}$ nor $u_{2}$ could be deleted during the skeletonization process, contradicting the assumption that ${\rm sk}({\bm{i}})$ contains a single vertex. Therefore, we must have $\min_{{\bm{i}}\in{\mathcal{T}}_{1}}v({\bm{i}})=p+1$ . According to Lemma 3. $(b)$ , for every ${\bm{i}}\in{\mathcal{T}}_{1}(p)$ , we have

[TABLE]

Note by Lemma 2. $(e)$ , the number of elements in the equivalence class of ${\bm{i}}$ is $|{\mathcal{C}}({\bm{i}})|\leq p^{p}\cdot N^{v({\bm{i}})}$ . Hence we get

[TABLE]

Therefore

[TABLE]

where in the last step we used Lemma 2 and the fact that $B(d,k)\geq C_{0}d^{k}$ for some $C_{0}>0$ .

Step 5. Type 2 index sequences.

We have the following simple lemma bounding $M_{\bm{i}}$ . This bound is useful when ${\bm{i}}$ is a skeleton.

Lemma 4.

There exist constants $C$ and $d_{0}$ depending only on $k$ such that, for any $d\geq d_{0}(k)$ , and any index sequence ${\bm{i}}\in[N]^{m}$ with $2\leq m\leq d/(4k)$ , we have

[TABLE]

Proof.

By Hölder’s inequality, we have

[TABLE]

The lemma follows from the claim that (for $d\geq d_{0}(k)$ )

[TABLE]

In the following, we will write ${\rm Coeff}\{q(x);x^{\ell}\}$ for the coefficient of $x^{\ell}$ in the polynomial $q(x)$ . To show the above claim, recall that we have, for any $\ell$ ,

[TABLE]

Therefore there exists a constant $C_{0}$ such that for all $d$ large enough

[TABLE]

As a consequence, for any integer $m$ , we have

[TABLE]

Define the random variable $G_{d}=\langle{\bm{\theta}}_{i},{\bm{\theta}}_{j}\rangle/\sqrt{d}$ for ${\bm{\theta}}_{i},{\bm{\theta}}_{j}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ . The probability distribution of $G_{d}$ is given by $\tau^{1}_{d-1}$ given in Eq. (78) below. Hence defining $A_{d}\equiv\Gamma(d-1)/(2^{d-2}\sqrt{d}\,\Gamma((d-1)/2)^{2})$ , we have (since $A_{d}\leq 1$ for all $d$ large enough)

[TABLE]

where $G\sim{\sf N}(0,1)$ . Therefore, for all $\ell\leq d/2$ ,

[TABLE]

Combining the above two upper bounds (63) and (64), we have

[TABLE]

By noting that $B(d,k)\geq C_{0}d^{k}$ for some $C_{0}>0$ , this proves the claim. ∎

Suppose ${\bm{i}}\in{\mathcal{T}}_{2}(p)$ , and denote $v({\bm{i}})$ to be the number of vertices in $G_{\bm{i}}$ . We have, for a sequence $p=o_{d}(d)$

[TABLE]

Here $(1)$ holds by Lemma 3. $(b)$ ; $(2)$ by Lemma 4, and the fact that ${\rm sk}({\bm{i}})\in[N]^{e({\rm sk}({\bm{i}}))}$ , together with $B(d,k)\geq C_{0}d^{k}$ ; $(3)$ because $e({\rm sk}({\bm{i}}))\leq 2p$ ; $(4)$ by Lemma 3. $(c)$ , implying that for ${\bm{i}}\in{\mathcal{T}}_{2}(p)$ , each vertex of $G_{{\rm sk}({\bm{i}})}$ has degree greater than or equal to $4$ , so that $v({\rm sk}({\bm{i}}))\leq e({\rm sk}({\bm{i}}))/2$ (notice that for $d\geq d_{0}(k)$ we can assume $Cp/d<1$ ). Finally, $(5)$ follows since $r({\bm{i}}),v({\rm sk}({\bm{i}}))\leq v({\bm{i}})$ , and $(6)$ follows from the definition of $r({\bm{i}})$ , which implies $r({\bm{i}})=v({\bm{i}})-v({\rm sk}({\bm{i}}))$ .

Note that, by Lemma 2. $(e)$ , the number of elements in the equivalence class satisfies $|{\mathcal{C}}({\bm{i}})|\leq p^{v({\bm{i}})}\cdot N^{v({\bm{i}})}$ . Since $v({\bm{i}})$ depends only on the equivalence class of ${\bm{i}}$ , we will write, with a slight abuse of notation $v({\bm{i}})=v({\mathcal{C}}({\bm{i}}))$ . Notice that the number of equivalence classes with $v({\mathcal{C}})=v$ is upper bounded by the number of multigraphs with $v$ vertices and $2p$ edges, which is at most $v^{4p}$ . Hence we get

[TABLE]

Define $\varepsilon=CNp^{k+1}/d^{k}$ . We will assume hereafter that $p$ is selected such that

[TABLE]

By calculus and condition (68), the function $F(v)=v^{4p}\varepsilon^{v}$ is maximized over $v\in[2,2p]$ at $v=2$ , whence

[TABLE]

Step 6. Concluding the proof.

Using Eqs. (61) and (69), for any $p=o_{d}(d)$ satisfying Eq. (68), we have

[TABLE]

From Eq. (57), we obtain

[TABLE]

Finally setting $N=d^{k}e^{-2A\sqrt{\log d}}$ and $p=(k/A)\sqrt{\log d}$ , this yields

[TABLE]

Therefore, as long as $A\to\infty$ , we have $\mathbb{E}[\|{\bm{\Delta}}\|_{{\rm op}}]\to 0$ . It is immediate to check that the above choice of $p$ satisfies the required conditions $p=o_{d}(d)$ and Eq. (68) for all $d$ large enough.

7 Proof of Theorem 1.(b): RF model upper bound

Recall that $({\bm{w}}_{i})_{i\in[N]}\sim{\sf Unif}(\mathbb{S}^{d-1})$ independently. We define ${\bm{\theta}}_{i}=\sqrt{d}\cdot{\bm{w}}_{i}$ for $i\in[N]$ , so that $({\bm{\theta}}_{i})_{i\in[N]}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ independently. Let ${\bm{W}}=({\bm{w}}_{1},\ldots,{\bm{w}}_{N})$ , and ${\bm{\Theta}}=({\bm{\theta}}_{1},\ldots,{\bm{\theta}}_{N})$ . We denote $\mathbb{E}_{\bm{\theta}}$ to be the expectation operator with respect to ${\bm{\theta}}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ , $\mathbb{E}_{\bm{x}}$ to be the expectation operator with respect to ${\bm{x}}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ , and $\mathbb{E}_{\bm{w}}$ to be the expectation operator with respect to ${\bm{w}}\sim{\sf Unif}(\mathbb{S}^{d-1}(1))$ .

Without loss of generality, assume that $\{f_{d}\}_{d\geq 0}$ are polynomials of degree at most $\ell$ , i.e. $f_{d}={\mathsf{P}}_{\leq\ell}f_{d}$ . We denote the expansion of $\sigma_{d}$ in terms of Gegenbauer polynomials by (for ${\bm{\theta}},{\bm{x}}\in\mathbb{S}^{d-1}(\sqrt{d})$ )

[TABLE]

where

[TABLE]

Denote $\mathcal{L}=L^{2}(\mathbb{S}^{d-1}(\sqrt{d})\rightarrow\mathbb{R})$ . We introduce the operator ${\mathbb{T}}:{\mathcal{L}}\to{\mathcal{L}}$ , such that for any $g\in{\mathcal{L}}$

[TABLE]

In particular, for any $k\in\mathbb{N}$ and $1\leq u\leq B(d,k)$ , we have

[TABLE]

It is easy to check that ${\mathbb{T}}^{*}$ (the adjoint operator) has the same expression as ${\mathbb{T}}$ with ${\bm{x}}$ and ${\bm{\theta}}$ swapped. We define the operator ${\mathbb{K}}:{\mathcal{L}}\to{\mathcal{L}}$ as ${\mathbb{K}}\equiv{\mathbb{T}}{\mathbb{T}}^{*}$ . For any $g\in{\mathcal{L}}$ , we have

[TABLE]

where

[TABLE]

We will restrict ourselves to the subspace $V_{d,\leq\ell}$ of polynomials of degree less or equal to $\ell$ . We have for $0\leq k\leq\ell$ and $1\leq u\leq B(d,k)$ ,

[TABLE]

Hence $\{Y^{(d)}_{ku}\}_{0\leq k\leq\ell,1\leq u\leq B(d,k)}$ is an orthogonal basis that diagonalizes ${\mathbb{K}}$ on $V_{d,\leq\ell}$ . By Assumption 1.(b), we deduce that ${\mathbb{K}}$ is a bijection from $V_{d,\leq\ell}$ to itself for $d$ sufficiently large. In particular, its restricted inverse ${\mathbb{K}}^{-1}|_{V_{d,\leq\ell}}$ is well defined.

Consider $\hat{f}_{{\sf RF}}({\bm{x}};{\bm{\Theta}},{\bm{a}})=\sum_{i=1}^{N}a_{i}\sigma_{d}(\langle{\bm{\theta}}_{i},{\bm{x}}\rangle/\sqrt{d})$ . We can expand the risk achieved at parameter ${\bm{a}}$ as

[TABLE]

Let us define $\alpha({\bm{\theta}})\equiv({\mathbb{K}}^{-1}{\mathbb{T}}f_{d})({\bm{\theta}})$ and choose $a_{i}=N^{-1}\alpha({\bm{\theta}}_{i})$ . We consider the expectation over ${\bm{\Theta}}$ of the RF risk:

[TABLE]

It is easy to check that ${\mathbb{T}}^{*}{\mathbb{K}}^{-1}{\mathbb{T}}|_{V_{d,\leq\ell}}={\mathbf{I}}|_{V_{d,\leq\ell}}$ . Hence

[TABLE]

Recalling the decomposition of $f_{d}$ in terms of spherical harmonics (and noting that we assumed $f_{d}$ is a polynomial of degree at most $\ell$ )

[TABLE]

and using Eqs. (74) and (75), we get

[TABLE]

As a result, we deduce that

[TABLE]

Hence, by Assumption 1.(b), and from the assumption that $N=\omega_{d}(d^{\ell})$ , we deduce that the risk $R_{{\sf RF}}(f_{d},{\bm{\Theta}}/\sqrt{d})/\|f_{d}\|_{L^{2}}^{2}$ converges in $L^{1}$ to $0$ , and therefore in probability.

8 Proof of Theorem 2.(a): NT model lower bound

8.1 Preliminaries

We begin with some notations and simple remarks.

Lemma 5.

Assume $\sigma$ is an activation function with $\sigma(u)^{2}\leq c_{0}\,\exp(c_{1}\,u^{2}/2)$ for some constants $c_{0}>0$ and $c_{1}<1$ . Then

$(a)$

$\mathbb{E}_{G\sim{\sf N}(0,1)}[\sigma(G)^{2}]<\infty$ .

$(b)$

Let $\|{\bm{w}}\|_{2}=1$ . Then there exists $d_{0}=d_{0}(c_{1})$ such that, for ${\bm{x}}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ ,

[TABLE]

$(c)$

Let $\|{\bm{w}}\|_{2}=1$ . Then there exists a coupling of $G\sim{\sf N}(0,1)$ and ${\bm{x}}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ such that

[TABLE]

Proof.

Claim $(a)$ is obvious.

For claim $(b)$ , note that the probability distribution of $\langle{\bm{w}},{\bm{x}}\rangle$ when ${\bm{x}}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ is given by

[TABLE]

A simple calculation shows that $C_{d}\to(2\pi)^{-1/2}$ as $d\to\infty$ , and hence $\sup_{d}C_{d}\leq\overline{C}<\infty$ . Therefore

[TABLE]

where the last inequality holds provided $d\geq d_{0}=10/(1-c_{1})$ .

Finally, for claim $(c)$ , without loss of generality we will take ${\bm{w}}={\bm{e}}_{1}$ , so that $\langle{\bm{w}},{\bm{x}}\rangle=x_{1}$ . By the same argument given above (and since both $G$ and $x_{1}$ have densities bounded uniformly in $d$ ), for any $M>0$ we can choose $\sigma_{M}$ bounded continuous so that for any $d$ ,

[TABLE]

It is therefore sufficient to prove the claim for $\sigma_{M}$ . Letting ${\bm{\xi}}\sim{\sf N}(0,{\mathbf{I}}_{d-1})$ , independent of $G$ , we construct the coupling via

[TABLE]

where we set ${\bm{x}}=(x_{1},{\bm{x}}^{\prime})$ . We thus have $x_{1}\to G$ almost surely, and the claim follows by weak convergence. ∎

We denote the Hermite decomposition of $\sigma$ by

[TABLE]

We state separately the assumptions of Theorem 2.(a) for future reference.

Assumption 4 (Integrability condition).

The activation function $\sigma$ is weakly differentiable with weak derivative $\sigma^{\prime}$ . There exist constants $c_{0}$ , $c_{1}$ , with $c_{0}>0$ and $c_{1}<1$ such that, for all $u\in{\mathbb{R}}$ , $\sigma^{\prime}(u)^{2}\leq c_{0}\,\exp(c_{1}u^{2}/2)$ .

Assumption 5 (Level- $\ell$ non-trivial Hermite components).

Recall that $\mu_{k}(h)\equiv\mathbb{E}_{G\sim{\sf N}(0,1)}[h(G){\rm He}_{k}(G)]$ denotes the $k$ -th coefficient of the Hermite expansion of $h\in L^{2}({\mathbb{R}},\gamma)$ (with $\gamma$ the standard Gaussian measure).

Then there exist $k_{1},k_{2}\geq 2\ell+7$ such that $\mu_{k_{1}}(\sigma^{\prime}),\mu_{k_{2}}(\sigma^{\prime})\neq 0$ and

[TABLE]

It is also useful to notice that the Hermite coefficients of $x^{2}\sigma^{\prime}(x)$ can be computed from the ones of $\sigma^{\prime}(x)$ using the relation $\mu_{k}(x^{2}\sigma^{\prime})=\mu_{k+2}(\sigma^{\prime})+[1+2k]\mu_{k}(\sigma^{\prime})+k(k-1)\mu_{k-2}(\sigma^{\prime})$ .
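The relation for the Hermite coefficients of $x^{2}\sigma^{\prime}(x)$ is equivalent to the polynomial identity $x^{2}{\rm He}_{k}={\rm He}_{k+2}+(2k+1){\rm He}_{k}+k(k-1){\rm He}_{k-2}$ , obtained by applying $x\,{\rm He}_{k}={\rm He}_{k+1}+k\,{\rm He}_{k-1}$ twice. The Python sketch below checks this identity with exact integer arithmetic; it is an illustration with hypothetical function names, and it uses the probabilists' Hermite polynomials built from the recurrence ${\rm He}_{k+1}=x\,{\rm He}_{k}-k\,{\rm He}_{k-1}$ .

```python
def hermite(kmax):
    """Integer coefficient lists (lowest degree first) of He_0, ..., He_kmax."""
    polys = [[1], [0, 1]]
    for k in range(1, kmax):
        shifted = [0] + polys[k]        # x * He_k
        prev = polys[k - 1] + [0, 0]    # He_{k-1}, padded to the same length
        polys.append([a - k * b for a, b in zip(shifted, prev)])
    return polys[: kmax + 1]

def add_scaled(terms, length):
    """Sum of c * p over (c, p) in terms, as a coefficient list of the given length."""
    out = [0] * length
    for c, p in terms:
        for i, a in enumerate(p):
            out[i] += c * a
    return out

def check_identity(k, he):
    """Check x^2 He_k = He_{k+2} + (2k+1) He_k + k(k-1) He_{k-2} coefficient-wise."""
    lhs = [0, 0] + he[k]                # multiplication by x^2 shifts coefficients
    terms = [(1, he[k + 2]), (2 * k + 1, he[k])]
    if k >= 2:
        terms.append((k * (k - 1), he[k - 2]))
    return lhs == add_scaled(terms, k + 3)

if __name__ == "__main__":
    he = hermite(10)
    print(all(check_identity(k, he) for k in range(9)))
```

Taking the expectation of both sides against $\sigma^{\prime}(G)$ turns the polynomial identity into the stated relation between $\mu_{k}(x^{2}\sigma^{\prime})$ and $\mu_{k+2}(\sigma^{\prime})$ , $\mu_{k}(\sigma^{\prime})$ , $\mu_{k-2}(\sigma^{\prime})$ .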

8.2 Proof of Theorem 2.(a): Outline

The proof for the NT model follows the same scheme as for the RF case. However, several steps are technically more challenging. We will follow the same notations introduced in Section 6.1. In particular $\mathbb{E}_{{\bm{x}}},\mathbb{E}_{{\bm{w}}},\mathbb{E}_{{\bm{\theta}}}$ will denote, respectively, expectation with respect to ${\bm{x}}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ , ${\bm{w}}\sim{\sf Unif}(\mathbb{S}^{d-1}(1))$ , ${\bm{\theta}}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ .

We define the random vector ${\bm{V}}=({\bm{V}}_{1},\ldots,{\bm{V}}_{N})^{\mathsf{T}}\in{\mathbb{R}}^{Nd}$ , where, for each $j\leq N$ , ${\bm{V}}_{j}\in{\mathbb{R}}^{d}$ , and analogously ${\bm{V}}_{\leq\ell+1}=({\bm{V}}_{1,\leq\ell+1},\ldots,{\bm{V}}_{N,\leq\ell+1})^{\mathsf{T}}\in{\mathbb{R}}^{Nd}$ , ${\bm{V}}_{>\ell+1}=({\bm{V}}_{1,>\ell+1},\ldots,{\bm{V}}_{N,>\ell+1})^{\mathsf{T}}\in{\mathbb{R}}^{Nd}$ , as follows

[TABLE]

We define the random matrix ${\bm{U}}=({\bm{U}}_{ij})_{i,j\in[N]}\in{\mathbb{R}}^{Nd\times Nd}$ , where, for each $i,j\leq N$ , ${\bm{U}}_{ij}\in{\mathbb{R}}^{d\times d}$ , is given by

[TABLE]

Proceeding as for the RF model, we obtain

[TABLE]

We claim that we have

[TABLE]

This is achieved in the following two propositions.

Proposition 4 (Expected norm of ${\bm{V}}$ ).

Let $\sigma$ be an activation function satisfying Assumption 4. Define

[TABLE]

where expectation is with respect to ${\bm{x}},{\bm{x}}^{\prime}\sim_{i.i.d.}{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ . Then there exists a constant $C$ (depending only on the constants in Assumption 4) such that, for any $\ell\geq 1$ and $d\geq 6$ ,

[TABLE]

Proposition 5 (Lower bound on the kernel matrix).

Let $N=o_{d}(d^{\ell+1})$ for some $\ell\in{\mathbb{Z}}_{>0}$ , and $({\bm{\theta}}_{i})_{i\in[N]}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ independently. Let $\sigma$ be an activation that satisfies Assumption 4 and Assumption 5. Let ${\bm{U}}\in\mathbb{R}^{Nd\times Nd}$ be the kernel matrix with $i,j$ block ${\bm{U}}_{ij}\in{\mathbb{R}}^{d\times d}$ defined by Eq. (86). Then there exists a constant $\varepsilon>0$ that depends on the activation function $\sigma$ , such that

[TABLE]

with high probability as $d\to\infty$ .

These two propositions will be proven in the next sections. Proposition 4 shows that

[TABLE]

Note that $B(d,\ell+2)=\Theta_{d}(d^{\ell+2})$ , and $N=o_{d}(d^{\ell+1})$ . By Markov's inequality, we have Eq. (87). Equation (88) follows directly from Proposition 5. This proves the theorem.
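The growth rate $B(d,k)=\Theta_{d}(d^{k})$ used here is easy to inspect numerically. The sketch below assumes the standard closed form for the dimension of the space of degree-$k$ spherical harmonics on $\mathbb{S}^{d-1}$, namely $B(d,k)=\frac{2k+d-2}{k}\binom{k+d-3}{k-1}$ for $k\geq 1$ and $B(d,0)=1$; the function name `num_harmonics` is ours.

```python
from math import comb, factorial

def num_harmonics(d, k):
    """B(d, k): dimension of the degree-k spherical harmonics on S^{d-1}
    (standard closed form, assumed to match the paper's B(d, k))."""
    if k == 0:
        return 1
    return (2 * k + d - 2) * comb(k + d - 3, k - 1) // k

# For fixed k, B(d, k) * k! / d^k tends to 1 as d grows, i.e. B(d, k) = Theta(d^k).
for d in (10, 100, 1000):
    ratios = [num_harmonics(d, k) * factorial(k) / d ** k for k in range(1, 5)]
    print(d, [round(r, 3) for r in ratios])
```

In particular, for $N=o_{d}(d^{\ell+1})$ the ratio $N/B(d,\ell+2)$ vanishes as $d\to\infty$, which is the scaling that the Markov step relies on.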

8.3 Proof of Proposition 4

We denote the Gegenbauer decomposition of $\sigma^{\prime}(\langle{\bm{e}},\cdot\rangle)$ by

[TABLE]

where

[TABLE]

By Lemma 5, applied to function $\sigma^{\prime}$ (instead of $\sigma$ ), under Assumption 4, we have $\|\sigma^{\prime}(\langle{\bm{e}},\cdot\rangle)\|_{L^{2}}^{2}\leq C$ (for $C$ a constant independent of $d$ ). We therefore have (recalling the normalization of the Gegenbauer polynomials in Eq. (32))

[TABLE]

We define the NT kernel by

[TABLE]

Then

[TABLE]

where in the last step we used Eq. (33). By the recurrence relationship for Gegenbauer polynomials (35), we have

[TABLE]

where

[TABLE]

We use the convention that $t_{d,-1}=0$ . This gives

[TABLE]

Hence we get

[TABLE]

where

[TABLE]

The last inequality follows by Eqs. (89) and (91).

We define

[TABLE]

Using the fact that the kernel $H$ preserves the decomposition (29), we have

[TABLE]

Note by Eq. (90), we have (as always, expectations are with respect to ${\bm{x}},{\bm{x}}^{\prime}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ independently)

[TABLE]

where the fourth equality used the fact that $\mathbb{E}_{{\bm{x}},{\bm{x}}^{\prime}}[Y_{kl}({\bm{x}})Q_{k}(\langle{\bm{x}},{\bm{x}}^{\prime}\rangle)Y_{ks}({\bm{x}}^{\prime})]=\delta_{ls}/B(d,k)$ .

Hence we have

[TABLE]

where we used the fact, given by Lemma 1, that $B(d,k)$ is non-decreasing in $k$ . This concludes the proof.

8.4 Proof of Proposition 5

8.4.1 Auxiliary lemmas

In the proof of this proposition, we will need the following lemmas.

Lemma 6.

Let $\psi:\mathbb{R}\to\mathbb{R}$ be a function such that $\psi(\langle{\bm{e}},\cdot\rangle)\in L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))$ and $\psi(\langle{\bm{e}},\cdot\rangle)\langle{\bm{e}},\cdot\rangle\in L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))$ . Let $\{\lambda_{k,d}(\psi)\}_{k=0}^{\infty}$ be the coefficients of its expansion in terms of the $d$ -th order Gegenbauer polynomials

[TABLE]

Then we can write

[TABLE]

with the new coefficients given by

[TABLE]

Proof.

We recall the following two formulas for $k\geq 1$ (see Section 5.2):

[TABLE]

Furthermore, we have $Q^{(d)}_{0}(x)=1$ , $Q^{(d)}_{1}(x)=x/d$ and therefore $xQ^{(d)}_{0}(x)=dQ^{(d)}_{1}(x)$ . We insert these expressions in the expansion of the function $\psi$

[TABLE]

Matching the coefficients of the expansion yields

[TABLE]

∎

Similarly, we can write the decomposition of $x^{2}\psi(x)$ to be

[TABLE]

where the coefficients are given by the same relation as in the above lemma

[TABLE]
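The three-term recurrence behind Lemma 6 can be sanity-checked in exact arithmetic. The sketch below assumes the standard recurrence for Gegenbauer polynomials normalized by $Q^{(d)}_{k}(d)=1$, written in the rescaled variable $t=x/d$ with $P_{k}(t)=Q^{(d)}_{k}(dt)$: $(k+d-2)P_{k+1}(t)=(2k+d-2)\,t\,P_{k}(t)-kP_{k-1}(t)$. We take this to be the content of the recursion formula of Section 5.2, which is not reproduced here. For odd $d$ the weight $(1-t^{2})^{(d-3)/2}$ is a polynomial, so orthogonality and the normalization $\mathbb{E}[Q_{k}^{2}]=1/B(d,k)$ can be verified exactly. All function names are ours.

```python
from fractions import Fraction
from math import comb

def num_harmonics(d, k):
    """B(d, k), standard closed form (assumed to match the paper's definition)."""
    if k == 0:
        return 1
    return (2 * k + d - 2) * comb(k + d - 3, k - 1) // k

def poly_mul(p, q):
    """Multiply polynomials given as coefficient lists (lowest degree first)."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def rescaled_gegenbauer(d, kmax):
    """P_0..P_kmax with P_k(t) = Q_k^{(d)}(d t), built from the assumed recurrence."""
    polys = [[Fraction(1)], [Fraction(0), Fraction(1)]]
    for k in range(1, kmax):
        nxt = [Fraction(0)] + [(2 * k + d - 2) * c for c in polys[k]]  # (2k+d-2) t P_k
        for i, c in enumerate(polys[k - 1]):
            nxt[i] -= k * c                                            # minus k P_{k-1}
        polys.append([c / (k + d - 2) for c in nxt])
    return polys[: kmax + 1]

def weighted_integral(p, m):
    """Exact integral over [-1, 1] of p(t) (1 - t^2)^m dt for an integer m >= 0."""
    weight = [Fraction(0)] * (2 * m + 1)
    for j in range(m + 1):
        weight[2 * j] = Fraction((-1) ** j * comb(m, j))
    prod = poly_mul(p, weight)
    return sum((2 * c / (n + 1) for n, c in enumerate(prod) if n % 2 == 0), Fraction(0))

def gram(d, kmax):
    """Matrix of E[P_j P_k] under the law of <x, e>/d for x uniform on the sphere."""
    assert d >= 3 and d % 2 == 1, "exact integration here needs odd d"
    m = (d - 3) // 2
    polys = rescaled_gegenbauer(d, kmax)
    total = weighted_integral([Fraction(1)], m)
    return [[weighted_integral(poly_mul(p, q), m) / total for q in polys] for p in polys]

d, kmax = 7, 4
g = gram(d, kmax)
print([g[k][k] for k in range(kmax + 1)])
print([Fraction(1, num_harmonics(d, k)) for k in range(kmax + 1)])
```

The two printed lists coincide and the off-diagonal entries of `g` vanish, which confirms that the recurrence used to shift indices $k\to k\pm 1$ is consistent with the normalization of the $Q^{(d)}_{k}$.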

Lemma 7.

Let ${\bm{u}}:\mathbb{S}^{d-1}(\sqrt{d})\times\mathbb{S}^{d-1}(\sqrt{d})\to\mathbb{R}^{d\times d}$ be a matrix-valued function defined by

[TABLE]

Then there exist functions $u_{1},u_{2},u_{3}:[-1,1]\to\mathbb{R}$ such that

[TABLE]

Proof.

Case 1: ${\bm{\theta}}_{1}\neq{\bm{\theta}}_{2}$ .

We first consider the case ${\bm{\theta}}_{1}\neq{\bm{\theta}}_{2}$ . We will denote $\gamma=\langle{\bm{\theta}}_{1},{\bm{\theta}}_{2}\rangle/d<1$ for convenience. Given any three functions $u_{1},u_{2},u_{3}:(-1,1)\to\mathbb{R}$ , we define

[TABLE]

Let us rotate ${\bm{u}}$ and $\tilde{\bm{u}}$ such that ${\bm{\theta}}_{1}=(\sqrt{d},0,\ldots,0)$ and ${\bm{\theta}}_{2}=(\gamma\sqrt{d},\sqrt{1-\gamma^{2}}\sqrt{d},0,\ldots,0)$ . We can rewrite

[TABLE]

where

[TABLE]

Similarly, we can write

[TABLE]

where

[TABLE]

We check in both cases that:

[TABLE]

We conclude that ${\bm{u}}$ and $\tilde{\bm{u}}$ are equal if and only if

[TABLE]

We can therefore choose for $\gamma<1$

[TABLE]

Case 2: ${\bm{\theta}}_{1}={\bm{\theta}}_{2}$ .

Similarly, for some fixed $\alpha$ and $\beta$ , we define

[TABLE]

We can show that the matrices ${\bm{u}}$ and $\tilde{\bm{u}}$ are equal if and only if

[TABLE]

We can therefore fix $u_{1}(1)=\alpha$ and $u_{2}(1)+u_{3}(1)=\beta/2$ . ∎

Lemma 8.

Let $\sigma$ be an activation function such that $\sigma(u)\leq c_{0}\exp(c_{1}u^{2})$ for some constants $c_{0},c_{1}$ , with $c_{1}<1$ . Let the Hermite and Gegenbauer decompositions of $\sigma$ be

[TABLE]

Then we have for any fixed $k$ ,

[TABLE]

Proof.

Recall the correspondence (43) between Gegenbauer and Hermite polynomials. Note for any monomial $m_{k}(x)=x^{k}$ , by Lemma 5. $(c)$ , we have

[TABLE]

This gives, for any fixed $k$ ,

[TABLE]

This proves the lemma. ∎
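The mechanism behind this type of statement is that the first coordinate of ${\bm{x}}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ has monomial moments converging to the Gaussian ones. As an illustration (not a reproduction of the displays above), the sketch below uses the classical closed form $\mathbb{E}[x_{1}^{2m}]=d^{m}(2m-1)!!/[d(d+2)\cdots(d+2m-2)]$ and compares it with $\mathbb{E}[G^{2m}]=(2m-1)!!$; the function names are ours.

```python
from fractions import Fraction

def double_factorial_odd(m):
    """(2m-1)!! = 1 * 3 * ... * (2m-1), which equals E[G^(2m)] for G ~ N(0,1)."""
    out = 1
    for j in range(1, m + 1):
        out *= 2 * j - 1
    return out

def sphere_moment(d, m):
    """E[x_1^(2m)] for x uniform on the sphere of radius sqrt(d) in R^d."""
    denom = 1
    for j in range(m):
        denom *= d + 2 * j
    return Fraction(d ** m * double_factorial_odd(m), denom)

# Even moments of x_1 approach the Gaussian values 1, 3, 15 as d grows.
for d in (10, 100, 10000):
    print(d, [float(sphere_moment(d, m)) for m in (1, 2, 3)])
```

Odd moments vanish by symmetry in both cases, so convergence of all monomial expectations follows for any fixed degree.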

Lemma 9.

For any fixed $k$ , let $Q_{k}^{(d)}(x)$ be the $k$ -th Gegenbauer polynomial. We expand

[TABLE]

Then we have

[TABLE]

Proof.

Using the correspondence (43) between Gegenbauer and Hermite polynomials we have

[TABLE]

This gives

[TABLE]

This proves the lemma. ∎

Lemma 10.

Let $N=o_{d}(d^{\ell+1})$ for a fixed integer $\ell$ . Let $({\bm{w}}_{i})_{i\in[N]}\sim{\sf Unif}(\mathbb{S}^{d-1})$ independently. Denote a matrix ${\bm{\Delta}}^{(k)}=(\Delta_{ij}^{(k)})_{i,j\in[N]}$ with

[TABLE]

Then as $d\to\infty$ , we have

[TABLE]

Proof.

Let us consider ${\bm{w}}\sim{\sf Unif}(\mathbb{S}^{d-1})$ , and let $w_{1}$ be its first coordinate. Then $w_{1}$ has density $f(x)=(C_{d}\sqrt{d})(1-x^{2})^{(d-3)/2}$ on $[-1,+1]$ , cf. Eq. (78):

[TABLE]

where the last inequality holds for all $d$ large enough, since $C_{d}\to(2\pi)^{-1/2}$ as $d\to\infty$ . Hence, we have:

[TABLE]

Taking $t=O(\log(d)^{1/2}d^{-1/2})$ , we get

[TABLE]

Using the following bound:

[TABLE]

which concludes the proof. ∎
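The scale $\sqrt{(\log d)/d}$ for the largest pairwise overlap of independent uniform unit vectors can be observed in a small simulation. This is only an illustrative sketch: the number of vectors is held fixed at 60 (whereas the lemma lets $N$ grow polynomially in $d$), and the seed and dimensions are arbitrary choices of ours.

```python
import math
import random

def random_unit_vector(d, rng):
    """Uniform point on the unit sphere in R^d, obtained by normalizing a Gaussian."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def max_overlap(num_vectors, d, seed=0):
    """max over i != j of |<w_i, w_j>| for independent uniform unit vectors."""
    rng = random.Random(seed)
    ws = [random_unit_vector(d, rng) for _ in range(num_vectors)]
    best = 0.0
    for i in range(num_vectors):
        for j in range(i):
            dot = abs(sum(a * b for a, b in zip(ws[i], ws[j])))
            best = max(best, dot)
    return best

# The last column rescales the maximum by sqrt(log(d) / d).
for d in (25, 100, 400):
    m = max_overlap(num_vectors=60, d=d)
    print(d, round(m, 3), round(m / math.sqrt(math.log(d) / d), 3))
```

The raw maximum shrinks as $d$ grows, while the rescaled value stays of order one, in line with the tail bound on $w_{1}$ combined with a union bound over pairs.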

8.4.2 Proof of Proposition 5

Step 1. Construction of the activation function $\hat{\sigma}$ .

By Assumption 4 and Lemma 5 (applied to $\sigma^{\prime}$ instead of $\sigma$ ), we have $\sigma^{\prime}(\langle{\bm{e}},\cdot\rangle)\in L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))$ and we consider its expansion in terms of Gegenbauer polynomials (as always, expectation is taken with respect to ${\bm{x}}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ with $x_{1}=\langle{\bm{x}},{\bm{e}}_{1}\rangle$ ):

[TABLE]

Let $k_{2}>k_{1}\geq 2\ell+7$ be two indices that satisfy the conditions of Assumption 5. Using the Gegenbauer coefficients of $\sigma^{\prime}$ , we define $\hat{\sigma}^{\prime}:[-d,d]\to{\mathbb{R}}$ by

[TABLE]

for some $\delta_{1},\delta_{2}$ that we will fix later (with $|\delta_{t}|\leq 1$ ).

Step 2. The functions ${\bm{u}},\hat{\bm{u}}$ and $\bar{\bm{u}}$ .

Let ${\bm{u}}$ and $\hat{\bm{u}}$ be the matrix-valued functions associated respectively to $\sigma^{\prime}$ and $\hat{\sigma}^{\prime}$

[TABLE]

From Lemma 7, there exist functions $u_{1},u_{2},u_{3}$ and $\hat{u}_{1},\hat{u}_{2},\hat{u}_{3}$ , such that

[TABLE]

We define $\bar{\bm{u}}={\bm{u}}-\hat{\bm{u}}$ . Then we can write

[TABLE]

where $\bar{u}_{k}=u_{k}-\hat{u}_{k}$ for $k=1,2,3$ .

Step 3. Construction of the kernel matrices.

Let ${\bm{U}},\hat{\bm{U}},\bar{\bm{U}}\in\mathbb{R}^{Nd\times Nd}$ with $i,j$ -th block (for $i,j\in[N]$ ) given by

[TABLE]

Note that we have ${\bm{U}}=\hat{\bm{U}}+\bar{\bm{U}}$ . By Eqs. (101) and (98), it is easy to see that $\hat{\bm{U}}\succeq 0$ , and hence ${\bm{U}}\succeq\bar{\bm{U}}$ . In the following, we lower bound the matrix $\bar{\bm{U}}$ .

We decompose $\bar{\bm{U}}$ as

[TABLE]

where ${\bm{D}}\in\mathbb{R}^{dN\times dN}$ is a block-diagonal matrix, with

[TABLE]

and ${\bm{\Delta}}\in\mathbb{R}^{dN\times dN}$ is formed by blocks ${\bm{\Delta}}_{ij}\in{\mathbb{R}}^{d\times d}$ for $i,j\in[N]$ , defined by

[TABLE]

In the rest of the proof, we will prove that $\|{\bm{\Delta}}\|_{{\rm op}}=o_{d,\mathbb{P}}(1)$ and for $\varepsilon$ small enough ${\bm{D}}\succeq\varepsilon{\mathbf{I}}_{Nd}$ with high probability.

Step 4. Prove that $\|{\bm{\Delta}}\|_{{\rm op}}=o_{d,\mathbb{P}}(1)$ .

Denoting $\gamma_{ij}=\langle{\bm{\theta}}_{i},{\bm{\theta}}_{j}\rangle/d<1$ , we get, from Eq. (92),

[TABLE]

Using the notations of Lemma 6, we get

[TABLE]

We get similar expressions for $\hat{\bm{U}}_{ij}$ with $\lambda_{k,d}(\sigma^{\prime})$ replaced by $\lambda_{k,d}(\hat{\sigma}^{\prime})$ . Because $\hat{\sigma}^{\prime}$ differs from $\sigma^{\prime}$ only in the $k_{1}$ -th and $k_{2}$ -th coefficients, we get

[TABLE]

Recalling that $\lambda_{k,d}^{(1)}$ depends only on $\lambda_{k-1,d}$ and $\lambda_{k+1,d}$ (Lemma 6), we get

[TABLE]

By Assumption 4 and the convergence in Lemma 8, for any fixed $k$ ,

[TABLE]

Using the expression of $B(d,k)$ we get

[TABLE]

From Lemma 9, we recall that the coefficients of the $k$ -th Gegenbauer polynomial $Q_{k}^{(d)}(x)=\sum_{s=0}^{k}p^{(d)}_{k,s}x^{s}$ satisfy

[TABLE]

Furthermore, we have shown in Lemma 10 that $\max_{i\neq j}|\langle{\bm{\theta}}_{i},{\bm{\theta}}_{j}\rangle|=O_{d,\mathbb{P}}(\sqrt{d\log d})$ . We deduce that

[TABLE]

Plugging the estimates (109), (110) and (112) into Eqs. (107) and (108), we obtain that

[TABLE]

From Eq. (106), using the fact that $\max_{i\neq j}|\gamma_{ij}|=O_{d,\mathbb{P}}(\sqrt{(\log d)/d})$ and Cramer’s rule for matrix inversion, it is easy to see that

[TABLE]

We deduce from (113), (105) and (114) that

[TABLE]

As a result, combining Eq. (115) with Eq. (102) and (99), we get

[TABLE]

By the expression of ${\bm{\Delta}}$ given by (104), we conclude that

[TABLE]

Since $k_{1}\geq 2\ell+7$ , we deduce that $\|{\bm{\Delta}}\|_{{\rm op}}=o_{d,\mathbb{P}}(1)$ .

Step 5. Proving that ${\bm{D}}\succeq\varepsilon{\mathbf{I}}_{Nd}$ .

By Lemma 7, we can express $\bar{\bm{U}}_{ii}$ by

[TABLE]

with $\alpha$ , $\beta$ independent of $i$ , and given by Eq. (93), namely

[TABLE]

(Notice that ${\rm Tr}(\bar{\bm{U}}_{ii})$ and $\langle{\bm{\theta}}_{i},\bar{\bm{U}}_{ii}{\bm{\theta}}_{i}\rangle$ are independent of $i$ by construction, cf. Eqs. (97), (98) and (100), (101).) By the definition of ${\bm{D}}$ given in Eq. (103), we deduce that:

[TABLE]

We claim that, under the assumptions of Proposition 5, and denoting ${\bm{\delta}}=(\delta_{1},\delta_{2})$ (recall that $\delta_{1},\delta_{2}$ were introduced in the definition of $\hat{\sigma}$ in Eq. (96) and have not yet been fixed)

[TABLE]

where $F_{1}({\bm{0}})=F_{2}({\bm{0}})=0$ and $\nabla F_{1}({\bm{0}}),\nabla F_{2}({\bm{0}})\neq{\bm{0}}$ , $\det(\nabla F_{1}({\bm{0}}),\nabla F_{2}({\bm{0}}))\neq 0$ . Before proving this claim, let us show that it allows us to finish the proof of Proposition 5. Since $\det(\nabla F_{1}({\bm{0}}),\nabla F_{2}({\bm{0}}))\neq 0$ , there exists a unit-norm vector ${\bm{v}}$ such that $\langle{\bm{v}},\nabla F_{1}({\bm{0}})\rangle>0$ and $\langle{\bm{v}},\nabla F_{2}({\bm{0}})\rangle>0$ . We now fix $\delta_{1},\delta_{2}$ (introduced in the definition of $\hat{\sigma}$ in Eq. (96)) by setting ${\bm{\delta}}=(\delta_{1},\delta_{2})=\delta_{0}{\bm{v}}$ for some $\delta_{0}>0$ small enough. Since $F_{1}({\bm{0}})=F_{2}({\bm{0}})=0$ , a first-order expansion around ${\bm{0}}$ yields $F_{1}({\bm{\delta}})>0$ , $F_{2}({\bm{\delta}})>0$ . Defining $\varepsilon=\min(F_{1}({\bm{\delta}}),F_{2}({\bm{\delta}}))/2$ , we have

[TABLE]

and therefore, with high probability,

[TABLE]
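The existence of the direction ${\bm{v}}$ used above is constructive: when the two gradients are linearly independent, one can solve $\langle{\bm{v}},\nabla F_{1}({\bm{0}})\rangle=\langle{\bm{v}},\nabla F_{2}({\bm{0}})\rangle=1$ and normalize. The following sketch does this by Cramer's rule; the gradient values are hypothetical placeholders, not the ones arising from $\sigma^{\prime}$.

```python
import math

def positive_direction(g1, g2):
    """Unit vector v in R^2 with <v, g1> > 0 and <v, g2> > 0,
    assuming det(g1, g2) != 0."""
    det = g1[0] * g2[1] - g1[1] * g2[0]
    if det == 0:
        raise ValueError("gradients must be linearly independent")
    # Solve <v, g1> = 1 and <v, g2> = 1 by Cramer's rule, then normalize.
    v = ((g2[1] - g1[1]) / det, (g1[0] - g2[0]) / det)
    norm = math.hypot(v[0], v[1])
    return (v[0] / norm, v[1] / norm)

# Hypothetical gradients that nearly point in opposite directions.
grad_f1 = (1.0, 0.0)
grad_f2 = (-1.0, 0.1)
v = positive_direction(grad_f1, grad_f2)
print(v)
print(v[0] * grad_f1[0] + v[1] * grad_f1[1], v[0] * grad_f2[0] + v[1] * grad_f2[1])
```

Both inner products come out equal and strictly positive, so moving from ${\bm{0}}$ a small distance $\delta_{0}$ along ${\bm{v}}$ increases $F_{1}$ and $F_{2}$ simultaneously, even when the gradients are nearly antiparallel.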

We are left with the task of proving that the limits in Eqs. (118), (119) exist, with the desired properties. Using Eqs. (107) and (108), we get:

[TABLE]

Using Eq. (110), we get that the limits (118), (119) exist. Further, letting $\mu_{k}\equiv\mu_{k}(\sigma^{\prime})$ , we have

[TABLE]

while, for $k_{2}\neq k_{1}+2$

[TABLE]

while, for $k_{2}=k_{1}+2$

[TABLE]

It is easy to check that $F_{1}({\bm{0}})=F_{2}({\bm{0}})=0$ . To compute the gradients, we use the identity $\mu_{k}(x^{2}\sigma^{\prime})=\mu_{k+2}(\sigma^{\prime})+(2k+1)\mu_{k}(\sigma^{\prime})+k(k-1)\mu_{k-2}(\sigma^{\prime})$ , which gives

[TABLE]

Under Assumption 5, we have $\nabla F_{1}({\bm{0}}),\nabla F_{2}({\bm{0}})\neq{\bm{0}}$ and $\det(\nabla F_{1}({\bm{0}}),\nabla F_{2}({\bm{0}}))\neq 0$ completing the proof.

9 Proof of Theorem 2.(b): NT model upper bound

The proof for the NT model follows the same scheme as for the RF case. However, several steps are technically more challenging. We use the notation introduced in Section 6.1. In particular $\mathbb{E}_{{\bm{x}}},\mathbb{E}_{{\bm{w}}},\mathbb{E}_{{\bm{\theta}}}$ will denote, respectively, expectation with respect to ${\bm{x}}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ , ${\bm{w}}\sim{\sf Unif}(\mathbb{S}^{d-1}(1))$ , ${\bm{\theta}}\sim{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ .

Let us assume that $\{f_{d}\}$ are polynomials of degree at most $\ell+1$ , i.e. $f_{d}={\mathsf{P}}_{\leq\ell+1}f_{d}$ .

Denote $\mathcal{L}=L^{2}(\mathbb{S}^{d-1}(\sqrt{d})\rightarrow\mathbb{R})$ and $\mathcal{L}_{d}=L^{2}(\mathbb{S}^{d-1}(\sqrt{d})\rightarrow\mathbb{R}^{d})$ . We introduce the operator ${\mathbb{T}}:{\mathcal{L}}\to{\mathcal{L}}_{d}$ , such that for any $g\in{\mathcal{L}}$ ,

[TABLE]

It is easy to check that the adjoint operator ${\mathbb{T}}^{*}:{\mathcal{L}}_{d}\to{\mathcal{L}}$ verifies for any ${\bm{g}}\in{\mathcal{L}}_{d}$ ,

[TABLE]

We define the operator ${\mathbb{K}}:{\mathcal{L}}_{d}\to{\mathcal{L}}_{d}$ as ${\mathbb{K}}\equiv{\mathbb{T}}{\mathbb{T}}^{*}$ . For ${\bm{g}}\in{\mathcal{L}}_{d}$ , we can write

[TABLE]

where

[TABLE]

Furthermore, we define ${\mathbb{H}}:{\mathcal{L}}\to{\mathcal{L}}$ as ${\mathbb{H}}\equiv{\mathbb{T}}^{*}{\mathbb{T}}$ . For $g\in{\mathcal{L}}$ , we can write

[TABLE]

where

[TABLE]

and $\Gamma_{d,m}$ can be computed using the Gegenbauer recursion formula Eq. (35),

[TABLE]

with

[TABLE]

In particular, it is easy to check that

[TABLE]

We consider the subspace of ${\mathcal{L}}_{d}$ corresponding to ${\mathbb{T}}(V_{d,\leq\ell+1})$ , the image of $V_{d,\leq\ell+1}$ by operator ${\mathbb{T}}$ . One can check that $\{{\mathbb{T}}Y^{(d)}_{ku}\}_{0\leq k\leq\ell+1,1\leq u\leq B(d,k)}$ is an orthogonal basis of this subspace. Furthermore

[TABLE]

Hence this basis diagonalizes ${\mathbb{K}}$ . By Eq. (44), we have

[TABLE]

By Assumption 2.(b), we have $\Gamma_{d,k}\neq 0$ for any $k\leq\ell+1$ when $d$ is sufficiently large. Hence, the restricted inverse ${\mathbb{K}}^{-1}|_{{\mathbb{T}}(V_{d,\leq\ell+1})}$ is well defined for $d$ sufficiently large.

Consider $\hat{f}_{\sf NT}({\bm{x}};{\bm{\Theta}},{\bm{a}})=\sum_{i=1}^{N}\langle{\bm{a}}_{i},{\bm{x}}\rangle\sigma^{\prime}(\langle{\bm{\theta}}_{i},{\bm{x}}\rangle)$ . We can expand the risk at parameter ${\bm{a}}$ as

[TABLE]

Let us define ${\bm{\alpha}}({\bm{\theta}})\equiv{\mathbb{K}}^{-1}{\mathbb{T}}f_{d}({\bm{\theta}})$ and choose ${\bm{a}}_{i}=N^{-1}{\bm{\alpha}}({\bm{\theta}}_{i})$ . We consider the expectation over ${\bm{\Theta}}$ of the NT risk:

[TABLE]

It is easy to check that ${\mathbb{T}}^{*}{\mathbb{K}}^{-1}{\mathbb{T}}|_{V_{d,\leq\ell+1}}={\mathbf{I}}_{V_{d,\leq\ell+1}}$ . By Lemma 7, we have ${\bm{K}}({\bm{\theta}},{\bm{\theta}})=\alpha_{d}{\mathbf{I}}+\beta_{d}{\bm{\theta}}{\bm{\theta}}^{\mathsf{T}}$ with

[TABLE]

From Assumption 2.(a) and Lemma 5.(b) applied to $\sigma^{\prime}(\langle{\bm{e}},\cdot\rangle)$ and $\langle{\bm{e}},\cdot\rangle\sigma^{\prime}(\langle{\bm{e}},\cdot\rangle)$ , we get $\alpha_{d}=O_{d}(1)$ and $\beta_{d}=O_{d}(d^{-1})$ . We deduce that the operator norm verifies $\|{\bm{K}}({\bm{\theta}},{\bm{\theta}})\|_{{\rm op}}=\alpha_{d}+\beta_{d}\|{\bm{\theta}}\|_{2}^{2}=O_{d}(1)$ .
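The operator-norm formula for a matrix of the form $\alpha{\mathbf{I}}+\beta{\bm{\theta}}{\bm{\theta}}^{\mathsf{T}}$ follows from its spectrum: ${\bm{\theta}}$ is an eigenvector with eigenvalue $\alpha+\beta\|{\bm{\theta}}\|_{2}^{2}$ and every vector orthogonal to ${\bm{\theta}}$ has eigenvalue $\alpha$. The sketch below checks this by power iteration, under the assumption $\alpha,\beta\geq 0$ (so that the top eigenvalue is the operator norm); the numerical values of `alpha` and `beta` are illustrative, chosen with $\beta$ of order $1/d$.

```python
import math
import random

def apply_kernel(alpha, beta, theta, v):
    """Compute (alpha * I + beta * theta theta^T) v without forming the matrix."""
    overlap = sum(t * x for t, x in zip(theta, v))
    return [alpha * x + beta * overlap * t for t, x in zip(theta, v)]

def top_eigenvalue(alpha, beta, theta, iters=200, seed=0):
    """Power-iteration estimate of the largest eigenvalue (assumes alpha, beta >= 0)."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in theta]
    lam = 0.0
    for _ in range(iters):
        w = apply_kernel(alpha, beta, theta, v)
        lam = math.sqrt(sum(x * x for x in w))
        v = [x / lam for x in w]
    return lam

d = 20
theta = [1.0] * d                 # the all-ones vector has norm sqrt(d)
alpha, beta = 0.7, 0.5 / d        # illustrative values with beta = O(1/d)
estimate = top_eigenvalue(alpha, beta, theta)
print(estimate, alpha + beta * sum(t * t for t in theta))
```

With $\beta_{d}=O_{d}(d^{-1})$ and $\|{\bm{\theta}}\|_{2}^{2}=d$, the rank-one term contributes $O_{d}(1)$, so the whole norm stays bounded as $d$ grows.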

Hence, there exists a constant $C>0$ such that

[TABLE]

Using the decomposition of $f_{d}$ in terms of harmonic polynomials (note we assumed $f_{d}$ is a degree $\ell+1$ polynomial) and Eq. (127), we have

[TABLE]

By Eq. (128), for any fixed $k\leq\ell+1$ , we have $\Gamma_{d,k}=\Omega_{d}(d)$ . Hence we get

[TABLE]

Hence, from the assumption that $N=\omega_{d}(d^{\ell})$ , we deduce that $R_{{\sf NT}}(f_{d},{\bm{\Theta}}/\sqrt{d})/\|f_{d}\|_{L^{2}}^{2}$ converges in $L^{1}$ to [math], and therefore in probability.

10 Proof of Theorem 4: risk for KR

10.1 Proof of Theorem 4

Step 1. Rewrite the ${\bm{y}}$ , ${\bm{E}}$ , ${\bm{H}}$ , ${\bm{M}}$ matrices.

The test error of empirical kernel ridge regression is given by

[TABLE]

where ${\bm{E}}=(E_{1},\ldots,E_{n})^{\mathsf{T}}$ , ${\bm{M}}=(M_{ij})_{ij\in[n]}$ and ${\bm{H}}=(H_{ij})_{ij\in[n]}$ with

[TABLE]

Let $B=\sum_{k=0}^{\ell}B(d,k)$ . Define

[TABLE]

Let the spherical harmonics decomposition of $f_{d}$ be

[TABLE]

and the Gegenbauer decomposition of $h_{d}$ be

[TABLE]

We decompose the vectors and matrices ${\bm{f}}$ , ${\bm{E}}$ , ${\bm{H}}$ , and ${\bm{M}}$ in terms of spherical harmonics

[TABLE]

By Proposition 3 and Eq. (56), the kernel ${\bm{H}}$ and ${\bm{M}}$ can be rewritten as

[TABLE]

where

[TABLE]

and

[TABLE]

Step 2. Decompose the risk

Recalling ${\bm{y}}={\bm{f}}+{\bm{\varepsilon}}$ , we decompose the risk as follows

[TABLE]

where

[TABLE]

Further, we denote ${\bm{f}}_{\leq\ell}$ , ${\bm{f}}_{>\ell}$ , ${\bm{E}}_{\leq\ell}$ , and ${\bm{E}}_{>\ell}$ ,

[TABLE]

Step 3. Term $T_{2}$

Note we have

[TABLE]

where

[TABLE]

By Lemma 13, we have

[TABLE]

hence

[TABLE]

By Lemma 11, we have (with $\|{\bm{\Delta}}\|_{2}=o_{d,\mathbb{P}}(1)$ )

[TABLE]

Moreover, we have

[TABLE]

As a result, we have

[TABLE]

By Eq. (129) again, we have

[TABLE]

By Lemma 12, we have

[TABLE]

Moreover

[TABLE]

This gives

[TABLE]

Using the Cauchy-Schwarz inequality for $T_{22}$ , we get

[TABLE]

As a result, combining Eqs. (130), (132) and (131), we have

[TABLE]

Step 4. Term $T_{1}$ .

Note we have

[TABLE]

where

[TABLE]

By Lemma 14, we have

[TABLE]

so that

[TABLE]

Using the Cauchy-Schwarz inequality for $T_{12}$ , and by the expression of ${\bm{M}}={\bm{Y}}_{\leq\ell}{\bm{D}}_{\leq\ell}^{2}{\bm{Y}}_{\leq\ell}^{\mathsf{T}}+\kappa_{u}({\mathbf{I}}_{n}+{\bm{\Delta}}_{u})$ with $\|{\bm{\Delta}}_{u}\|_{{\rm op}}=o_{d,\mathbb{P}}(1)$ , we get with high probability

[TABLE]

For term $T_{13}$ , we have

[TABLE]

Note we have $\mathbb{E}[\|{\bm{f}}\|_{2}^{2}]=n\|f_{d}\|_{L^{2}}^{2}$ , and $\|({\bm{H}}+\lambda{\mathbf{I}}_{n})^{-1}\|_{{\rm op}}\leq 2/(\kappa_{h}+\lambda)$ with high probability, and

[TABLE]

As a result, we have

[TABLE]

where the last equality used the fact that $\omega_{d}(d^{\ell}\log d)\leq n\leq O_{d}(d^{\ell+1-\delta})$ and Assumption 3. Combining Eqs. (134), (135) and (136), we get

[TABLE]

Step 5. Terms $T_{3},T_{4}$ and $T_{5}$ .

By Lemma 13 again, we have

[TABLE]

By Lemma 11, we have

[TABLE]

This gives

[TABLE]

Let us consider $T_{4}$ term:

[TABLE]

Notice that by Lemma 11, Lemma 13 and the definition of ${\bm{M}}$ , for any integer $L$ :

[TABLE]

Hence,

[TABLE]

which gives

[TABLE]

We decompose $T_{5}$ using ${\bm{f}}={\bm{f}}_{\leq\ell}+{\bm{f}}_{>\ell}$ ,

[TABLE]

where

[TABLE]

First notice that

[TABLE]

Then by Lemma 13, we get

[TABLE]

Similarly, we get

[TABLE]

By Markov’s inequality, we deduce that

[TABLE]

Step 6. Finish the proof.

Combining Eqs. (137), (133), (138), (139) and (140), we have

[TABLE]

which concludes the proof.

10.2 Auxiliary results

Lemma 11.

Let $\{Y_{kl}\}_{k\in\mathbb{N},l\in[B(d,k)]}$ be the collection of spherical harmonics on $L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))$ . Let $({\bm{x}}_{i})_{i\in[n]}\sim_{iid}{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ . Denote

[TABLE]

Denote $B=\sum_{k=0}^{\ell}B(d,k)$ , and

[TABLE]

Then as long as $n/(B\log B)\to\infty$ as $d\to\infty$ , we have

[TABLE]

with ${\bm{\Delta}}\in\mathbb{R}^{B\times B}$ and $\mathbb{E}[\|{\bm{\Delta}}\|_{{\rm op}}]=o_{d}(1)$ .

Proof of Lemma 11.

Let ${\bm{\Psi}}={\bm{Y}}^{\mathsf{T}}{\bm{Y}}/n\in\mathbb{R}^{B\times B}$ . We can rewrite ${\bm{\Psi}}$ as

[TABLE]

where

[TABLE]

We use the matrix Bernstein inequality. Denote ${\bm{X}}_{i}={\bm{h}}_{i}{\bm{h}}_{i}^{\mathsf{T}}-{\mathbf{I}}_{B}\in\mathbb{R}^{B\times B}$ . Then we have $\mathbb{E}[{\bm{X}}_{i}]={\bm{0}}$ , and

[TABLE]

where we use formula (34) and the normalization $Q_{k}(d)=1$ . Denote $V=\|\sum_{i=1}^{n}\mathbb{E}[{\bm{X}}_{i}^{2}]\|_{{\rm op}}$ . Then we have

[TABLE]

where we used ${\bm{h}}_{i}^{\mathsf{T}}{\bm{h}}_{i}=\|{\bm{h}}_{i}\|_{2}^{2}=B$ and $\mathbb{E}[{\bm{h}}_{i}({\bm{x}}_{i}){\bm{h}}_{i}^{\mathsf{T}}({\bm{x}}_{i})]=(\mathbb{E}[Y_{kl}({\bm{x}}_{i})Y_{rs}({\bm{x}}_{i})])_{kl,rs}={\mathbf{I}}_{B}$ . As a result, we have for any $t>0$ ,

[TABLE]

Integrating the tail bound proves the lemma. ∎
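The concentration of ${\bm{Y}}^{\mathsf{T}}{\bm{Y}}/n$ around ${\mathbf{I}}_{B}$ can be observed in the simplest case $\ell=1$, where an orthonormal basis of harmonics of degree at most one on $\mathbb{S}^{d-1}(\sqrt{d})$ is given by the constant $1$ and the coordinates $x_{1},\ldots,x_{d}$, so that $B=d+1$ and $\|{\bm{h}}_{i}\|_{2}^{2}=B$. The sketch below is a simulation with arbitrary sizes and seed chosen by us; it reports the Frobenius norm of the deviation, which upper bounds the operator norm.

```python
import math
import random

def sample_sphere(d, rng):
    """Uniform point on the sphere of radius sqrt(d) in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    scale = math.sqrt(d) / math.sqrt(sum(v * v for v in g))
    return [scale * v for v in g]

def gram_deviation(n, d, seed=0):
    """Frobenius norm of Y^T Y / n - I_B for harmonics of degree <= 1 (B = d + 1)."""
    rng = random.Random(seed)
    size = d + 1
    psi = [[0.0] * size for _ in range(size)]
    for _ in range(n):
        h = [1.0] + sample_sphere(d, rng)   # h(x) = (Y_0(x), Y_11(x), ..., Y_1d(x))
        for a in range(size):
            for b in range(size):
                psi[a][b] += h[a] * h[b] / n
    return math.sqrt(sum((psi[a][b] - (1.0 if a == b else 0.0)) ** 2
                         for a in range(size) for b in range(size)))

# The deviation decays roughly like 1 / sqrt(n) for fixed B.
for n in (100, 1000, 10000):
    print(n, round(gram_deviation(n, d=5), 4))
```

For fixed $B$ the deviation decays at the usual $n^{-1/2}$ rate; the lemma quantifies how large $n$ must be relative to $B\log B$ when $B$ grows with $d$.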

Lemma 12.

Let $\{Y_{kl}\}_{k\in\mathbb{N},l\in[B(d,k)]}$ be the collection of spherical harmonics on $L^{2}(\mathbb{S}^{d-1}(\sqrt{d}))$ . Let $({\bm{x}}_{i})_{i\in[n]}\sim_{iid}{\sf Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ . Denote

[TABLE]

Then for $u,s,v\in\mathbb{N}$ and $u\neq v$ , we have

[TABLE]

For $u,s\in\mathbb{N}$ , we have

[TABLE]

Proof of Lemma 12.

We have

[TABLE]

This proves the lemma. ∎

Lemma 13.

Let $\{h_{d}\}_{d\geq 1}$ be a sequence of functions satisfying Assumption 3. Let $\omega_{d}(d^{\ell}\log d)\leq n\leq O_{d}(d^{\ell+1-\delta})$ . We have

[TABLE]

Proof of Lemma 13.

Denote

[TABLE]

Denote $B=\sum_{k\leq\ell}B(d,k)$ , and

[TABLE]

and

[TABLE]

Then we have

[TABLE]

where $\|{\bm{\Delta}}_{u}\|_{{\rm op}},\|{\bm{\Delta}}_{h}\|_{{\rm op}}=o_{d,\mathbb{P}}(1)$ , and

[TABLE]

For $T_{1}$ , we have with high probability (note $n=O_{d}(d^{\ell+1-\delta})$ )

[TABLE]

To bound $T_{2}$ , let ${\bm{Y}}=\sqrt{n}{\bm{O}}{\bm{S}}{\bm{V}}^{{\mathsf{T}}}$ where ${\bm{O}}\in\mathbb{R}^{n\times n}$ and ${\bm{V}}\in\mathbb{R}^{B\times B}$ are orthogonal matrices, and ${\bm{S}}=[{\bm{S}}_{\star};{\bm{0}}]\equiv[{\mathbf{I}}_{B}+{\bm{\Delta}}_{s};{\bm{0}}]\in\mathbb{R}^{n\times B}$ . By Lemma 11 and the fact that $n=\omega_{d}(d^{\ell}\log d)$ , we have $\|{\bm{\Delta}}_{s}\|_{{\rm op}}=o_{d,\mathbb{P}}(1)$ . Then we have

[TABLE]

where ${\bm{\Delta}}_{0}={\bm{O}}^{{\mathsf{T}}}{\bm{\Delta}}_{h}{\bm{O}}$ and $\|{\bm{\Delta}}_{0}\|_{{\rm op}}=o_{d,\mathbb{P}}(1)$ .

For a symmetric matrix ${\bm{S}}_{0}\in\mathbb{R}^{B\times B}$ and a symmetric matrix ${\bm{A}}=[{\bm{A}}_{11},{\bm{A}}_{12};{\bm{A}}_{21},{\bm{A}}_{22}]\in\mathbb{R}^{n\times n}$ , we have

[TABLE]

where

[TABLE]

Taking

[TABLE]

with $\|{\bm{\Delta}}_{0,11}\|_{{\rm op}},\|{\bm{\Delta}}_{0,12}\|_{{\rm op}},\|{\bm{\Delta}}_{0,22}\|_{{\rm op}}=o_{d,\mathbb{P}}(1)$ . This gives

[TABLE]

with ${\bm{\Delta}}_{1}=-({\mathbf{I}}_{n-B}+[\kappa_{h}/(\kappa_{h}+\lambda)]{\bm{\Delta}}_{0,22})^{-1}[\kappa_{h}/(\kappa_{h}+\lambda)]{\bm{\Delta}}_{0,12}$ and $\|{\bm{\Delta}}_{1}\|_{{\rm op}}=o_{d,\mathbb{P}}(1)$ .

Now we look at ${\bm{B}}_{11}{\bm{S}}_{0}{\bm{B}}_{11}$ . We have

[TABLE]

where

[TABLE]

and

[TABLE]

and $\|{\bm{\Delta}}_{2}\|_{{\rm op}},\|{\bm{\Delta}}_{3}\|_{{\rm op}}=o_{d,\mathbb{P}}(1)$ . Define

[TABLE]

We have

[TABLE]

Note we have

[TABLE]

and by Assumption 3 and $n=\omega_{d}(d^{\ell}\log d)$ we have $\lambda_{\min}({\bm{D}}_{1})=\omega_{d}(1)$ . Along with the fact that $\|{\bm{S}}_{\star}-{\mathbf{I}}_{B}\|_{{\rm op}}=o_{d,\mathbb{P}}(1)$ , we have

[TABLE]

As a result, we have

[TABLE]

with $\|{\bm{\Delta}}_{t}\|_{{\rm op}}=o_{d,\mathbb{P}}(1)$ . Finally, we have

[TABLE]

with $\|{\bm{\Delta}}_{y}\|_{{\rm op}}=o_{d,\mathbb{P}}(1)$ . This proves the lemma. ∎

Lemma 14.

Let $\{h_{d}\}_{d\geq 1}$ be a sequence of functions satisfying Assumption 3. Let $\omega_{d}(d^{\ell}\log d)\leq n\leq O_{d}(d^{\ell+1-\delta})$ . We have

[TABLE]

Proof of Lemma 14.

By Proposition 3, we have ${\bm{H}}+\lambda{\mathbf{I}}_{n}={\bm{Y}}{\bm{D}}{\bm{Y}}^{\mathsf{T}}+(\kappa_{h}+\lambda){\mathbf{I}}_{n}+\kappa_{h}{\bm{\Delta}}_{h}$ with $\|{\bm{\Delta}}_{h}\|_{{\rm op}}=o_{d,\mathbb{P}}(1)$ . Denote the singular value decomposition ${\bm{Y}}=\sqrt{n}{\bm{O}}{\bm{S}}{\bm{V}}^{\mathsf{T}}$ , where ${\bm{O}}\in\mathbb{R}^{n\times n}$ and ${\bm{V}}\in\mathbb{R}^{B\times B}$ are orthogonal matrices, and ${\bm{S}}=[{\bm{S}}_{\star};{\bm{0}}]=[{\mathbf{I}}_{B}+{\bm{\Delta}}_{s};{\bm{0}}]\in\mathbb{R}^{n\times B}$ , with $\|{\bm{\Delta}}_{s}\|_{{\rm op}}=o_{d,\mathbb{P}}(1)$ (Lemma 11). Then we have

[TABLE]

where $\|{\bm{\Delta}}_{i}\|_{{\rm op}}=o_{d,\mathbb{P}}(1)$ for $i\in\{1,2\}$ and $\|{\bm{\Lambda}}_{i}-{\mathbf{I}}_{B}\|_{{\rm op}}=o_{d,\mathbb{P}}(1)$ for $i\in\{3,4,5\}$ , and

[TABLE]

and

[TABLE]

By Assumption 3, we have $\lambda_{\min}({\bm{D}}_{1})=\omega_{d}(1)$ . This proves the lemma. ∎

Acknowledgements

This work was partially supported by grants NSF DMS-1613091, CCF-1714305, IIS-1741162, and ONR N00014-18-1-2729, NSF DMS-1418362, NSF DMS-1407813.

Appendix A Numerical results with ridge regression

The reader might wonder whether the numerical results presented in Section 1.3 might change significantly if we changed the method to estimate the coefficients ${\bm{a}}=(a_{i})_{i\leq N}\in{\mathbb{R}}^{N}$ (for the model RF) or ${\bm{a}}=({\bm{a}}_{i})_{i\leq N}\in{\mathbb{R}}^{Nd}$ (for the model NT). Our main results, Theorem 1 and Theorem 2.(a), predict that the result should not change qualitatively: these models are limited because they cannot approximate the target function $f_{\star}$ (unless this is a low degree polynomial), regardless of the choice of the representative $f\in{\mathcal{F}}_{{\sf RF}}$ or $f\in{\mathcal{F}}_{{\sf NT}}$ .

In order to verify this prediction numerically, we repeated the experiments of Section 1.3 using ridge regression. We form a matrix ${\bm{Z}}\in{\mathbb{R}}^{n\times p}$ containing the $p$ covariates (with $p=N$ for RF, and $p=Nd$ for NT), with $Z_{ij}=\sigma(\langle{\bm{w}}_{j},{\bm{x}}_{i}\rangle)$ for RF, and $Z_{i,(j_{1}j_{2})}=({\bm{x}}_{i})_{j_{2}}\sigma^{\prime}(\langle{\bm{w}}_{j_{1}},{\bm{x}}_{i}\rangle)$ for NT. Letting $y_{i}=f_{\star}({\bm{x}}_{i})$ , we estimate the coefficients ${\bm{a}}$ via

[TABLE]

The results are reported in Figures 6, 7, 8, and are consistent with the ones of Section 1.3. Regularization does not help: it only reduces the peak at $n\approx p$ , as expected from [HMRT19], but not the large $n$ behavior.
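For concreteness, the ridge procedure for the RF model can be sketched as follows. This is a toy illustration rather than the code behind the figures: the penalty normalization $\frac{1}{n}\|{\bm{y}}-{\bm{Z}}{\bm{a}}\|_{2}^{2}+\lambda\|{\bm{a}}\|_{2}^{2}$ is an assumption (the exact display is not reproduced here), the activation is taken to be ReLU, the target is a realizable function $y={\bm{Z}}{\bm{a}}_{\rm true}$ chosen only to validate the solver, and the sizes are far smaller than in the experiments.

```python
import math
import random

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ridge(z, y, lam):
    """Minimize (1/n) ||y - Z a||^2 + lam ||a||^2 (assumed normalization)."""
    n, p = len(z), len(z[0])
    gram = [[sum(z[i][a] * z[i][b] for i in range(n)) / n + (lam if a == b else 0.0)
             for b in range(p)] for a in range(p)]
    rhs = [sum(z[i][a] * y[i] for i in range(n)) / n for a in range(p)]
    return solve_linear(gram, rhs)

def rf_features(xs, ws, sigma):
    """Z_ij = sigma(<w_j, x_i>) for the RF model."""
    return [[sigma(sum(a * b for a, b in zip(w, x))) for w in ws] for x in xs]

def sample_sphere(d, radius, rng):
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    scale = radius / math.sqrt(sum(v * v for v in g))
    return [scale * v for v in g]

rng = random.Random(0)
d, num_neurons, n = 5, 6, 200
xs = [sample_sphere(d, math.sqrt(d), rng) for _ in range(n)]
ws = [sample_sphere(d, 1.0, rng) for _ in range(num_neurons)]
z = rf_features(xs, ws, lambda u: max(u, 0.0))        # ReLU features (assumption)
a_true = [rng.uniform(-1.0, 1.0) for _ in range(num_neurons)]
y = [sum(zi[j] * a_true[j] for j in range(num_neurons)) for zi in z]
a_hat = ridge(z, y, lam=1e-10)
print(max(abs(u - v) for u, v in zip(a_hat, a_true)))
```

With a vanishing penalty and a realizable target the solver recovers `a_true` up to numerical error. For the NT model the same `ridge` routine applies once the $n\times Nd$ feature matrix is formed.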

(Note that for RF we do not report results for $d=100$ , in Fig. 6. As in Fig. 1, the resulting risk is slightly below the baseline $R_{0}$ : this effect vanishes for $d\gtrsim 100$ .)

Bibliography

[AB09] Martin Anthony and Peter L. Bartlett, Neural network learning: Theoretical foundations, Cambridge University Press, 2009.
[ADH+19] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang, Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks, arXiv:1901.08584 (2019).
[AM15] Ahmed El Alaoui and Michael W. Mahoney, Fast randomized kernel ridge regression with statistical guarantees, Advances in Neural Information Processing Systems, 2015, pp. 775–783.
[AZLS18] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song, A convergence theory for deep learning via over-parameterization, arXiv:1811.03962 (2018).
[Bac13] Francis Bach, Sharp analysis of low-rank kernel matrix approximations, Conference on Learning Theory, 2013, pp. 185–209.
[Bac17a] Francis Bach, Breaking the curse of dimensionality with convex neural networks, The Journal of Machine Learning Research 18 (2017), no. 1, 629–681.
[Bac17b] Francis Bach, On the equivalence between kernel quadrature rules and random feature expansions, The Journal of Machine Learning Research 18 (2017), no. 1, 714–751.
[Bar93] Andrew R. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Transactions on Information Theory 39 (1993), no. 3, 930–945.
