# Linearized two-layers neural networks in high dimension

**Authors:** Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari

arXiv: 1904.12191 · 2020-02-18

## TL;DR

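This paper characterizes what two linearizations of two-layer neural networks, the random features (RF) and neural tangent (NT) models, can learn in high dimension. With infinite data and a number of neurons $N$ between roughly $d^{\ell}$ and $d^{\ell+1}$, RF effectively fits a degree-$\ell$ polynomial and NT a degree-$(\ell+1)$ polynomial. With infinite width and a number of samples $n$ in the same range, both reduce to kernel methods, which fit at most a degree-$\ell$ polynomial.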

## Contribution

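The paper rigorously characterizes which polynomial degree the random features and neural tangent models can fit in high dimension, as a function of the number of neurons $N$ (with infinite samples) or the number of samples $n$ (with infinite width).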

## Key findings

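- RF effectively fits a degree-ℓ polynomial in the raw features in the approximation-limited regime ($n=\infty$, $d^{\ell+\delta} \le N \le d^{\ell+1-\delta}$)
- NT fits a degree-(ℓ+1) polynomial in the same regime
- Kernel methods fit at most a degree-ℓ polynomial in the sample size-limited regime ($N=\infty$, $d^{\ell+\delta} \le n \le d^{\ell+1-\delta}$); kernel ridge regression achieves this bound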
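As a rough way to read these regimes, the sketch below maps a model size to the degree ℓ. It assumes the condition $d^{\ell+\delta} \le N \le d^{\ell+1-\delta}$ stated in the abstract (with $n$ in place of $N$ for the sample size-limited regime). The function name `fitted_degree` and the default `delta` are hypothetical, and because the results are asymptotic in $d$, this is only a heuristic at finite values.

```python
import math

def fitted_degree(size, d, delta=0.1):
    """Return ell with d**(ell + delta) <= size <= d**(ell + 1 - delta), else None.

    `size` is the number of neurons N (approximation-limited regime) or the
    number of samples n (sample size-limited regime). Hypothetical helper.
    """
    exponent = math.log(size) / math.log(d)   # size = d**exponent
    ell = math.floor(exponent)
    frac = exponent - ell
    if delta <= frac <= 1 - delta:
        return ell
    # size is too close to a power of d: the stated condition does not apply
    return None

d, N = 100, 31623          # N is about d**2.25
ell = fitted_degree(N, d)
rf_degree = ell            # RF: degree ell
nt_degree = ell + 1        # NT: degree ell + 1
```

Sizes that fall within a factor $d^{\delta}$ of an exact power of $d$ return `None`, mirroring the gap that the condition leaves between consecutive degrees.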

## Abstract

We consider the problem of learning an unknown function $f_{\star}$ on the $d$-dimensional sphere with respect to the square loss, given i.i.d. samples $\{(y_i,{\boldsymbol x}_i)\}_{i\le n}$ where ${\boldsymbol x}_i$ is a feature vector uniformly distributed on the sphere and $y_i=f_{\star}({\boldsymbol x}_i)+\varepsilon_i$. We study two popular classes of models that can be regarded as linearizations of two-layers neural networks around a random initialization: the random features model of Rahimi-Recht (RF) and the neural tangent kernel model of Jacot-Gabriel-Hongler (NT). Both these approaches can also be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and enjoy universal approximation properties when the number of neurons $N$ diverges, for a fixed dimension $d$. We consider two specific regimes: the approximation-limited regime, in which $n=\infty$ while $d$ and $N$ are large but finite; and the sample size-limited regime in which $N=\infty$ while $d$ and $n$ are large but finite. In the first regime we prove that if $d^{\ell + \delta} \le N\le d^{\ell+1-\delta}$ for small $\delta > 0$, then RF effectively fits a degree-$\ell$ polynomial in the raw features, and NT fits a degree-$(\ell+1)$ polynomial. In the second regime, both RF and NT reduce to kernel methods with rotationally invariant kernels. We prove that, if the number of samples is $d^{\ell + \delta} \le n \le d^{\ell +1-\delta}$, then kernel methods can fit at most a degree-$\ell$ polynomial in the raw features. This lower bound is achieved by kernel ridge regression. Optimal prediction error is achieved for vanishing ridge regularization.
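To make the two model classes concrete, here is a minimal NumPy sketch of RF and NT feature maps fitted by ridge regression. Several choices are assumptions that the abstract does not fix: inputs on the sphere of radius $\sqrt{d}$, first-layer weights on the unit sphere, a ReLU activation, and the toy linear target. All function names are hypothetical.

```python
import numpy as np

def sample_sphere(num, d, radius, rng):
    """Draw `num` points uniformly from the sphere of the given radius in R^d."""
    g = rng.standard_normal((num, d))
    return radius * g / np.linalg.norm(g, axis=1, keepdims=True)

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return (z > 0).astype(float)

def rf_features(X, W):
    """RF: one feature sigma(<w_j, x>) per neuron, shape (n, N)."""
    return relu(X @ W.T)

def nt_features(X, W):
    """NT: d features x * sigma'(<w_j, x>) per neuron, shape (n, N * d)."""
    gates = relu_grad(X @ W.T)                  # (n, N)
    feats = gates[:, :, None] * X[:, None, :]   # (n, N, d)
    return feats.reshape(X.shape[0], -1)

def ridge_fit(Phi, y, lam):
    """Ridge coefficients: solve (Phi^T Phi + lam * I) a = Phi^T y."""
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)

rng = np.random.default_rng(0)
d, N, n = 5, 8, 200
X = sample_sphere(n, d, np.sqrt(d), rng)   # inputs (assumed radius sqrt(d))
W = sample_sphere(N, d, 1.0, rng)          # random first-layer weights, kept fixed
y = X[:, 0]                                # toy linear target (assumption)

Phi_rf = rf_features(X, W)
Phi_nt = nt_features(X, W)
coef_rf = ridge_fit(Phi_rf, y, lam=1e-3)
coef_nt = ridge_fit(Phi_nt, y, lam=1e-3)
```

Both models are linear in their trainable coefficients, which is why each reduces to a ridge regression on a fixed feature matrix. NT has $Nd$ parameters against $N$ for RF.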

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.12191/full.md

## Figures

24 figures with captions in the complete paper: https://tomesphere.com/paper/1904.12191/full.md

## References

55 references — full list in the complete paper: https://tomesphere.com/paper/1904.12191/full.md

---
Source: https://tomesphere.com/paper/1904.12191