A Comparative Analysis of the Optimization and Generalization Property   of Two-layer Neural Network and Random Feature Models Under Gradient Descent   Dynamics

Weinan E; Chao Ma; Lei Wu

arXiv:1904.04326·cs.LG·February 27, 2020

TL;DR

This paper analyzes the training dynamics and generalization properties of two-layer neural networks and random feature models under gradient descent, revealing exponential convergence and kernel-like behavior.

Contribution

It provides a comprehensive theoretical analysis of gradient descent dynamics for two-layer neural networks, including convergence rates and generalization error estimates.

Findings

01

Gradient descent achieves exponential convergence to zero training loss.

02

Neural network functions remain close to kernel methods during training.

03

Sharp bounds on generalization error are established for various network widths and data sizes.

Abstract

A fairly comprehensive analysis is presented for the gradient descent dynamics for training two-layer neural network models in the situation when the parameters in both layers are updated. General initialization schemes as well as general regimes for the network width and training data size are considered. In the over-parametrized regime, it is shown that gradient descent dynamics can achieve zero training loss exponentially fast regardless of the quality of the labels. In addition, it is proved that throughout the training process the functions represented by the neural network model are uniformly close to those of a kernel method. For general values of the network width and training data size, sharp estimates of the generalization error are established for target functions in the appropriate reproducing kernel Hilbert space.

Equations

$f_{m}(\bm{x};\Theta)=\bm{a}^{T}\sigma(B\bm{x}),$

$\mathcal{R}(\Theta)=\frac{1}{2}\mathbb{E}_{\bm{x},y}[(f(\bm{x};\Theta)-y)^{2}].$

$\hat{\mathcal{R}}_{n}(\Theta)=\frac{1}{2n}\sum_{i=1}^{n}(f(\bm{x}_{i};\Theta)-y_{i})^{2}.$

$\frac{d\Theta_{t}}{dt}=-\nabla\hat{\mathcal{R}}_{n}(\Theta_{t}).$

$k^{(a)}(\bm{x},\bm{x}^{\prime})$

$k^{(b)}(\bm{x},\bm{x}^{\prime})$

$K_{i,j}^{(a)}$

$K_{i,j}^{(b)}$

$\lambda_{n}^{(a)}\stackrel{\text{def}}{=}\lambda_{\min}(K^{(a)})>0,\quad\lambda_{n}^{(b)}\stackrel{\text{def}}{=}\lambda_{\min}(K^{(b)})>0.$

$T_{s}f(\bm{x})=\int_{\mathbb{S}^{d-1}}s(\bm{x},\bm{x}^{\prime})f(\bm{x}^{\prime})\,d\pi_{0}(\bm{x}^{\prime}).$

$f_{m}(\bm{x};\tilde{\bm{a}},B_{0})\stackrel{\text{def}}{=}\tilde{\bm{a}}^{T}\sigma(B_{0}\bm{x}),$

$\frac{d\tilde{\bm{a}}_{t}}{dt}=-\frac{1}{n}\sum_{i=1}^{n}(\tilde{\bm{a}}_{t}^{T}\sigma(B_{0}\bm{x}_{i})-y_{i})\sigma(B_{0}\bm{x}_{i}).$

$G_{i,j}^{(a)}(\Theta)$

$G_{i,j}^{(b)}(\Theta)$

$\|\nabla_{\Theta}\hat{\mathcal{R}}_{n}\|^{2}=\frac{m}{n}\bm{e}^{T}G\bm{e}.$

$2m\lambda_{\min}(G)\hat{\mathcal{R}}_{n}\leq\|\nabla_{\Theta}\hat{\mathcal{R}}_{n}\|^{2}\leq 2m\lambda_{\max}(G)\hat{\mathcal{R}}_{n}.$

$\hat{\mathcal{R}}_{n}(\Theta_{0})\leq\frac{1}{2}(1+c(\delta)\sqrt{m}\beta)^{2},$

$G^{(a)}(\Theta_{0})\to K^{(a)},\quad G^{(b)}(\Theta_{0})\to\beta^{2}K^{(b)}\quad\text{as }m\to\infty.$

$\lambda_{\min}(G(\Theta_{0}))\geq\frac{3}{4}(\lambda_{n}^{(a)}+\beta^{2}\lambda_{n}^{(b)}).$

$\mathcal{I}(\Theta_{0})\stackrel{\text{def}}{=}\{\Theta:\|G(\Theta)-G(\Theta_{0})\|_{F}\leq\frac{1}{4}(\lambda_{n}^{(a)}+\beta^{2}\lambda_{n}^{(b)})\}.$

$\lambda_{\min}(G(\Theta))\geq\lambda_{\min}(G(\Theta_{0}))-\|G(\Theta)-G(\Theta_{0})\|_{F}\geq\frac{1}{2}(\lambda_{n}^{(a)}+\beta^{2}\lambda_{n}^{(b)})$

$t_{0}\stackrel{\text{def}}{=}\inf\{t:\Theta_{t}\notin\mathcal{I}(\Theta_{0})\}.$

$\hat{\mathcal{R}}_{n}(\Theta_{t})\leq e^{-m(\lambda_{n}^{(a)}+\beta^{2}\lambda_{n}^{(b)})t}\hat{\mathcal{R}}_{n}(\Theta_{0}).$

$\frac{d\hat{\mathcal{R}}_{n}(\Theta_{t})}{dt}=-\|\nabla_{\Theta}\hat{\mathcal{R}}_{n}\|_{F}^{2}\leq-m(\lambda_{n}^{(a)}+\beta^{2}\lambda_{n}^{(b)})\hat{\mathcal{R}}_{n}(\Theta_{t}),$

$p_{n}\stackrel{\text{def}}{=}\frac{4\hat{\mathcal{R}}_{n}(\Theta_{0})}{m(\lambda_{n}^{(a)}+\beta^{2}\lambda_{n}^{(b)})},\quad q_{n}\stackrel{\text{def}}{=}p_{n}^{2}+\beta p_{n}.$

$|a_{k}(t)-a_{k}(0)|$

$\|\bm{b}_{k}(t)-\bm{b}_{k}(0)\|$

$\|\nabla_{a_{k}}\hat{\mathcal{R}}_{n}\|^{2}$

$\|\nabla_{\bm{b}_{k}}\hat{\mathcal{R}}_{n}\|^{2}$

$\alpha_{k}(t)=\max_{s\in[0,t]}|a_{k}(s)|,\quad\omega_{k}(t)=\max_{s\in[0,t]}\|\bm{b}_{k}(s)\|.$

$\|\bm{b}_{k}(t)-\bm{b}_{k}(0)\|\leq 2\int_{0}^{t}\alpha_{k}(t)\hat{\mathcal{R}}_{n}(\Theta_{t^{\prime}})\,dt^{\prime}\leq\frac{4\hat{\mathcal{R}}_{n}(\Theta_{0})\alpha_{k}(t)}{m(\lambda_{n}^{(a)}+\beta^{2}\lambda_{n}^{(b)})}=p_{n}\alpha_{k}(t),$

$|a_{k}(t)-a_{k}(0)|$

Full text

A Comparative Analysis of Optimization and Generalization Properties of Two-layer Neural Network and Random Feature Models Under Gradient Descent Dynamics

Weinan E
Department of Mathematics, Princeton University
Program in Applied and Computational Mathematics, Princeton University
Beijing Institute of Big Data Research

Chao Ma
Program in Applied and Computational Mathematics, Princeton University

Lei Wu
Program in Applied and Computational Mathematics, Princeton University

1 Introduction

Optimization and generalization are two central issues in the theoretical analysis of machine learning models. These issues are of special interest for modern neural network models, not only because of their practical success [18, 19], but also because of the fact that these neural network models are often heavily over-parametrized and traditional machine learning theory does not seem to work directly [21, 30]. For this reason, there has been a lot of recent theoretical work centered on these issues [15, 16, 12, 11, 2, 8, 10, 31, 29, 28, 25, 27]. One issue of particular interest is whether the gradient descent (GD) algorithm can produce models that optimize the empirical risk and at the same time generalize well for the population risk. In the case of over-parametrized two-layer neural network models, which will be the focus of this paper, it is generally understood that as a result of the non-degeneracy of the associated Gram matrix [29, 12], optimization can be accomplished using the gradient descent algorithm regardless of the quality of the labels, in spite of the fact that the empirical risk function is non-convex. In this regard, one can say that over-parametrization facilitates optimization.

The situation with generalization is a different story. There has been a lot of interest in the so-called “implicit regularization” effect [21], i.e., by tuning the parameters in the optimization algorithms, one might be able to guide the algorithm towards network models that generalize well, without the need to add any explicit regularization terms (see below for a review of the existing literature). But despite these efforts, it is fair to say that the general picture has yet to emerge.

In this paper, we perform a rather thorough analysis of the gradient descent algorithm for training two-layer neural network models. We study the case in which the parameters in both the input and output layers are updated – the case found in practice. In the heavily over-parametrized regime, for general initializations, we prove that the results of [12] still hold, namely, the gradient descent dynamics still converges to a global minimum exponentially fast, regardless of the quality of the labels. However, we also prove that the functions obtained are uniformly close to the ones found in an associated kernel method, with the kernel defined by the initialization.

In the second part of the paper, we study the more general situation when the assumption of over-parametrization is relaxed. We provide sharp estimates for both the empirical and population risks. In particular, we prove that for target functions in the appropriate reproducing kernel Hilbert space (RKHS) [3], the generalization error can be made small if a certain early stopping strategy is adopted for the gradient descent algorithm.

Our results imply that under this setting over-parametrized two-layer neural networks are a lot like the kernel methods: They can always fit any set of random labels, but in order to generalize, the target functions have to be in the right RKHS. This should be compared with the optimal generalization error bounds proved in [13] for regularized models.

1.1 Related work

The seminal work of [30] presented both numerical and theoretical evidence that over-parametrized neural networks can fit random labels. Building upon earlier work on the non-degeneracy of some Gram matrices [29], Du et al. went a step further by proving that the GD algorithm can find global minima of the empirical risk for sufficiently over-parametrized two-layer neural networks [12]. This result was extended to multi-layer networks in [11, 2] and to a general setting in [9]. The related result for infinitely wide neural networks was obtained in [14]. In this paper, we prove a new optimization result (Theorem 3.2) that removes the non-degeneracy assumption on the input data by utilizing the smoothness of the target function. The requirement on the network width is also significantly relaxed.

The issue of generalization is less clear. [10] established generalization error bounds for solutions produced by the online stochastic gradient descent (SGD) algorithm with early stopping when the target function is in a certain RKHS. Similar results were proved in [20] for the classification problem, in [8] for offline SGD algorithms, and in [1] for the GD algorithm. These results are similar to ours, but we do not require the network to be over-parametrized. Moreover, in Theorem 3.3 we show that in this setting neural networks are uniformly close to the random feature models if the network is highly over-parametrized.

More recently in [4], a generalization bound was derived for GD solutions using a data-dependent norm. This norm is bounded if the target function belongs to the appropriate RKHS. However, their error bounds are not strong enough to rule out the possibility of the curse of dimensionality. Indeed, the results of the present paper do suggest that the curse of dimensionality does occur in their setting (see Theorem 3.4).

[14] showed by a heuristic argument that the GD solutions of an infinitely wide neural network are captured by the so-called neural tangent kernel. In this paper, we provide a rigorous proof of the non-asymptotic version of the result for the two-layer neural network under weaker conditions.

2 Preliminaries

Throughout this paper, we use the notation $[n]=\{1,2,\dots,n\}$ for a positive integer $n$. We use $\|\cdot\|$ and $\|\cdot\|_{F}$ to denote the $\ell_{2}$ and Frobenius norms for matrices, respectively. We let $\mathbb{S}^{d-1}=\{\bm{x}\,:\,\|\bm{x}\|=1\}$, and use $\pi_{0}$ to denote the uniform distribution over $\mathbb{S}^{d-1}$. We use $X\lesssim Y$ to indicate that there exists an absolute constant $C_{0}>0$ such that $X\leq C_{0}Y$, and $X\gtrsim Y$ is similarly defined. If $f$ is a function defined on $\mathbb{R}^{d}$ and $\mu$ is a probability distribution on $\mathbb{R}^{d}$, we let $\|f\|_{\mu}=(\int_{\mathbb{R}^{d}}f(\bm{x})^{2}d\mu(\bm{x}))^{1/2}$.

2.1 Problem setup

We focus on the regression problem with a training data set given by $\{(\bm{x}_{i},y_{i})\}_{i=1}^{n}$ , i.i.d. samples drawn from a distribution $\rho$ , which is assumed fixed but only known through the samples. In this paper, we assume $\|\bm{x}\|_{2}=1$ and $|y|\leq 1$ . We are interested in fitting the data by a two-layer neural network:

$$f_{m}(\bm{x};\Theta)=\bm{a}^{T}\sigma(B\bm{x}),\tag{1}$$

where $\bm{a}\in\mathbb{R}^{m},B=(\bm{b}_{1},\bm{b}_{2},\cdots,\bm{b}_{m})^{T}\in\mathbb{R}^{m\times d}$ and $\Theta=\{\bm{a},B\}$ denote all the parameters. Here $\sigma(t)=\max(0,t)$ is the ReLU activation function. We will omit the subscript $m$ in the notation for $f_{m}$ if there is no danger of confusion. In formula (1), we omit the bias term for notational simplicity. The effect of the bias term can be incorporated if we think of $\bm{x}$ as $(\bm{x},1)^{T}$ .

The ultimate goal is to minimize the population risk defined by

$$\mathcal{R}(\Theta)=\frac{1}{2}\mathbb{E}_{\bm{x},y}[(f(\bm{x};\Theta)-y)^{2}].$$

But in practice, we can only work with the following empirical risk

$$\hat{\mathcal{R}}_{n}(\Theta)=\frac{1}{2n}\sum_{i=1}^{n}(f(\bm{x}_{i};\Theta)-y_{i})^{2}.$$

Gradient Descent

We are interested in analyzing the property of the following gradient descent algorithm: $\Theta_{t+1}=\Theta_{t}-\eta\nabla\hat{\mathcal{R}}_{n}(\Theta_{t}),$ where $\eta$ is the learning rate. For simplicity, we will focus on its continuous version, the gradient descent (GD) dynamics:

$$\frac{d\Theta_{t}}{dt}=-\nabla\hat{\mathcal{R}}_{n}(\Theta_{t}).\tag{2}$$
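The discrete update $\Theta_{t+1}=\Theta_{t}-\eta\nabla\hat{\mathcal{R}}_{n}(\Theta_{t})$ for the model $f_{m}(\bm{x};\Theta)=\bm{a}^{T}\sigma(B\bm{x})$ can be sketched in NumPy as follows. This is only an illustration: the function names are hypothetical, and the ReLU derivative at $0$ is taken to be $0$, which is an assumption rather than something fixed by the text.

```python
import numpy as np

def forward(a, B, X):
    """f_m(x; Theta) = a^T relu(B x) for every row x of X (shape n x d)."""
    return np.maximum(X @ B.T, 0.0) @ a

def empirical_risk(a, B, X, y):
    """Empirical risk (1 / (2n)) * sum_i (f(x_i) - y_i)^2."""
    e = forward(a, B, X) - y
    return 0.5 * np.mean(e ** 2)

def gradients(a, B, X, y):
    """Gradients of the empirical risk with respect to a (m,) and B (m x d)."""
    n = X.shape[0]
    pre = X @ B.T                    # (n, m) pre-activations b_k^T x_i
    act = np.maximum(pre, 0.0)       # ReLU features
    e = act @ a - y                  # residuals e_i = f(x_i) - y_i
    grad_a = act.T @ e / n           # (1/n) sum_i e_i relu(b_k^T x_i)
    mask = (pre > 0).astype(float)   # assumed ReLU derivative (0 at the kink)
    # (1/n) sum_i e_i a_k relu'(b_k^T x_i) x_i, one row per neuron k
    grad_B = ((mask * e[:, None]).T @ X) * a[:, None] / n
    return grad_a, grad_B

def gd_step(a, B, X, y, lr):
    """One gradient descent step on both layers with learning rate lr."""
    grad_a, grad_B = gradients(a, B, X, y)
    return a - lr * grad_a, B - lr * grad_B
```

Letting the learning rate tend to zero while rescaling time recovers the continuous GD dynamics studied in the paper; the sketch itself makes no claim about convergence rates.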

Initialization

Let $\Theta_{0}=\{\bm{a}(0),B(0)\}$ denote the initial parameters. We assume that $\{\bm{b}_{k}(0)\}_{k=1}^{m}$ are i.i.d. random variables drawn from $\pi_{0}$, and $\{a_{k}(0)\}_{k=1}^{m}$ are i.i.d. random variables drawn from the distribution defined by $\mathbb{P}\{a_{k}(0)=\beta\}=\mathbb{P}\{a_{k}(0)=-\beta\}=\frac{1}{2}$. Here $\beta$ controls the magnitude of the initialization, and it may depend on $m$, e.g. $\beta=\frac{1}{m}$ or $\frac{1}{\sqrt{m}}$. Other initialization schemes can also be considered (e.g. distributions other than $\pi_{0}$, other ways of initializing $\bm{a}$). The required arguments do not differ much from those for this special case.
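A minimal sketch of this initialization, assuming NumPy and a hypothetical helper name `init_params`: rows of $B(0)$ are drawn uniformly from the unit sphere by normalizing Gaussian vectors, and each $a_{k}(0)$ is $\pm\beta$ with equal probability.

```python
import numpy as np

def init_params(m, d, beta, rng):
    """Sample Theta_0 = {a(0), B(0)} following the scheme in the text."""
    # A normalized standard Gaussian vector is uniform on the unit sphere.
    B0 = rng.standard_normal((m, d))
    B0 /= np.linalg.norm(B0, axis=1, keepdims=True)
    # a_k(0) equals +beta or -beta with probability 1/2 each.
    a0 = beta * rng.choice([-1.0, 1.0], size=m)
    return a0, B0
```

For example, `init_params(m=64, d=5, beta=1 / 64**0.5, rng=np.random.default_rng(0))` gives the $\beta=1/\sqrt{m}$ scaling mentioned above.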

2.2 Assumption on the input data

With the activation function $\sigma(\cdot)$ and the distribution $\pi_{0}$, we can define two positive definite (PD) functions (we say that a continuous symmetric function $k$ is positive definite if and only if for any $\bm{x}_{1},\dots,\bm{x}_{n}$, the kernel matrix $K=(K_{i,j})\in\mathbb{R}^{n\times n}$ with $K_{i,j}=k(\bm{x}_{i},\bm{x}_{j})$ is positive definite):

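$$k^{(a)}(\bm{x},\bm{x}^{\prime})\stackrel{\text{def}}{=}\mathbb{E}_{\bm{b}\sim\pi_{0}}[\sigma(\bm{b}^{T}\bm{x})\sigma(\bm{b}^{T}\bm{x}^{\prime})],\quad k^{(b)}(\bm{x},\bm{x}^{\prime})\stackrel{\text{def}}{=}\mathbb{E}_{\bm{b}\sim\pi_{0}}[\sigma^{\prime}(\bm{b}^{T}\bm{x})\sigma^{\prime}(\bm{b}^{T}\bm{x}^{\prime})\langle\bm{x},\bm{x}^{\prime}\rangle].$$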

For a fixed training sample, the corresponding normalized kernel matrices $K^{(a)}=(K^{(a)}_{i,j}),K^{(b)}=(K^{(b)}_{i,j})\in\mathbb{R}^{n\times n}$ are defined by

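$$K^{(a)}_{i,j}=\frac{1}{n}k^{(a)}(\bm{x}_{i},\bm{x}_{j}),\quad K^{(b)}_{i,j}=\frac{1}{n}k^{(b)}(\bm{x}_{i},\bm{x}_{j}).$$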

Throughout this paper, we make the following assumption on the training set.

Assumption 1.

For the given training set $\{(\bm{x}_{i},y_{i})\}_{i=1}^{n}$ , we assume that the smallest eigenvalues of the two kernel matrices defined above are both positive, i.e.

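$$\lambda_{n}^{(a)}\stackrel{\text{def}}{=}\lambda_{\min}(K^{(a)})>0,\quad\lambda_{n}^{(b)}\stackrel{\text{def}}{=}\lambda_{\min}(K^{(b)})>0.$$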

Let $\lambda_{n}=\min\{\lambda_{n}^{(a)},\lambda_{n}^{(b)}\}$.

Remark 1.

Note that $\lambda_{n}^{(a)}\leq\min_{i\in[n]}K^{(a)}_{i,i}\leq 1/n$ and $\lambda_{n}^{(b)}\leq\min_{i\in[n]}K^{(b)}_{i,i}\leq 1/n$. In general, $\lambda^{(a)}_{n},\lambda^{(b)}_{n}$ depend on the data set. For any PD function $s(\cdot,\cdot)$, the Hilbert-Schmidt integral operator $T_{s}:L^{2}(\mathbb{S}^{d-1},\pi_{0})\to L^{2}(\mathbb{S}^{d-1},\pi_{0})$ is defined by

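$$T_{s}f(\bm{x})=\int_{\mathbb{S}^{d-1}}s(\bm{x},\bm{x}^{\prime})f(\bm{x}^{\prime})\,d\pi_{0}(\bm{x}^{\prime}).$$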

Let $\Lambda_{n}(T_{s})$ denote its $n$-th largest eigenvalue. If $\{\bm{x}_{i}\}_{i=1}^{n}$ are independently drawn from $\pi_{0}$, it was proved in [6] that with high probability $\lambda^{(a)}_{n}\geq\Lambda_{n}(T_{k^{(a)}})/2$ and $\lambda^{(b)}_{n}\geq\Lambda_{n}(T_{k^{(b)}})/2$. Using a similar idea, [29] provided lower bounds for $\lambda^{(b)}_{n}$ based on a geometric discrepancy that quantifies how uniformly $\{\bm{x}_{i}\}_{i=1}^{n}$ are distributed. In this paper, we leave $\lambda^{(a)}_{n}>0,\lambda^{(b)}_{n}>0$ as our basic assumption.
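The quantities $\lambda_{n}^{(a)},\lambda_{n}^{(b)}$ can be inspected numerically for a given data set. The sketch below rests on assumptions that should be checked against the paper's definitions: it takes $k^{(a)}(\bm{x},\bm{x}^{\prime})=\mathbb{E}_{\bm{b}\sim\pi_{0}}[\sigma(\bm{b}^{T}\bm{x})\sigma(\bm{b}^{T}\bm{x}^{\prime})]$, $k^{(b)}(\bm{x},\bm{x}^{\prime})=\mathbb{E}_{\bm{b}\sim\pi_{0}}[\sigma^{\prime}(\bm{b}^{T}\bm{x})\sigma^{\prime}(\bm{b}^{T}\bm{x}^{\prime})]\,\bm{x}^{T}\bm{x}^{\prime}$ and the normalization $K=k/n$, and it evaluates the expectations with the standard arc-cosine kernel formulas for unit vectors at angle $\theta$. The function names are hypothetical.

```python
import numpy as np

def kernel_matrices(X):
    """Normalized kernel matrices K^(a), K^(b) for unit-norm rows of X (n x d)."""
    n, d = X.shape
    cos = np.clip(X @ X.T, -1.0, 1.0)   # cos(theta_ij) = x_i^T x_j
    theta = np.arccos(cos)
    # E_b[relu(b.x) relu(b.x')] for b uniform on the unit sphere in R^d
    k_a = (np.sin(theta) + (np.pi - theta) * cos) / (2.0 * np.pi * d)
    # E_b[relu'(b.x) relu'(b.x')] * <x, x'>
    k_b = cos * (np.pi - theta) / (2.0 * np.pi)
    return k_a / n, k_b / n

def smallest_eigenvalues(X):
    """Return (lambda_n^(a), lambda_n^(b)) under the assumed definitions."""
    K_a, K_b = kernel_matrices(X)
    return np.linalg.eigvalsh(K_a)[0], np.linalg.eigvalsh(K_b)[0]
```

Under these assumptions the diagonal entries are $1/(2nd)$ and $1/(2n)$, which is consistent with the bound $\lambda_{n}^{(a)},\lambda_{n}^{(b)}\leq 1/n$ in Remark 1.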

2.3 The random feature model

We introduce the following random feature model [22] as a reference for the two-layer neural network model

$$f_{m}(\bm{x};\tilde{\bm{a}},B_{0})\stackrel{\text{def}}{=}\tilde{\bm{a}}^{T}\sigma(B_{0}\bm{x}),$$

where $\bm{a}\in\mathbb{R}^{m},B_{0}\in\mathbb{R}^{m\times d}$ . Here $B_{0}$ is fixed at the corresponding initial values for the neural network model, and is not part of the parameters to be trained. The corresponding gradient descent dynamics is given by

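$$\frac{d\tilde{\bm{a}}_{t}}{dt}=-\frac{1}{n}\sum_{i=1}^{n}(\tilde{\bm{a}}_{t}^{T}\sigma(B_{0}\bm{x}_{i})-y_{i})\sigma(B_{0}\bm{x}_{i}).\tag{5}$$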

This dynamics is relatively simple since it is linear.
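Because the random feature dynamics is linear, the residual $\tilde{\bm{e}}(t)=\Phi\tilde{\bm{a}}_{t}-\bm{y}$ with $\Phi_{i,k}=\sigma(\bm{b}_{k}(0)^{T}\bm{x}_{i})$ satisfies $\frac{d\tilde{\bm{e}}}{dt}=-\frac{1}{n}\Phi\Phi^{T}\tilde{\bm{e}}$ and can be written in closed form with an eigendecomposition. The following sketch (hypothetical function names, NumPy assumed) compares that closed form with an explicit Euler discretization of the flow.

```python
import numpy as np

def rf_residual(Phi, y, a0, t):
    """Residual Phi a_t - y of the random feature gradient flow at time t."""
    n = Phi.shape[0]
    M = Phi @ Phi.T / n                  # symmetric PSD matrix driving the residual ODE
    evals, evecs = np.linalg.eigh(M)
    e0 = Phi @ a0 - y
    # Each eigen-direction decays independently as exp(-eigenvalue * t).
    return evecs @ (np.exp(-evals * t) * (evecs.T @ e0))

def rf_euler(Phi, y, a0, t, steps):
    """Explicit Euler discretization of da/dt = -(1/n) Phi^T (Phi a - y)."""
    n = Phi.shape[0]
    a = a0.copy()
    dt = t / steps
    for _ in range(steps):
        a -= dt * (Phi.T @ (Phi @ a - y)) / n
    return a
```

With a small enough step, `Phi @ rf_euler(...) - y` should agree with `rf_residual(...)` up to discretization error, and the residual norm is non-increasing in `t`.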

3 Analysis of the over-parameterized case

In this section, we consider the optimization and generalization properties of the GD dynamics in the over-parametrized regime. We introduce two Gram matrices $G^{(a)}(\Theta),G^{(b)}(\Theta)\in\mathbb{R}^{n\times n}$ , defined by

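$$G^{(a)}_{i,j}(\Theta)=\frac{1}{nm}\sum_{k=1}^{m}\sigma(\bm{b}_{k}^{T}\bm{x}_{i})\sigma(\bm{b}_{k}^{T}\bm{x}_{j}),\quad G^{(b)}_{i,j}(\Theta)=\frac{1}{nm}\sum_{k=1}^{m}a_{k}^{2}\,\bm{x}_{i}^{T}\bm{x}_{j}\,\sigma^{\prime}(\bm{b}_{k}^{T}\bm{x}_{i})\sigma^{\prime}(\bm{b}_{k}^{T}\bm{x}_{j}).$$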

Let $G=G^{(a)}+G^{(b)}\in\mathbb{R}^{n\times n}$, $e_{j}=f(\bm{x}_{j};\Theta)-y_{j}$ and $\bm{e}=(e_{1},e_{2},\cdots,e_{n})^{T}$. It is easy to see that

$$\|\nabla_{\Theta}\hat{\mathcal{R}}_{n}\|^{2}=\frac{m}{n}\bm{e}^{T}G\bm{e}.$$

Since $\hat{\mathcal{R}}_{n}=\frac{1}{2n}\bm{e}^{T}\bm{e}$ , we have

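$$2m\lambda_{\min}(G)\hat{\mathcal{R}}_{n}\leq\|\nabla_{\Theta}\hat{\mathcal{R}}_{n}\|^{2}\leq 2m\lambda_{\max}(G)\hat{\mathcal{R}}_{n}.$$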
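The identity $\|\nabla_{\Theta}\hat{\mathcal{R}}_{n}\|^{2}=\frac{m}{n}\bm{e}^{T}G\bm{e}$ and the resulting two-sided bound can be checked numerically. The sketch below assumes the Gram matrices are $G^{(a)}_{i,j}=\frac{1}{nm}\sum_{k}\sigma(\bm{b}_{k}^{T}\bm{x}_{i})\sigma(\bm{b}_{k}^{T}\bm{x}_{j})$ and $G^{(b)}_{i,j}=\frac{1}{nm}\sum_{k}a_{k}^{2}\sigma^{\prime}(\bm{b}_{k}^{T}\bm{x}_{i})\sigma^{\prime}(\bm{b}_{k}^{T}\bm{x}_{j})\bm{x}_{i}^{T}\bm{x}_{j}$, which is the normalization that makes the identity hold; the function names are hypothetical.

```python
import numpy as np

def gram_matrices(a, B, X):
    """Gram matrices G^(a), G^(b) under the normalization assumed in the lead-in."""
    n, m = X.shape[0], B.shape[0]
    pre = X @ B.T                          # (n, m) pre-activations
    act = np.maximum(pre, 0.0)
    mask = (pre > 0).astype(float)         # assumed ReLU derivative
    G_a = act @ act.T / (n * m)
    G_b = ((mask * a ** 2) @ mask.T) * (X @ X.T) / (n * m)
    return G_a, G_b

def grad_norm_sq(a, B, X, y):
    """Squared norm of the full gradient of the empirical risk, and the residuals."""
    n = X.shape[0]
    pre = X @ B.T
    act = np.maximum(pre, 0.0)
    e = act @ a - y
    grad_a = act.T @ e / n
    grad_B = (((pre > 0) * e[:, None]).T @ X) * a[:, None] / n
    return np.sum(grad_a ** 2) + np.sum(grad_B ** 2), e
```

Comparing `grad_norm_sq` with `(m / n) * e @ (G_a + G_b) @ e` on random inputs gives a quick sanity check of the algebra; since $\hat{\mathcal{R}}_{n}=\frac{1}{2n}\bm{e}^{T}\bm{e}$, the eigenvalue bounds then follow from the Rayleigh quotient.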

3.1 Properties of the initialization

Lemma 1.

For any fixed $\delta>0$ , with probability at least $1-\delta$ over the random initialization, we have

$$\hat{\mathcal{R}}_{n}(\Theta_{0})\leq\frac{1}{2}(1+c(\delta)\sqrt{m}\beta)^{2},$$

where $c(\delta)=2+\sqrt{\ln(1/\delta)}$ .

The proof of this lemma can be found in Appendix C.

In addition, at the initialization, the Gram matrices satisfy

$$G^{(a)}(\Theta_{0})\to K^{(a)},\quad G^{(b)}(\Theta_{0})\to\beta^{2}K^{(b)}\quad\text{as }m\to\infty.$$

In fact, we have

Lemma 2.

For $\delta>0$ , if $m\geq\frac{8}{\lambda_{n}^{2}}\ln(2n^{2}/\delta)$ , we have, with probability at least $1-\delta$ over the random choice of $\Theta_{0}$

$$\lambda_{\min}(G(\Theta_{0}))\geq\frac{3}{4}(\lambda_{n}^{(a)}+\beta^{2}\lambda_{n}^{(b)}).$$

The proof of this lemma is deferred to Appendix D.

3.2 Gradient descent near the initialization

We define a neighborhood of the initialization by

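$$\mathcal{I}(\Theta_{0})\stackrel{\text{def}}{=}\{\Theta:\|G(\Theta)-G(\Theta_{0})\|_{F}\leq\frac{1}{4}(\lambda_{n}^{(a)}+\beta^{2}\lambda_{n}^{(b)})\}.$$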

Using the lemma above, we conclude that for any fixed $\delta>0$ , with probability at least $1-\delta$ over the random choices of $\Theta_{0}$ , we must have

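$$\lambda_{\min}(G(\Theta))\geq\lambda_{\min}(G(\Theta_{0}))-\|G(\Theta)-G(\Theta_{0})\|_{F}\geq\frac{1}{2}(\lambda_{n}^{(a)}+\beta^{2}\lambda_{n}^{(b)})$$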

for all $\Theta\in\mathcal{I}(\Theta_{0})$ .

For the GD dynamics, we define the exit time of $\mathcal{I}(\Theta_{0})$ by

$$t_{0}\stackrel{\text{def}}{=}\inf\{t:\Theta_{t}\notin\mathcal{I}(\Theta_{0})\}.$$

Lemma 3.

For any fixed $\delta\in(0,1)$, assume that $m\geq\frac{8}{\lambda_{n}^{2}}\ln(2n^{2}/\delta)$. Then with probability at least $1-\delta$ over the random choices of $\Theta_{0}$, the following holds for any $t\in[0,t_{0}]$:

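$$\hat{\mathcal{R}}_{n}(\Theta_{t})\leq e^{-m(\lambda_{n}^{(a)}+\beta^{2}\lambda_{n}^{(b)})t}\hat{\mathcal{R}}_{n}(\Theta_{0}).$$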

Proof.

We have

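$$\frac{d\hat{\mathcal{R}}_{n}(\Theta_{t})}{dt}=-\|\nabla_{\Theta}\hat{\mathcal{R}}_{n}\|_{F}^{2}\leq-m(\lambda_{n}^{(a)}+\beta^{2}\lambda_{n}^{(b)})\hat{\mathcal{R}}_{n}(\Theta_{t}),$$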

where the last inequality is due to the fact that $\Theta_{t}\in\mathcal{I}(\Theta_{0})$ . This completes the proof. ∎
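The passage from this differential inequality to the exponential bound of Lemma 3 is the standard Grönwall argument. Writing $c=m(\lambda_{n}^{(a)}+\beta^{2}\lambda_{n}^{(b)})$ (a shorthand used only here), one way to spell it out for $t\in[0,t_{0}]$ is:

```latex
\frac{d}{dt}\Big(e^{ct}\,\hat{\mathcal{R}}_{n}(\Theta_{t})\Big)
  = e^{ct}\Big(\frac{d\hat{\mathcal{R}}_{n}(\Theta_{t})}{dt}
      + c\,\hat{\mathcal{R}}_{n}(\Theta_{t})\Big) \leq 0
\quad\Longrightarrow\quad
e^{ct}\,\hat{\mathcal{R}}_{n}(\Theta_{t}) \leq \hat{\mathcal{R}}_{n}(\Theta_{0})
\quad\Longrightarrow\quad
\hat{\mathcal{R}}_{n}(\Theta_{t}) \leq e^{-ct}\,\hat{\mathcal{R}}_{n}(\Theta_{0}).
```

The constant $c$ comes from combining the lower bound $\|\nabla_{\Theta}\hat{\mathcal{R}}_{n}\|^{2}\geq 2m\lambda_{\min}(G)\hat{\mathcal{R}}_{n}$ with $\lambda_{\min}(G(\Theta))\geq\frac{1}{2}(\lambda_{n}^{(a)}+\beta^{2}\lambda_{n}^{(b)})$, which holds on $\mathcal{I}(\Theta_{0})$.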

We define two quantities:

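$$p_{n}\stackrel{\text{def}}{=}\frac{4\hat{\mathcal{R}}_{n}(\Theta_{0})}{m(\lambda_{n}^{(a)}+\beta^{2}\lambda_{n}^{(b)})},\quad q_{n}\stackrel{\text{def}}{=}p_{n}^{2}+\beta p_{n}.$$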

The following is the most crucial characterization of the GD dynamics.

Proposition 3.1.

For any $\delta>0$, assume $m\geq 1024\lambda_{n}^{-2}\ln(n^{2}/\delta)$. Then, with probability at least $1-\delta$, the following holds for any $t\in[0,t_{0}]$:

[TABLE]

Proof.

First, we have

[TABLE]

To facilitate the analysis, we define the following two quantities,

$$\alpha_{k}(t)=\max_{s\in[0,t]}|a_{k}(s)|,\quad\omega_{k}(t)=\max_{s\in[0,t]}\|\bm{b}_{k}(s)\|.$$

Using Lemma 3, we have

[TABLE]

Combining the two inequalities above, we get

[TABLE]

Using Lemma 1 and the fact that $m\geq\max\{\frac{16}{\lambda^{(a)}_{n}},\frac{64c^{2}(\delta)}{\lambda^{(b)}_{n}\lambda^{(a)}_{n}}\}$ , we have

[TABLE]

Therefore,

[TABLE]

Inserting the above estimates back to (10), we obtain

[TABLE]

Since $m\geq\max\{\frac{16}{\sqrt{\lambda^{(a)}_{n}\lambda^{(b)}_{n}}},\frac{1024c^{2}(\delta)}{(\lambda^{(b)}_{n})^{2}}\}$ , we have

[TABLE]

Therefore we have $\omega_{k}(t)\leq 1+\|\bm{b}_{k}(t)-\bm{b}_{k}(0)\|\leq 2$ , which leads to

[TABLE]

∎

The following lemma describes how $p_{n}$ and $q_{n}$ depend on $\beta$ and $m$.

Lemma 4.

For any $\delta>0$ , assume $m\geq 1024\lambda_{n}^{-2}\ln(n^{2}/\delta)$ . Let $C(\delta)=10c^{2}(\delta)$ . If $\beta\leq 1$ , we have

[TABLE]

If $\beta>1$ , we have

[TABLE]

3.3 Global convergence for arbitrary labels

Proposition 3.1 and Lemma 4 tell us that no matter how large $\beta$ is, we have

[TABLE]

This actually implies that the GD dynamics always stays in $\mathcal{I}(\Theta_{0})$ , i.e. $t_{0}=\infty$ .

Theorem 3.2.

For any $\delta\in(0,1)$ , assume $m\gtrsim\lambda_{n}^{-4}n^{2}\delta^{-1}\ln(n^{2}/\delta)$ . Then with probability at least $1-\delta$ over the random initialization, we have

[TABLE]

for any $t\geq 0$ .

Proof.

According to Lemma 3, we only need to prove that $t_{0}=\infty$ . Assume $t_{0}<\infty$ .

Let us first consider the Gram matrix $G^{(a)}$ . Since $\sigma(\cdot)$ is $1-$ Lipschitz and $\max_{k}\|\bm{b}_{k}(t_{0})-\bm{b}_{k}(0)\|\leq q_{n}\leq 1$ , we have

[TABLE]

This leads to

[TABLE]

Next we turn to the Gram matrix $G^{(b)}$ . Define the event

[TABLE]

Since $\sigma(\cdot)$ is ReLU, this event happens only if $|\bm{x}_{i}^{T}\bm{b}_{k}(0)|\leq q_{n}$ . By the fact that $\|\bm{x}_{i}\|=1$ and $\bm{b}_{k}(0)$ is drawn from the uniform distribution over the sphere, we have $\mathbb{P}[D_{i,k}]\lesssim q_{n}$ . Therefore the entry-wise deviation of $G^{(b)}$ satisfies,

[TABLE]

where

[TABLE]

Note that $\mathbb{E}[Q_{k,i,j}]\leq\mathbb{P}[D_{k,i}\cup D_{k,j}]\lesssim q_{n}$ . In addition, by Proposition 3.1, we have

[TABLE]

Hence using $q_{n}=p_{n}^{2}+\beta p_{n}\leq 1$ , we obtain

[TABLE]

By the Markov inequality, with probability $1-\delta/n$ we have

[TABLE]

Consequently, with probability $1-\delta$ we have

[TABLE]

Combining (15) and (17), we get

[TABLE]

where the last inequality comes from Lemma 4. Taking $m\gtrsim\lambda_{n}^{-4}n^{2}\delta^{-1}\ln(n^{2}/\delta)$, we get

[TABLE]

The above result contradicts the definition of $t_{0}$ . Therefore $t_{0}=\infty$ . ∎

Remark 2.

Compared with Proposition 3.1, the above theorem imposes a stronger assumption on the network width: $m\geq\text{poly}(\delta^{-1})$ . This is due to the lack of continuity of $\sigma^{\prime}(\cdot)$ when handling $\|G^{(b)}(\Theta_{t_{0}})-G^{(b)}(\Theta_{0})\|_{F}$ . If $\sigma^{\prime}(\cdot)$ is continuous, we can get rid of the dependence on $\text{poly}(\delta^{-1})$ . In addition, it is also possible to remove this assumption for the case when $\beta=o(1)$ , since in this case the Gram matrix $G=G^{(a)}+\beta^{2}G^{(b)}$ is dominated by $G^{(a)}$ .

Remark 3.

Theorem 3.2 is closely related to the result of Du et al. [12] where exponential convergence to global minima was first proved for over-parametrized two-layer neural networks. But it improves the result of [12] in two aspects. First, as is done in practice, we allow the parameters in both layers to be updated, while [12] chooses to freeze the parameters in the first layer. Secondly, our analysis does not impose any specific requirement on the scale of the initialization whereas the proof of [12] relies on the specific scaling: $\beta\sim 1/\sqrt{m}$ .

3.4 Characterization of the whole GD trajectory

In the last subsection, we showed that very wide networks can fit arbitrary labels. In this subsection, we study the functions represented by such networks. We show that for highly over-parametrized two-layer neural networks, the solution of the GD dynamics is uniformly close to the solution for the random feature model starting from the same initial function.

Theorem 3.3.

Assume $\beta\leq 1$ . Denote the solution of GD dynamics for the random feature model by

[TABLE]

where $\tilde{\bm{a}}_{t}$ is the solution of GD dynamics (5). For any $\delta\in(0,1)$ , assume that $m\gtrsim\lambda_{n}^{-4}n^{2}\delta^{-1}\ln(n^{2}\delta^{-1})$ . Then with probability at least $1-6\delta$ we have,

[TABLE]

where $c(\delta)=1+\sqrt{\ln(1/\delta)}$ .

Remark 4.

Again the factor $\delta^{-1}$ in the condition for $m$ can be removed if $\sigma$ is assumed to be smooth or $\beta$ is assumed to be small (see the remark at the end of Theorem 3.2).

Remark 5.

If $\beta=o(m^{-1/6})$, the right-hand side of (18) goes to $0$ as $m\rightarrow\infty$. For example, if we take $\beta=1/\sqrt{m}$, we have

[TABLE]

Hence this theorem says that the GD trajectory of a very wide network is uniformly close to the GD trajectory of the related kernel method (5).
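The closeness of the two trajectories can be explored empirically. The sketch below is an illustration under stated assumptions, not the paper's experiment: it replaces the continuous GD dynamics by discrete gradient descent with a small step, starts both models from the same random initialization, and records the largest gap between the two predictors on a set of test points. The function name and all numerical settings are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def compare_nn_and_rf(X, y, X_test, m, beta, lr, steps, seed=0):
    """Run discrete GD on the two-layer net and on the random feature model
    from the same initialization. Returns the largest test-set gap observed
    and the initial and final empirical risks of the network."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    B0 = rng.standard_normal((m, d))
    B0 /= np.linalg.norm(B0, axis=1, keepdims=True)   # rows uniform on the sphere
    a0 = beta * rng.choice([-1.0, 1.0], size=m)

    a, B = a0.copy(), B0.copy()    # network: both layers are trained
    a_rf = a0.copy()               # random features: B stays fixed at B0
    Phi0, Phi0_test = relu(X @ B0.T), relu(X_test @ B0.T)

    risk0 = 0.5 * np.mean((Phi0 @ a0 - y) ** 2)
    max_gap = 0.0
    for _ in range(steps):
        # Gradient step for the two-layer network.
        pre = X @ B.T
        e = relu(pre) @ a - y
        grad_a = relu(pre).T @ e / n
        grad_B = (((pre > 0) * e[:, None]).T @ X) * a[:, None] / n
        a, B = a - lr * grad_a, B - lr * grad_B
        # Gradient step for the random feature model (outer layer only).
        a_rf = a_rf - lr * (Phi0.T @ (Phi0 @ a_rf - y)) / n
        # Largest pointwise gap between the two predictors on the test points.
        gap = np.max(np.abs(relu(X_test @ B.T) @ a - Phi0_test @ a_rf))
        max_gap = max(max_gap, gap)
    risk_final = 0.5 * np.mean((relu(X @ B.T) @ a - y) ** 2)
    return max_gap, risk0, risk_final
```

Running this for increasing `m` with `beta = 1 / np.sqrt(m)` and a step size of order `1 / m` is one way to see how the gap behaves as the width grows; the theorem predicts that it shrinks, but the sketch itself does not guarantee any particular value.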

Proof of Theorem 3.3

We define

[TABLE]

Recalling the definition of $G(\Theta_{t})$ in Section 3, we know that $G(\Theta_{t})_{i,j}=g^{m}(\bm{x}_{i},\bm{x}_{j},t)$. For any $\bm{x}\in\mathbb{S}^{d-1}$, let $\bm{g}(\bm{x},t)$ and $\bm{g}^{(a)}(\bm{x})$ be two $n$-dimensional vectors defined by

[TABLE]

For GD dynamics (2), define $\bm{e}(t)=(f_{m}(\bm{x}_{i};\Theta_{t})-y_{i})\in\mathbb{R}^{n}$. Then we have

[TABLE]

For GD dynamics (5) of the random feature model, we define $\tilde{\bm{e}}(t)=(f_{m}(\bm{x}_{i};\tilde{\bm{a}}_{t},B_{0})-y_{i})\in\mathbb{R}^{n}$ . Then, we have

[TABLE]

From (22) and (23), we have

[TABLE]

Let

[TABLE]

then we have

[TABLE]

We are now going to bound $J_{1}(\bm{x},t)$ and $J_{2}(\bm{x},t)$ .

We first consider $J_{1}$ . By Theorem (3.1), with probability at least $1-\delta$ we have

[TABLE]

for any $t\geq 0$ . Therefore, for any $\bm{x}\in\mathbb{S}^{d-1}$ , we have

[TABLE]

Hence, by the estimates of $\hat{\mathcal{R}}_{n}(\Theta_{0})$ in Lemma 1, we have

[TABLE]

Inserting the estimate of $q_{n}$ in Lemma 4, we get

[TABLE]

Next we turn to estimating $J_{2}$ . Let $\bm{u}(t)=\bm{e}(t)-\tilde{\bm{e}}(t)$ . Following (22) and (23), we obtain

[TABLE]

Solving the equation above gives

[TABLE]

Consider the initializations for which $\lambda_{\min}(G^{(a)}(\Theta_{0}))\geq\frac{3\lambda^{(a)}_{n}}{4}$ . The probability of this event is no less than $1-\delta$ . For such initializations, we have

[TABLE]

Using Proposition 3.1, we conclude that with probability no less than $1-2\delta$, the following holds:

[TABLE]

Together with the fact that $\|\bm{e}(s)\|\leq\sqrt{2n\hat{\mathcal{R}}_{n}(\Theta_{0})}e^{-\frac{m\lambda^{(a)}_{n}}{2}s}$ , we obtain

[TABLE]

In addition, for any $\bm{x}\in\mathbb{S}^{d-1}$, we have $\|\bm{g}^{(a)}(\bm{x})\|\leq\frac{1}{m\sqrt{n}}$. Hence, plugging (3.4) into $J_{2}$ leads to

[TABLE]

Substituting in the estimates for $q_{n}$ and $\hat{\mathcal{R}}_{n}(\Theta_{0})$ , and assuming that $\beta^{2}\leq 1$ , we obtain, for any $\delta>0$ , with probability no less than $1-3\delta$ ,

[TABLE]

Finally, combining the estimates of $J_{1}$ and $J_{2}$ , we conclude that

[TABLE]

holds for any $\delta>0$ with probability at least $1-6\delta$ . This completes the proof of Theorem 3.3.

3.5 Curse of dimensionality of the implicit regularization

From (24), we have

[TABLE]

where $w_{i}(t)=m\int_{0}^{t}\tilde{e}_{i}(s)ds$. The second term on the right-hand side of (3.5) lies in the span of $n$ fixed basis functions: $\{g^{(a)}(\bm{x},\bm{x}_{1}),g^{(a)}(\bm{x},\bm{x}_{2}),\cdots,g^{(a)}(\bm{x},\bm{x}_{n})\}$.

For any probability distribution $\pi$ over $\mathbb{S}^{d-1}$, we define

[TABLE]

For any $h\in\mathcal{H}_{\pi}$ , define $\|h\|^{2}_{\mathcal{H}_{\pi}}=\mathbb{E}_{\pi}[|a^{2}(\bm{w})|]$ . As shown in [23], $\mathcal{H}_{\pi}$ is exactly the RKHS with the kernel defined by $k(\bm{x},\bm{x}^{\prime})=\mathbb{E}_{\pi}[\sigma(\bm{w}^{T}\bm{x})\sigma(\bm{w}^{T}\bm{x}^{\prime})]$ .

Definition 1 (Barron space).

The Barron space is defined as the union of $\mathcal{H}_{\pi}$ , i.e.

[TABLE]

The Barron norm for any $h\in\mathcal{B}$ is defined by

[TABLE]

To signify the dependence on the target function and data set, we introduce the notation:

[TABLE]

where the right hand side is the GD solution of the random feature model obtained by using the training data $\{\bm{x}_{i},y_{i}\}_{i=1}^{n}$ with $y_{i}=f(\bm{x}_{i})$ and $\Theta_{0}$ as the initial parameters. Let $\mathcal{B}_{Q}=\{f\in\mathcal{B}\,:\,\|f\|_{\mathcal{B}}\leq Q\}$ . We then have the following theorem.

Theorem 3.4.

There exists an absolute constant $\kappa>0$ , such that for any $t\in[0,+\infty)$

[TABLE]

Remark 6.

Combining Theorem 3.4 with Theorem 3.3, we conclude that for any $\delta\in(0,1)$, if $m$ is sufficiently large, then with probability at least $1-\delta$ we have

[TABLE]

where

[TABLE]

denotes the solution at time $t$ of the GD dynamics for the two-layer neural network model. If $\beta$ is sufficiently small (e.g. $\beta=o(m^{-1/6})$ ), then we see that the curse of dimensionality also holds for the solutions generated by the GD dynamics for the two-layer neural network model. Since this statement holds for all time $t$ , no early-stopping strategy is able to fix this curse of dimensionality problem.

In contrast, it has been proved in [13] that an appropriate regularization can avoid this curse of dimensionality problem, i.e. if we denote by $\mathcal{M}(f,\{\bm{x}_{i}\}_{i=1}^{n})$ the estimator for the regularized model in [13] (see (86) below), then it was shown that for any $\delta>0$ , with probability at least $1-\delta$ over the sampling of $\{\bm{x}_{i}\}_{i=1}^{n}$ , the following holds

[TABLE]

The comparison between (40) and (41) provides a quantitative understanding of the insufficiency of using the random feature model to explain the generalization behavior of neural network models.

To prove Theorem 3.4, we need the following lemma, which is proved in [5].

Lemma 5.

Let $\hat{f}(\omega)$ be the Fourier transform of a function $f$ defined on $\mathbb{R}^{d}$ . Let $\Gamma_{Q}=\{f:\ \int\|\omega\|_{1}^{2}|\hat{f}(\omega)|d\omega<Q\}$ . Then, for any fixed functions $h_{1},h_{2},...,h_{n}$ , we have

[TABLE]

We now prove Theorem 3.4.

Proof.

As is shown in [7, 17], any function $f\in\Gamma_{Q}$ can be represented as

[TABLE]

for some $\pi$ and $\|f\|_{\mathcal{H}_{\pi}}\leq Q$ , which means $f\in\mathcal{B}_{Q}$ . Hence, $\Gamma_{Q}\subset\mathcal{B}_{Q}$ . Next, since the training data $\{\bm{x}_{i}\}_{i=1}^{n}$ and the initialization are fixed, we have

[TABLE]

Therefore, by Lemma 5, we obtain

[TABLE]

for some universal constant $\kappa$ . ∎

4 Analysis of the general case

In this section, we will relax the requirement of the network width. We will make the following assumption on the target function.

Assumption 2.

We assume that the target function $f^{*}$ admits the following integral representation

[TABLE]

with $\gamma(f^{*})\stackrel{\text{def}}{=}\max\{1,\sup_{\bm{b}\in\mathbb{S}^{d-1}}|a^{*}(\bm{b})|\}<\infty$.

Let $\mathcal{H}_{k^{(a)}}$ be the RKHS induced by $k^{(a)}(\cdot,\cdot)$. It was shown in [23] that $\|f^{*}\|_{\mathcal{H}_{k^{(a)}}}=\sqrt{\mathbb{E}_{\pi_{0}}[|a^{*}(\bm{b})|^{2}]}\leq\gamma(f^{*})$. Thus the assumption above implies that $f^{*}\in\mathcal{H}_{k^{(a)}}$.

The following approximation result is essentially the same as the ones in [23, 24]. Since we are interested in the explicit control for the norm of the solution, we provide a complete proof in Appendix A.

Lemma 6.

Assume that the target function $f^{*}$ satisfies Assumption 2. Then for any $\delta>0$ , with probability at least $1-\delta$ over the choice of $B_{0}$ , there exists $\bm{a}^{*}\in\mathbb{R}^{m}$ such that

[TABLE]

where $\mathcal{R}(\bm{a}^{*},B_{0})=\|f_{m}(\cdot;\bm{a}^{*},B_{0})-f^{*}(\cdot)\|^{2}_{\rho}$ is the population risk.

The following generalization bound for the random feature model will be used later.

Lemma 7.

For fixed $B_{0}$ , and any $\delta>0$ , with probability no less than $1-3\delta$ over the choice of the training data, we have

[TABLE]

for any $\bm{a}\in\mathbb{R}^{m}$ .

Please see Appendix B for the proof.

4.1 Optimization results

We first show that the gradient descent algorithm can reduce the empirical risk to $\mathcal{O}(\frac{1}{m}+\frac{1}{\sqrt{n}})$ . Here we will assume $m\geq n$ . This assumption is not used in the next subsection, except for Corollary 4.3.

Theorem 4.1.

Take $\beta=\frac{c}{m}$ for some absolute constant $c$ . Assume that the target function $f^{*}$ satisfies Assumption 2, and $\|f^{*}\|_{\infty}\leq 1$ . Then, for any $\delta\in(0,1)$ , with probability no less than $1-4\delta$ we have

[TABLE]

for any $t>0$ , where $C$ is a constant depending on $\delta$ , $\gamma(f^{*})$ and $c$ .

The next three lemmas give bounds on the changes of the parameters.

Lemma 8.

Let $\beta=\frac{c}{m}$ , and $T$ be a fixed constant. Then there exists constant $C_{T}$ depending on $T$ , such that for any $0\leq t\leq T$ ,

[TABLE]

and

[TABLE]

Proof.

By the gradient descent dynamics, we have

[TABLE]

Since $\|a_{k}(0)\|=\frac{c}{m}$ and $\|\bm{b}_{k}(0)\|=1$ , we have

[TABLE]

If $t\leq T$ , since $\cosh((c+1)t)\leq\frac{e^{(c+1)T}+1}{2}$ and $\sinh((c+1)t)\leq\frac{e^{(c+1)T}+1}{2}t$ , we have

[TABLE]

with $C_{T}=\frac{e^{(c+1)T}+1}{2}$ . Hence, we have

[TABLE]

For $\|B_{t}-B_{0}\|$ , consider a more refined estimate

[TABLE]

Plugging in the above estimate for $a_{k}$ , we obtain

[TABLE]

∎

Lemma 9.

Let $\gamma=\gamma(f^{*})$ , $\beta=\frac{c}{m}$ , and assume $\sqrt{m}\geq\gamma$ . Then, for any $\delta>0$ , with probability no less than $1-4\delta$ , we have for any $0\leq t\leq T$ ,

[TABLE]

where $\tilde{C}_{T}$ is a constant.

Proof.

For $\|\tilde{\bm{a}}_{t}\|$ , consider the Lyapunov function

[TABLE]

Since $\hat{\mathcal{R}}_{n}(\tilde{\bm{a}}_{t},B_{0})$ is convex with respect to $\tilde{\bm{a}}_{t}$ , we have $\frac{d}{dt}J(t)\leq 0$ , which implies $J(t)\leq J(0)$ . Hence we have

[TABLE]

Since $\hat{\mathcal{R}}_{n}(\tilde{\bm{a}}_{t},B_{0})\geq 0$ , we obtain

[TABLE]

By Lemma 6 and Lemma 7, when $\sqrt{m}\geq\gamma$ , with probability no less than $1-4\delta$ , we have

[TABLE]

Therefore we have

[TABLE]

Letting $\tilde{C}=\max\{\sqrt{4\gamma^{2}+2c^{2}},2\sqrt{2}(2\gamma+1)\left(1+\sqrt{2\log(\frac{4\sqrt{m}}{\gamma\delta})}\right)\}$ , we get

[TABLE]

∎

Lemma 10.

Under the assumptions of Lemmas 8 and 9, for any $0\leq t\leq T$ , we have

[TABLE]

for some constant $C_{T}$ .

Proof.

Note that

[TABLE]

Taking the inner product of both sides of (60) with $\bm{a}_{t}-\tilde{\bm{a}}_{t}$ , we get

[TABLE]

Using the estimates in Lemmas 8 and 9, we obtain

[TABLE]

∎

Proof of Theorem 4.1

Let $\hat{\rho}=\frac{1}{n}\sum_{i=1}^{n}\delta_{\bm{x}_{i}}$ , then we have

[TABLE]

By the Cauchy-Schwarz inequality, we have

[TABLE]

For $\hat{\mathcal{R}}_{n}(\tilde{\bm{a}}_{t},B_{0})$ , from Lemma 6, with probability $1-\delta$ , there exists $\bm{a}^{*}$ that satisfies (44). Thus we have

[TABLE]

By Lemma 7, we can bound $I_{2}$ as follows,

[TABLE]

For $I_{1}$ , consider the Lyapunov function

[TABLE]

Since $\hat{\mathcal{R}}_{n}(\tilde{\bm{a}}_{t},B_{0})$ is convex with respect to $\tilde{\bm{a}}_{t}$ , we have $\frac{d}{dt}J(t)\leq 0$ , which implies $J(t)\leq J(0)$ . Hence we have

[TABLE]

Combining all the estimates above, we conclude that for any $\delta>0$ , with probability larger than $1-4\delta$ , we have

[TABLE]

For the estimate on $\bm{a}^{*}$ , by Lemma 6, we have $\|\bm{a}^{*}\|\leq\frac{\gamma}{\sqrt{m}}$ . To bound $\|\bm{a}_{0}-\bm{a}^{*}\|$ , we have

[TABLE]

Together with the estimates in Lemmas 8, 9 and 10, and without loss of generality assuming that $\gamma\geq 1$ , we obtain

[TABLE]

for $t\in[0,T]$ , and some constant $C$ (we can choose $C=27C_{T}^{6}\tilde{C}^{2}_{T}(c+1)^{8}(2\gamma+1)^{2}$ ).

If we assume $m\geq n$ and take $t\in[0,\frac{\sqrt{n}}{m}]$ , then we can take $T=1$ and obtain

[TABLE]

for some constant $C$ . Moreover, since $\hat{\mathcal{R}}_{n}(\bm{a}_{t},B_{t})$ is non-increasing, $\hat{\mathcal{R}}_{n}(\bm{a}_{t},B_{t})\leq\hat{\mathcal{R}}_{n}(\bm{a}_{\sqrt{n}/m},B_{\sqrt{n}/m})$ . Hence for any $t>\frac{\sqrt{n}}{m}$ , we have

[TABLE]

for some constant $C$ . Combining (73) and (74), we complete the proof for all $t$ .

4.2 Generalization results

The following theorem provides an upper bound for the population error of GD solutions at any time $t\in[0,\infty)$ . It shows that one can use early stopping to reach the optimal error in the absence of over-parametrization.

Theorem 4.2.

Take $\beta=\frac{c}{m}$ for some constant $c$ . Assume that the target function $f^{*}$ satisfies Assumption 2, and $\|f^{*}\|_{\infty}\leq 1$ . Fix any positive constant $T$ . Then for $\delta>0$ , with probability no less than $1-4\delta$ we have, for $t\leq T$

[TABLE]

where $C$ is a constant depending only on $T$ , $\delta$ , $\gamma(f^{*})$ and $c$ .

As a consequence, we have the following early-stopping results.

Corollary 4.3 (Early-stopping solution).

Assume that $m>n$ . Let $t=\frac{\sqrt{n}}{m}$ . Under the condition of Theorem 4.2, we have

[TABLE]

Remark 7.

From these results we conclude that for target functions in a certain RKHS, with high probability the gradient descent dynamics can find a solution with good generalization properties in a short time. Compared to the long-term analysis in the last section, this theorem does not require $m$ to be very large. It works in the “mildly over-parameterized” regime.

The following corollary provides a more detailed study of how to balance $m$ , $n$ and $t$ to achieve the best rates for $\mathcal{R}(\bm{a}_{t},B_{t})$ .

Corollary 4.4.

Assume $m=n^{p}$ for some $p\geq 0$ . Then, if $p\leq\frac{7}{8}$ , taking $t=n^{-\frac{3p}{7}}$ , we have

[TABLE]

If $p>\frac{7}{8}$ , taking $t=n^{-p+\frac{1}{2}}$ , we have

[TABLE]

Proof.

Let $m=n^{p}$ and $t=n^{r}$ . We assume $r\leq 0$ , then

[TABLE]

Expanding the right-hand side of (75), we obtain

[TABLE]

For each $p\geq 0$ , we look for the $r$ that minimizes the largest term on the right-hand side of (80). When $r=-p$ , we have $-r-p=0$ , so the second term is of order one and dominates all other terms; for $r<-p$ it is even larger. Hence, we only have to consider the case when $-p\leq r\leq 0$ . In this interval, we only need to compare the terms with powers $-r-p$ , $r+p-1$ , $6r+2p$ and $7r+3p-\frac{1}{2}$ and can neglect all other terms. The desired results are then obtained by comparing the second term with the other three terms. ∎
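The balancing argument in this proof can be checked with exact rational arithmetic. The sketch below is only an illustration under the assumption that the four powers named in the proof are the only ones that matter: the power $-r-p$ decreases in $r$ while the other three increase, so the largest of the four is minimized at the smallest $r$ where $-r-p$ meets one of them. The function name `balanced_time_exponent` is hypothetical.

```python
from fractions import Fraction


def balanced_time_exponent(p):
    """Return (r, rate): t = n^r balances the four powers, and the
    dominant term then scales like n^rate (assumption: only the four
    powers named in the proof of Corollary 4.4 are compared)."""
    p = Fraction(p)
    # Values of r at which -r-p equals each of the increasing powers.
    crossings = [
        Fraction(1, 2) - p,        # -r-p = r+p-1
        -Fraction(3, 7) * p,       # -r-p = 6r+2p
        Fraction(1, 16) - p / 2,   # -r-p = 7r+3p-1/2
    ]
    r = min(crossings)
    # Keep r inside the interval [-p, 0] considered in the proof.
    r = max(min(r, Fraction(0)), -p)
    return r, -r - p
```

For $p\leq\frac{7}{8}$ this returns $r=-\frac{3p}{7}$ , and for $p>\frac{7}{8}$ it returns $r=-p+\frac{1}{2}$ , matching the choices of $t$ in the corollary; all three crossing points coincide at $p=\frac{7}{8}$ .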

Now we prove Theorem 4.2.

Proof.

Similar to (63), we have

[TABLE]

Here $\rho$ is the distribution of input data $\bm{x}$ . For the first two terms in (81), we have the same estimates as in (64) and (65). For $\mathcal{R}(\tilde{\bm{a}}_{t},B_{0})$ , we have

[TABLE]

The right-hand side of (82) has one more term than (66), and the additional term can be bounded as

[TABLE]

Hence, for any $\delta>0$ , with probability larger than $1-4\delta$ , we have

[TABLE]

Using the estimates of $\|\bm{a}_{t}-\tilde{\bm{a}}_{t}\|$ , $\|B_{t}\|$ , $\|\tilde{\bm{a}}_{t}\|$ , $\|B_{t}-B_{0}\|$ , $\|\bm{a}^{*}\|$ and $\|\bm{a}_{0}-\bm{a}^{*}\|$ derived in the previous lemmas, and assuming that $\frac{1+\sqrt{t}}{\sqrt{m}}+\frac{\sqrt{t}}{n^{1/4}}\leq 1$ , we obtain

[TABLE]

In (4.2), the constant $C$ can be chosen as $C=27C_{T}^{6}\tilde{C}^{2}(c+1)^{8}(2\gamma+1)^{2}$ .

∎

5 Numerical experiments

In this section, we present some numerical results to illustrate our theoretical analysis.

5.1 Fitting random labels

The first experiment studies the convergence of GD dynamics for over-parametrized two-layer neural networks with different initializations. We uniformly sample $\{\bm{x}_{i}\}_{i=1}^{n}$ from $\SS^{d-1}$ , and for each $\bm{x}_{i}$ we specify a label $y_{i}$ , which is uniformly drawn from $[-1,1]$ . In the experiments, we choose $n=50,d=50$ , and network width $m=10,000\gg n$ . Six initializations of different magnitudes are tested. Figure 1 shows the training curves.

We see that the GD algorithm for the neural network models converges exponentially fast for all initializations considered, even for the case when $\beta=m$ . This is consistent with the results of Theorem 3.2.
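The setup of this experiment can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used for Figure 1. It assumes the ReLU activation, the empirical risk $\frac{1}{2n}\sum_{i}(f(\bm{x}_{i})-y_{i})^{2}$ , an initialization with $a_{k}(0)=\pm\beta$ and $\bm{b}_{k}(0)$ uniform on the sphere, and a hypothetical step size `lr`; all function names are illustrative.

```python
import numpy as np


def sample_sphere(rng, n, d):
    """Draw n points uniformly from the unit sphere S^{d-1}."""
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)


def train_two_layer(X, y, m, beta, lr, steps, rng):
    """Full-batch GD on both layers of f(x) = a^T relu(B x).

    Returns the empirical risk recorded before each step.
    """
    n, d = X.shape
    a = beta * rng.choice([-1.0, 1.0], size=m)  # assumed: a_k(0) = +/- beta
    B = sample_sphere(rng, m, d)                # rows are b_k(0)
    losses = []
    for _ in range(steps):
        pre = X @ B.T                           # (n, m) pre-activations
        act = np.maximum(pre, 0.0)
        err = act @ a - y                       # residuals f(x_i) - y_i
        losses.append(0.5 * np.mean(err ** 2))
        grad_a = act.T @ err / n
        grad_B = (err[:, None] * (pre > 0) * a[None, :]).T @ X / n
        a = a - lr * grad_a
        B = B - lr * grad_B
    return np.array(losses)
```

Calling `train_two_layer` with labels drawn uniformly from $[-1,1]$ and several values of `beta` gives training curves of the kind discussed above; the width and step size have to be chosen so that the iteration is stable.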

5.2 Learning the one-neuron function

The next experiment compares the GD dynamics of two-layer neural networks and random feature models. We consider the target function $f^{*}(\bm{x}):=\sigma(\bm{e}_{1}^{T}\bm{x})$ with $\bm{e}_{1}=(1,0,\cdots,0)^{T}\in\mathbb{R}^{d}$ . The training set is given by $\{(\bm{x}_{i},f^{*}(\bm{x}_{i}))\}_{i=1}^{n}$ , with $\{\bm{x}_{i}\}_{i=1}^{n}$ independently drawn from $\SS^{d-1}$ .

We first choose $n=50,d=10$ to build the training set, and then use the gradient descent algorithm with learning rate $\eta=0.01$ to train two-layer neural network and random feature models. We initialize the models using $\beta=0$ . In addition, $10^{4}$ new samples are drawn to evaluate the test error. Figure 2 shows the training and test error curves of the two models for three widths: $m=4,50,1000$ . We see that, when the width is very small, the GD algorithm for the random feature model does not converge, while it does converge for the neural network model and the resulting model does generalize. This is likely due to the special target function we have chosen here. For the intermediate width ( $m=50$ ), the GD algorithm for both models converges, and it converges faster for the neural network model than for the random feature model. The test accuracy is slightly better for the resulting neural network model (but not as good as for the case when $m=4$ ). When $m=1000$ , the behavior of the GD algorithm for the two models is almost the same.
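The difference between the two models in this comparison is only whether $B$ is updated. The following sketch makes this explicit; it is an illustration and not the code behind Figure 2. It assumes the ReLU activation, the risk $\frac{1}{2n}\sum_{i}(f(\bm{x}_{i})-y_{i})^{2}$ and $\beta=0$ (so $\bm{a}$ starts at zero); the helper names and the flag `train_B` are hypothetical.

```python
import numpy as np


def unit_rows(rng, n, d):
    """Draw n points uniformly from the unit sphere S^{d-1}."""
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)


def fit(X, y, m, lr, steps, rng, train_B):
    """GD for f(x) = a^T relu(B x); train_B=False gives the random feature model."""
    n, d = X.shape
    a = np.zeros(m)                 # beta = 0
    B = unit_rows(rng, m, d)        # b_k(0) uniform on the sphere
    for _ in range(steps):
        pre = X @ B.T
        act = np.maximum(pre, 0.0)
        err = act @ a - y
        grad_a = act.T @ err / n
        if train_B:
            # Gradient with respect to B, computed with the current a.
            grad_B = (err[:, None] * (pre > 0) * a[None, :]).T @ X / n
            B = B - lr * grad_B
        a = a - lr * grad_a
    return a, B


def risk(a, B, X, y):
    """Half mean squared error of the model on (X, y)."""
    return 0.5 * np.mean((np.maximum(X @ B.T, 0.0) @ a - y) ** 2)
```

Evaluating `risk` on the training set and on freshly drawn samples with labels $\sigma(\bm{e}_{1}^{T}\bm{x})$ gives the two kinds of curves compared in this experiment.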

Finally, we study the generalization properties of neural network models of different widths. We train two-layer neural networks of different widths until the training error is below $10^{-5}$ . Then we measure the test error. We compare the test error with that of the regularized model proposed in [13]:

[TABLE]

where

[TABLE]

The results are shown in Figure 3. One sees that when the width is small, the test error is small for both methods. However, when the width becomes very large, the un-regularized neural network model does not generalize well. In other words, implicit regularization fails.

The above results are consistent with the theoretical lower bound (40), which states that learning with GD suffers from the curse of dimensionality for functions in Barron space. Here the one-neuron function serves as a specific example. Intuitively, the one-neuron target function $f^{*}(\bm{x})=\sigma((\bm{w}^{*})^{T}\bm{x})$ only relies on the specific direction $\bm{w}^{*}$ . However, the basis functions $\{\sigma(\bm{w}_{j}^{T}\bm{x})\}_{j=1}^{m}$ have directions $\bm{w}_{j}$ that are uniformly drawn from $\SS^{d-1}$ . In high dimension, we know $\langle\bm{w}_{j},\bm{w}^{*}\rangle\approx 0$ for any $\bm{w}_{j}$ uniformly drawn from $\SS^{d-1}$ . Therefore, it is not surprising to see that learning with uniform features suffers from the curse of dimensionality.
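The near-orthogonality of random directions in high dimension is easy to observe numerically. The sketch below is an illustration with a hypothetical function name; it takes $\bm{w}^{*}=\bm{e}_{1}$ and estimates the average of $|\langle\bm{w}_{j},\bm{w}^{*}\rangle|$ over directions drawn uniformly from the sphere.

```python
import numpy as np


def mean_abs_overlap(d, m, seed=0):
    """Average |<w_j, w*>| over m directions w_j uniform on S^{d-1}, with w* = e_1."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)  # normalize rows onto the sphere
    w_star = np.zeros(d)
    w_star[0] = 1.0
    return float(np.mean(np.abs(W @ w_star)))
```

The returned value shrinks as `d` grows, roughly like $1/\sqrt{d}$ , so in high dimension few of the random features are aligned with the target direction.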

6 Conclusion

To put things into perspective, let us first recall some results from [13].

1. One can define a space of functions called the Barron space. The Barron space is the union of all RKHS with kernels defined by

[TABLE]

with respect to all probability distributions $\pi$ .

2. For regularized models with a suitably crafted regularization term, optimal generalization error estimates (i.e. rates that are comparable to the Monte Carlo rates) can be established for all target functions in the Barron space.

In the present paper, we have shown that for over-parametrized two-layer neural networks without explicit regularization, the gradient descent algorithm is sufficient for the purpose of optimization. But to obtain dimension-independent error rates for generalization, one has to require that the target function be in the RKHS with a kernel defined by the initialization. In other words, given a target function in the Barron space, in order for implicit regularization to work, one has to know beforehand the kernel function for that target function and use that kernel function to initialize the GD algorithm. This requirement is certainly impractical. In the absence of such knowledge, one should expect to encounter the curse of dimensionality for general target functions in Barron space, as is proved in this paper.

We have also studied the case with general network width. Our results point in the same direction as for the over-parametrized regime, although in the general case one has to rely on early stopping to obtain good generalization error bounds. Our analysis does not rule out completely the possibility that in some scaling regimes of $n,m,t$ , the GD algorithm for two-layer neural network models may have better generalization properties than that of the related kernel method.

Our analysis was carried out under special circumstances, e.g. with a particular choice of $\pi_{0}$ and a very special domain $\mathbb{S}^{d-1}$ for the input. While it is certainly possible to extend this analysis to more general settings, we feel that the value of such an analysis is limited since our main message is a negative one: Without explicit regularization, the generalization properties of two-layer neural networks are likely to be no better than that of the kernel method.

From a technical viewpoint, our analysis was facilitated greatly by the fact that the dynamics of the $\bm{b}$ ’s is much slower than that of the $\bm{a}$ ’s, as a consequence of the smallness of $\beta$ . As a result, the $\bm{b}$ ’s are effectively frozen in the GD dynamics. While this is the same setup as the ones used in practice, one can also imagine pulling out an explicit scaling factor to account for the smallness of $\beta$ , e.g.

[TABLE]

as in [28, 25, 27]. In this case, the separation of time scales is no longer valid and one can potentially obtain a very different picture. While this is certainly an interesting avenue to pursue, so far there are no results concerning the effect of implicit regularization in such a setting.

Acknowledgement: The work presented here is supported in part by a gift to Princeton University from iFlytek and the ONR grant N00014-13-1-0338.

Appendix A Proof of Lemma 6

Proof.

For any $B_{0}$ , let $\bm{a}^{*}(B_{0})=\{a^{*}(\bm{b}_{k}^{0})/m\}_{k=1}^{m}$ , where $a^{*}$ is the function defined in Assumption 2. Let

[TABLE]

Then we have $\mathbb{E}_{B_{0}}f(\bm{x};\bm{a}^{*}(B_{0}),B_{0})=f^{*}(\bm{x})$ . Now, consider

[TABLE]

then if $\tilde{B}_{0}$ is different from $B_{0}$ at only one $\bm{b}_{k}^{0}$ , we have

[TABLE]

Hence, by McDiarmid’s inequality, for any $\delta>0$ , with probability no less than $1-\delta$ , we have

[TABLE]

On the other hand,

[TABLE]

Therefore, we have

[TABLE]

Finally, by Assumption 2, $\|\bm{a}^{*}\|\leq\frac{\gamma}{\sqrt{m}}$ . ∎
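The construction used in this proof is a Monte Carlo discretization of the integral representation in Assumption 2. The sketch below illustrates it for a hypothetical bounded coefficient function $a^{*}(\bm{b})=b_{1}$ (so that $\gamma=1$ ), assuming the ReLU activation and $\pi_{0}$ uniform on the sphere; none of the names come from the paper.

```python
import numpy as np


def unit_rows(rng, n, d):
    """Draw n points uniformly from the unit sphere S^{d-1}."""
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)


def a_star_fn(B):
    """Hypothetical coefficient function a*(b) = b_1, so sup |a*| <= 1."""
    return B[:, 0]


def random_feature_coeffs(B0):
    """Coefficients a*_k = a*(b_k^0) / m, as in the construction above."""
    return a_star_fn(B0) / B0.shape[0]


def model(X, a, B):
    """Evaluate f(x; a, B) = a^T relu(B x) on the rows of X."""
    return np.maximum(X @ B.T, 0.0) @ a
```

Since every entry of the coefficient vector is at most $\gamma/m$ in absolute value, its Euclidean norm is at most $\gamma/\sqrt{m}$ , which is the norm control stated at the end of the proof. Comparing `model` for a moderate `m` against a much larger `m` gives a rough view of the Monte Carlo error.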

Appendix B Proof of Lemma 7

For any $Q>0$ , let $\mathcal{F}_{Q}=\{f(\cdot;\bm{a},B_{0}):\ \|\bm{a}\|\leq Q\}$ . We can bound the Rademacher complexity of $\mathcal{F}_{Q}$ as follows.

[TABLE]

Next, let $\mathcal{H}_{Q}=\{(f(\cdot;\bm{a},B_{0})-f^{*})^{2}:\ \|\bm{a}\|\leq Q\}$ . Recall that $|f^{*}(\bm{x})|\leq 1$ for any $\bm{x}$ , and by the Cauchy-Schwarz inequality, $|f(\bm{x};\bm{a},B_{0})|\leq\sqrt{m}Q$ . Hence we can bound the Rademacher complexity of $\mathcal{H}_{Q}$ by

[TABLE]

using that $(f(\cdot;\bm{a},B_{0})-f^{*})^{2}$ is Lipschitz continuous with Lipschitz constant bounded by $2\sqrt{m}Q+1$ . Therefore, for any $\delta>0$ , with probability larger than $1-\delta$ , we have

[TABLE]

for any $\bm{a}$ with $\|\bm{a}\|\leq Q$ .
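For a class that is linear in $\bm{a}$ with a norm constraint, such as $\mathcal{F}_{Q}$ , the supremum inside the empirical Rademacher complexity has a closed form, so the complexity can be estimated by Monte Carlo. The sketch below is an illustration only: the function name is hypothetical, and the comparison value $Q\sqrt{m/n}$ is the standard Jensen-type bound for features bounded by one, which may differ in constants from the bound displayed above.

```python
import numpy as np


def empirical_rademacher(Phi, Q, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_eps sup_{||a|| <= Q} (1/n) sum_i eps_i a^T phi_i.

    Phi has shape (n, m); row i is the feature vector relu(B0 x_i).
    """
    rng = np.random.default_rng(seed)
    n = Phi.shape[0]
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n))  # Rademacher signs
    # Over a Euclidean ball the supremum is Q * || sum_i eps_i phi_i || / n.
    sups = Q * np.linalg.norm(eps @ Phi, axis=1) / n
    return float(sups.mean())
```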

Finally, for any integer $k$ , let $Q_{k}=2^{k}$ and $\delta_{k}=2^{-|k|}\delta$ . Then, with probability larger than

[TABLE]

we have that (96) holds for all $Q=Q_{k}$ . Given any $\bm{a}\in\mathbb{R}^{m}$ , we can find a $Q_{k}$ such that $\|\bm{a}\|\leq Q_{k}\leq 2\|\bm{a}\|$ , which means

[TABLE]

This completes the proof.
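As a check on the probability in Lemma 7, the failure probabilities $\delta_{k}=2^{-|k|}\delta$ used in the union bound over all integers $k$ can be summed explicitly. The following worked computation is a sketch of that step:

```latex
\sum_{k\in\mathbb{Z}}\delta_{k}
  =\delta\sum_{k\in\mathbb{Z}}2^{-|k|}
  =\delta\Bigl(1+2\sum_{k=1}^{\infty}2^{-k}\Bigr)
  =3\delta .
```

This is consistent with the success probability $1-3\delta$ in the statement of the lemma.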

Appendix C Proof of Lemma 1

Proof.

Define $\mathcal{F}=\{h(a,\bm{b})=a\sigma(\bm{b}^{T}\bm{x})\,:\,\|\bm{x}\|\leq 1\}$ . By the standard Rademacher complexity bound (see Theorem 26.5 of [26]), we have, with probability at least $1-\delta$ ,

[TABLE]

Moreover, since $\phi_{k}(\cdot)\stackrel{{\scriptstyle\text{def}}}{{=}}a_{k}\sigma(\cdot)$ is $\beta$-Lipschitz continuous, by applying the contraction property of Rademacher complexity (see Lemma 26.9 of [26]) we have

[TABLE]

where the last inequality follows from Lemma 26.10 of [26]. Thus with probability $1-\delta$ , we have that for any $\|\bm{x}\|=1$ ,

[TABLE]

Thus $\hat{\mathcal{R}}_{n}(\Theta_{0})\leq\frac{1}{2n}\sum_{i=1}^{n}(1+|f(\bm{x}_{i};\Theta_{0})|)^{2}\leq\frac{1}{2}(1+\sqrt{m}\beta(2+\sqrt{\ln(1/\delta)}))^{2}$ . ∎
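The size of the initial output can be compared with the quantity $\sqrt{m}\beta(2+\sqrt{\ln(1/\delta)})$ appearing in the last line of the proof. The sketch below is an illustration under assumptions: ReLU activation, $a_{k}(0)=\pm\beta$ with random signs, $\bm{b}_{k}(0)$ uniform on the sphere, and a single random unit input; the function name is hypothetical.

```python
import numpy as np


def initial_output_vs_bound(m, d, beta, delta, seed=0):
    """Return (|f(x; Theta_0)|, sqrt(m) * beta * (2 + sqrt(ln(1/delta))))."""
    rng = np.random.default_rng(seed)
    a0 = beta * rng.choice([-1.0, 1.0], size=m)       # assumed: a_k(0) = +/- beta
    B0 = rng.standard_normal((m, d))
    B0 /= np.linalg.norm(B0, axis=1, keepdims=True)   # b_k(0) on the unit sphere
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)                            # one input with ||x|| = 1
    f0 = float(a0 @ np.maximum(B0 @ x, 0.0))
    bound = np.sqrt(m) * beta * (2.0 + np.sqrt(np.log(1.0 / delta)))
    return abs(f0), bound
```

In such a draw the initial output is typically well below the bound, since the random signs of $a_{k}(0)$ cause cancellations among the $m$ terms.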

Appendix D Proof of Lemma 2

Proof.

For a given $\varepsilon\geq 0$ , define events

[TABLE]

Hoeffding’s inequality gives us that

[TABLE]

Thus with probability at least $(1-e^{-2m\varepsilon^{2}})^{2n^{2}}\geq 1-2n^{2}e^{-2m\varepsilon^{2}}$ , we have

[TABLE]

Using Weyl’s Theorem, we have

[TABLE]

Taking $\varepsilon=\lambda_{n}/4$ , we complete the proof. ∎
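Weyl's theorem, as used in the last step, says that the ordered eigenvalues of a symmetric matrix move by at most the spectral norm of a symmetric perturbation. A small numerical illustration, with a hypothetical function name, is:

```python
import numpy as np


def weyl_gap(A, E):
    """For symmetric A and E, return (max_i |lambda_i(A+E) - lambda_i(A)|, ||E||_2)."""
    lam_A = np.linalg.eigvalsh(A)        # eigenvalues in ascending order
    lam_AE = np.linalg.eigvalsh(A + E)
    gap = float(np.max(np.abs(lam_AE - lam_A)))
    return gap, float(np.linalg.norm(E, 2))
```

The first returned value never exceeds the second, which is why a small deviation between two Gram matrices keeps the smallest eigenvalue bounded away from zero.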

Bibliography


[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
[2] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
[3] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
[4] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
[5] Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
[6] Mikio L Braun. Accurate error bounds for the eigenvalues of the kernel matrix. Journal of Machine Learning Research, 7(Nov):2303–2328, 2006.
[7] Leo Breiman. Hinging hyperplanes for regression, classification, and function approximation. IEEE Transactions on Information Theory, 39(3):999–1013, 1993.
[8] Yuan Cao and Quanquan Gu. A generalization theory of gradient descent for learning over-parameterized deep ReLU networks. arXiv preprint arXiv:1902.01384, 2019.
