Training Neural Networks as Learning Data-adaptive Kernels: Provable Representation and Approximation Benefits

Xialiang Dou; Tengyuan Liang

arXiv:1901.07114·stat.ML·July 27, 2020

TL;DR

This paper demonstrates that neural networks trained with gradient flow adaptively learn a kernel representation, enabling better approximation and generalization compared to fixed basis methods, with formal proofs and convergence results.

Contribution

It introduces a dynamic RKHS framework showing neural networks learn an adaptive kernel and perform optimal projections, formalizing their representation and approximation advantages.

Findings

1. Neural networks learn an adaptive RKHS during training.
2. Gradient flow performs the global least-squares projection onto the adaptive RKHS.
3. Neural network functions converge to kernel ridgeless regression with an adaptive kernel.

Abstract

Consider the problem: given the data pair $(\mathbf{x}, \mathbf{y})$ drawn from a population with $f_{*}(x) = \mathbf{E}[\mathbf{y} \mid \mathbf{x} = x]$, specify a neural network model and run gradient flow on the weights over time until reaching any stationarity. How does $f_{t}$, the function computed by the neural network at time $t$, relate to $f_{*}$, in terms of approximation and representation? What are the provable benefits of the adaptive representation by neural networks compared to the pre-specified fixed basis representation in the classical nonparametric literature? We answer the above questions via a dynamic reproducing kernel Hilbert space (RKHS) approach indexed by the training process of neural networks. Firstly, we show that when reaching any local stationarity, gradient flow learns an adaptive RKHS representation and performs the global least-squares projection onto the…

Tables

Table 1: Nature of the results studied in this paper.

	finite neurons $m$	infinite neurons $m \to \infty$
finite samples $n$	Interpolation (finite rank kernel, Thms. 3.1, 3.2 & Prop. 4.1)	Interpolation (finite rank kernel, Thms. 3.1, 3.2 & Prop. 4.1)
infinite samples $n \to \infty$	Approximation (finite rank kernel, Thms. 3.1 & 3.2)	Approximation (possibly universal kernel², Thms. 3.1 & 3.2)

² Whether the kernel is universal in the $m, n \to \infty$ case still depends on $f_{*}$ and the data distribution $P$. See the simulations of Maennel et al. [2018].

Equations

$$L(f)=\mathbf{E}_{(\mathbf{x},\mathbf{y})\sim P}\frac{1}{2}(\mathbf{y}-f(\mathbf{x}))^{2}=\mathbf{E}_{\mathbf{x}\sim P_{x}}\frac{1}{2}(f_{*}(\mathbf{x})-f(\mathbf{x}))^{2}+\mathbf{E}_{(\mathbf{x},\mathbf{y})\sim P}\frac{1}{2}(\mathbf{y}-f_{*}(\mathbf{x}))^{2},$$

$$f_{t}(x):=\sum_{j=1}^{m}w_{j}(t)\sigma(x^{T}u_{j}(t)).$$

$$\frac{dw_{j}(t)}{dt}=-\mathbf{E}_{\mathbf{z}}\Big[\frac{\partial\ell(\mathbf{y},f_{t})}{\partial f}\sigma(\mathbf{x}^{T}u_{j}(t))\Big],\qquad\frac{du_{j}(t)}{dt}=-\mathbf{E}_{\mathbf{z}}\Big[\frac{\partial\ell(\mathbf{y},f_{t})}{\partial f}w_{j}(t)\mathbbm{1}_{\mathbf{x}^{T}u_{j}(t)\geq 0}\,\mathbf{x}\Big].$$

$$f_{t}(x):=\int\sigma(x^{T}u)\,\tau_{t}(du),$$

$$\mathbf{E}_{\mathbf{z}\sim P}(\mathbf{y}-f_{t}(\mathbf{x}))^{2}=\|f_{t}-f_{*}\|^{2}_{L^{2}_{\mu}}+\mathbf{E}_{\mathbf{z}\sim P}(\mathbf{y}-f_{*}(\mathbf{x}))^{2}.$$

$$\frac{1}{2n}\sum_{i=1}^{n}(y_{i}-f_{t}(x_{i}))^{2}.$$

$$(\mathcal{T}f)(\Theta):=\int f(x)\|\Theta\|\sigma(x^{T}\Theta)\,\mu(dx),\quad\forall\,\Theta\in\text{supp}(\rho_{t}).$$

$$(\mathcal{T}^{\star}p)(x):=\int p(\Theta)\|\Theta\|\sigma(x^{T}\Theta)\,|\rho_{t}|(d\Theta).$$

$$H_{t}(x,\tilde{x})=\int\|\Theta\|^{2}\sigma(x^{T}\Theta)\sigma(\tilde{x}^{T}\Theta)\,|\rho_{t}|(d\Theta),\quad\text{and}\quad(\mathcal{T}^{\star}\mathcal{T}f)(x):=\int H_{t}(x,\tilde{x})f(\tilde{x})\,\mu(d\tilde{x}).$$

$$\mathcal{H}_{t}=\Big\{h~\Big|~h(x)=\sum_{i}h_{i}e_{i}(x),~\sum_{i}\frac{h_{i}^{2}}{\lambda_{i}}<\infty\Big\}.$$

$$f_{\infty}\in\operatorname*{arg\,min}_{g\in\mathcal{H}_{\infty}}\|f_{*}-g\|^{2}_{L^{2}_{\mu}}.$$

$$f_{*}=f_{\infty}+\Delta_{\infty}.$$

$$f_{\infty}\in\mathcal{H}_{\infty},\quad\Delta_{\infty}\in\text{Ker}(\mathcal{K}_{\infty})\subset\text{Ker}(\mathcal{H}_{\infty}),$$

$$\frac{1}{2n}\sum_{i=1}^{n}(y_{i}-f_{t}(x_{i}))^{2}+\frac{\lambda}{2m}\sum_{j=1}^{m}\big[w_{j}(t)^{2}+\|u_{j}(t)\|^{2}\big].$$

$$\lim_{\lambda\rightarrow 0}\widehat{f}^{{\rm nn},\lambda}_{\infty}(x)=\widehat{H}_{\infty}(x,X)\widehat{H}_{\infty}(X,X)^{+}Y=:\widehat{f}^{\rm rkhs}_{\infty}(x)\quad\text{(ridgeless regression with kernel }\widehat{H}_{\infty}\text{)}.$$

$$f_{\infty}(x)=\int\|\Theta\|\sigma(x^{T}\Theta)\,\rho^{(m)}_{\infty}(d\Theta);$$

$$f_{\infty}\in\operatorname*{arg\,min}_{g\in\mathcal{H}_{\infty}}\|f_{*}-g\|^{2}_{L^{2}_{\mu}}.$$

$$\Delta_{\infty}(x)=f_{*}(x)-f_{\infty}(x)\in\text{Ker}(\mathcal{H}_{\infty}).$$

$$\widehat{f}_{\infty}\in\operatorname*{arg\,min}_{g\in\widehat{\mathcal{H}}_{\infty}}\frac{1}{n}\sum_{i=1}^{n}(y_{i}-g(x_{i}))^{2}.$$

$$K_{\infty}(x,\tilde{x})=\int\Big(\|\Theta\|^{2}\mathbbm{1}_{x^{T}\Theta\geq 0}\mathbbm{1}_{\tilde{x}^{T}\Theta\geq 0}\,x^{T}\tilde{x}+\sigma(x^{T}\Theta)\sigma(\tilde{x}^{T}\Theta)\Big)|\rho_{\infty}|(d\Theta)\neq H_{\infty}(x,\tilde{x})$$

$$(\mathcal{K}_{t}f)(x):=\int K_{t}(x,\tilde{x})f(\tilde{x})\,\mu(d\tilde{x}).$$

$$f_{*}=f_{\infty}+\Delta_{\infty}.$$

$$f_{\infty}\in\mathcal{H}_{\infty},\quad\Delta_{\infty}\in\text{Ker}(\mathcal{K}_{\infty})\subset\text{Ker}(\mathcal{H}_{\infty}),$$

$$H_{\infty}(x,\tilde{x})=\int\sigma(x^{T}\Theta)\sigma(\tilde{x}^{T}\Theta)\,|\rho_{\infty}|(d\Theta)$$

$$K_{\infty}(x,\tilde{x})=\int\Big(\|\Theta\|^{2}\mathbbm{1}_{x^{T}\Theta\geq 0}\mathbbm{1}_{\tilde{x}^{T}\Theta\geq 0}\,x^{T}\tilde{x}+\sigma(x^{T}\Theta)\sigma(\tilde{x}^{T}\Theta)\Big)|\rho_{\infty}|(d\Theta).$$

$$H_{\infty}(X,X)=\underbrace{\sigma(X\Theta_{\infty}^{T})}_{n\times 1}\,\underbrace{\sigma(X\Theta_{\infty}^{T})^{T}}_{1\times n}$$

$$K_{\infty}(X,X)\succeq\underbrace{\text{diag}(\mathbbm{1}_{X\Theta_{\infty}^{T}\geq 0})X}_{n\times d}\,\underbrace{X^{T}\text{diag}(\mathbbm{1}_{X\Theta_{\infty}^{T}\geq 0})}_{d\times n}$$

$$\frac{1}{2n}\sum_{i=1}^{n}(y_{i}-f_{t}(x_{i}))^{2}+\frac{\lambda}{2m}\sum_{j=1}^{m}\big[w_{j}(t)^{2}+\|u_{j}(t)\|^{2}\big].$$

$$\widehat{f}^{{\rm nn},\lambda}_{\infty}(x)=\widehat{H}^{\lambda}_{\infty}(x,X)\Big[\frac{n}{m}\lambda\cdot I_{n}+\widehat{H}^{\lambda}_{\infty}(X,X)\Big]^{-1}Y.$$

$$\lim_{\lambda\rightarrow 0}\widehat{f}^{{\rm nn},\lambda}_{\infty}(x)=\widehat{H}_{\infty}(x,X)\widehat{H}_{\infty}(X,X)^{+}Y=\widehat{f}^{\rm rkhs}_{\infty}(x).$$

Full text

Training Neural Networks as Learning Data-adaptive Kernels: Provable Representation and Approximation Benefits

Xialiang Dou

Department of Statistics, University of Chicago

Tengyuan Liang

Booth School of Business, University of Chicago

Liang gratefully acknowledges support from the George C. Tiao Fellowship.

Abstract

Consider the problem: given the data pair $(\mathbf{x},\mathbf{y})$ drawn from a population with $f_{*}(x)=\mathbf{E}[\mathbf{y}|\mathbf{x}=x]$ , specify a neural network model and run gradient flow on the weights over time until reaching any stationarity. How does $f_{t}$ , the function computed by the neural network at time $t$ , relate to $f_{*}$ , in terms of approximation and representation? What are the provable benefits of the adaptive representation by neural networks compared to the pre-specified fixed basis representation in the classical nonparametric literature? We answer the above questions via a dynamic reproducing kernel Hilbert space (RKHS) approach indexed by the training process of neural networks. Firstly, we show that when reaching any local stationarity, gradient flow learns an adaptive RKHS representation and performs the global least-squares projection onto the adaptive RKHS, simultaneously. Secondly, we prove that as the RKHS is data-adaptive and task-specific, the residual for $f_{*}$ lies in a subspace that is potentially much smaller than the orthogonal complement of the RKHS. The result formalizes the representation and approximation benefits of neural networks. Lastly, we show that the neural network function computed by gradient flow converges to the kernel ridgeless regression with an adaptive kernel, in the limit of vanishing regularization. The adaptive kernel viewpoint provides new angles of studying the approximation, representation, generalization, and optimization advantages of neural networks.

Keywords: adaptive estimation, neural networks, reproducing kernel Hilbert space, gradient flow dynamics, representation learning, algorithmic approximation, interpolation.

1 Introduction

Consider i.i.d. data pairs drawn from a joint distribution $(\mathbf{x},\mathbf{y})\sim P=P_{x}\times P_{y|x}$ on the space $\mathcal{X}\times\mathcal{Y}$. At the intersection of statistical learning theory [Vapnik, 1998] and approximation theory [Cybenko, 1989], the following approximation problem must be understood first, before any further statistical results can be established. For a model class $\mathcal{F}$, one is interested in whether there exists $f\in\mathcal{F}:\mathcal{X}\rightarrow\mathcal{Y}$ such that the population squared loss is small,

$$L(f)=\mathbf{E}_{(\mathbf{x},\mathbf{y})\sim P}\frac{1}{2}(\mathbf{y}-f(\mathbf{x}))^{2}=\mathbf{E}_{\mathbf{x}\sim P_{x}}\frac{1}{2}(f_{*}(\mathbf{x})-f(\mathbf{x}))^{2}+\mathbf{E}_{(\mathbf{x},\mathbf{y})\sim P}\frac{1}{2}(\mathbf{y}-f_{*}(\mathbf{x}))^{2},\tag{1.1}$$

with the conditional expectation (or Bayes estimator) defined as $f_{*}(x):=\mathop{\mathbf{E}}[\mathbf{y}|\mathbf{x}=x]$. Eqn. (1.1) generally reads as approximating $f_{*}$ in the mean squared error sense.
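For completeness, the split of the population squared loss into an approximation term and an irreducible noise term follows from a standard argument, sketched here under the implicit assumption that $\mathbf{y}$ and $f(\mathbf{x})$ are square integrable. Adding and subtracting $f_{*}(\mathbf{x})$ inside the square gives

$$\mathbf{E}(\mathbf{y}-f(\mathbf{x}))^{2}=\mathbf{E}(\mathbf{y}-f_{*}(\mathbf{x}))^{2}+\mathbf{E}(f_{*}(\mathbf{x})-f(\mathbf{x}))^{2}+2\,\mathbf{E}\big[(\mathbf{y}-f_{*}(\mathbf{x}))(f_{*}(\mathbf{x})-f(\mathbf{x}))\big],$$

and the cross term vanishes after conditioning on $\mathbf{x}$, since $\mathbf{E}[\mathbf{y}-f_{*}(\mathbf{x})\mid\mathbf{x}]=0$ by the definition of $f_{*}$:

$$\mathbf{E}\big[(\mathbf{y}-f_{*}(\mathbf{x}))(f_{*}(\mathbf{x})-f(\mathbf{x}))\big]=\mathbf{E}\big[(f_{*}(\mathbf{x})-f(\mathbf{x}))\,\mathbf{E}[\mathbf{y}-f_{*}(\mathbf{x})\mid\mathbf{x}]\big]=0.$$

Multiplying through by $1/2$ recovers the two terms of the population loss; only the first depends on $f$.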

Statistically, researchers approach the above question mainly in two ways. The first is by assuming that the conditional expectation $f_{*}$ lies in the correct model class $\mathcal{F}$. For example, $\mathcal{F}$ may consist of linear models or splines with a particular order of smoothness, or more broadly of functions lying in a reproducing kernel Hilbert space (RKHS). Conceptually, this “well-specification” assumption requires substantial knowledge about what model class $\mathcal{F}$ might be suitable for the regression task at hand, which is often unavailable in practice. Within each framework, minimax optimal rates have been established and studied extensively [Stone, 1980, Wahba, 1990]. The second way, which extends the first approach further, considers all $f_{*}$ under some mild conditions. Building upon a universal approximation theorem, one studies a sequence of model classes $\mathcal{F}_{\epsilon}$ called sieves, indexed by a changing $\epsilon$ [Geman and Hwang, 1982], such that the class $\mathcal{F}_{\epsilon}$ contains an $\epsilon$-approximation to any $f_{*}$ under some metric. A final result usually requires a careful balancing of the approximation and stochastic errors by tuning $\epsilon$. Particular cases for the latter approach include polynomials (Stone-Weierstrass, Bernstein), radial basis functions [Park and Sandberg, 1991, Niyogi and Girosi, 1996], and two-layer and multi-layer neural networks [Cybenko, 1989, Hornik et al., 1989, Anthony and Bartlett, 2009, Rahimi and Recht, 2008, Daniely et al., 2016, Bach, 2017, Farrell et al., 2018, Koehler and Risteski, 2018, Poggio et al., 2017].

However, the following significant drawbacks of the current theory make it inadequate as an adaptive and realistic explanation of the practical success of neural networks. Firstly, the function computed in practice could be very different from the one whose existence or construction is asserted by approximation theory. To see this, consider multi-layer neural networks. It is hard to conceive that the function computed in practice via the now-standard stochastic gradient descent (SGD) training procedure is close to the one asserted by the universal approximation results. Secondly, in practice, researchers usually explore different model classes $\mathcal{F}$ to learn which representation best suits the data, for example by using different kernel machines or random forests, or by specifying certain architectures and then running SGD on neural networks. In this case, strictly speaking, the choice of the model class depends on the data in an adaptive way, without prior knowledge about the basis. There have been substantial advances made to address the above two concerns (for instance, Jones [1992] on the first and Huang et al. [2008], Barron et al. [2008] on the second) for $\mathcal{F}$ being a linear span of a library of candidate functions (a union of various sets of bases that can be correlated), with greedy selection rules. Nevertheless, the current theory still falls short of describing the approximation and adaptivity for the non-convex and possibly non-smooth gradient descent training on all-layer weights of the neural networks, as done in practice.

We take a step to bridge the above mismatch in the current theory and practice for neural networks and to establish a theoretical framework where the model classes adapt to the data. In particular, we answer the following algorithmic approximation question:

Given data pair $(\mathbf{x},\mathbf{y})\sim P$, denote $f_{*}(x)=\mathop{\mathbf{E}}[\mathbf{y}|\mathbf{x}=x]$. Specify a neural network model, and run gradient flow until any stationarity ($t\rightarrow\infty$). Denote the computed function by $f_{t}(x)$. How does $f_{t}(x)$ relate to $f_{*}(x)$, in terms of approximation and representation?

Also, we aim to formalize and shed light on the representation benefits of neural networks:

What are the provable benefits of the adaptive representation learned by training neural networks compared to the classical nonparametric pre-specified fixed basis representation?

The intimate connection between two-layer neural networks and reproducing kernel Hilbert spaces (RKHS) has been studied in the literature, see Rahimi and Recht [2008], Cho and Saul [2009], Daniely et al. [2016], Bach [2017], Jacot et al. [2018]. However, to the best of our knowledge, known results are mostly based on a fixed RKHS (in our notation $K_{0}$ in Section 5.1). In that sense, random features for kernel learning [Rahimi and Recht, 2008, 2009, Rudi and Rosasco, 2017] can be viewed as neural networks with fixed, randomly sampled first-layer weights and tunable second-layer weights. From the neural networks side, Rotskoff and Vanden-Eijnden [2018], Mei et al. [2018], Sirignano and Spiliopoulos [2019] study the mean-field theory for two-layer neural networks, and Jacot et al. [2018], Du et al. [2018], Chizat and Bach [2018], Ghorbani et al. [2019] study the linearization of neural networks around the initialization and draw connections to the RKHS $\mathcal{K}_{0}$ in various over-parametrized settings. In contrast, we will establish a general theory with the dynamic and data-adaptive RKHS $\mathcal{K}_{t}$ obtained via training neural networks, with standard gradient flow on the weights of both layers. Connections and distinctions to the literature that motivates our study are discussed in detail in Section 5. As a distinctive feature of the adaptive theory, we emphasize that all $f_{*}\in L^{2}(P_{x})$ are considered, without pre-specified structural assumptions.

1.1 Problem Formulation

In this paper, we consider the time-varying function $f_{t}$ to approximate $f_{*}$ , parametrized by a two-layer rectified linear unit (ReLU) neural network (NN).

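$$f_{t}(x):=\sum_{j=1}^{m}w_{j}(t)\sigma(x^{T}u_{j}(t)).\tag{1.2}$$

The time index $t$ corresponds to the evolution of parameters driven by the gradient flow/descent (GD) training dynamics. Here each individual pair $(w_{j}\in\mathbb{R},u_{j}\in\mathbb{R}^{d})$ in the summation is associated with a neuron. Consider the gradient flow as the training dynamics for the weights of the neurons: for the loss function $\ell(y,f)=(y-f)^{2}/2$ and the random variable $\mathbf{z}:=(\mathbf{x},\mathbf{y})$, the parameters $(w_{j},u_{j})$ evolve with time as follows

$$\frac{dw_{j}(t)}{dt}=-\mathbf{E}_{\mathbf{z}}\Big[\frac{\partial\ell(\mathbf{y},f_{t})}{\partial f}\sigma(\mathbf{x}^{T}u_{j}(t))\Big],\qquad\frac{du_{j}(t)}{dt}=-\mathbf{E}_{\mathbf{z}}\Big[\frac{\partial\ell(\mathbf{y},f_{t})}{\partial f}w_{j}(t)\mathbbm{1}_{\mathbf{x}^{T}u_{j}(t)\geq 0}\,\mathbf{x}\Big].\tag{1.3}$$

Equivalently, we can rewrite the function computed by NN at time $t$ as

$$f_{t}(x):=\int\sigma(x^{T}u)\,\tau_{t}(du),\tag{1.4}$$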

where $\tau_{t}=\sum_{j=1}^{m}w_{j}(t)\delta_{u_{j}(t)}$ is a signed combination of delta measures. We will define a careful rescaling of $\tau_{t}$ denoted as $\rho_{t}$ (Eqn. (5.8)), then derive the corresponding distribution dynamic for $\rho_{t}$ driven by the gradient flow later in Section 5.2. The rescaled formulation naturally extends to the infinite neurons case with $m\rightarrow\infty$ .

In this paper, by considering various distributions of $\mathbf{z}$, we study the following two problems: approximation and empirical risk minimization (ERM).

Function Approximation: The data pair $\mathbf{z}\sim P$ is sampled from the population joint distribution. We are going to answer how $f_{t}$ approximates $f_{*}(x)=\mathop{\mathbf{E}}[\mathbf{y}|\mathbf{x}=x]$ in function spaces, induced by the gradient flow on neuron weights

$$\mathbf{E}_{\mathbf{z}\sim P}(\mathbf{y}-f_{t}(\mathbf{x}))^{2}=\|f_{t}-f_{*}\|^{2}_{L^{2}_{\mu}}+\mathbf{E}_{\mathbf{z}\sim P}(\mathbf{y}-f_{*}(\mathbf{x}))^{2}.\tag{1.5}$$

Here we denote $\mu:=P_{x}$, and remark that all $f_{*}\in L^{2}_{\mu}$ are considered without additional assumptions.

ERM and Interpolation: The data pair $\mathbf{z}\sim\frac{1}{n}\sum_{i=1}^{n}\delta_{\mathbf{x}=x_{i},\mathbf{y}=y_{i}}$ follows the empirical distribution. We will study gradient flow for the ERM

$$\frac{1}{2n}\sum_{i=1}^{n}(y_{i}-f_{t}(x_{i}))^{2}.\tag{1.6}$$

In this case, the target reduces to $\widehat{\mathop{\mathbf{E}}}[\mathbf{y}|\mathbf{x}=x_{i}]=y_{i}$ with $\widehat{\mathop{\mathbf{E}}}$ as the empirical expectation. When the minimizer of Eqn. (1.6) achieves the zero loss, we call it the interpolation problem [Zhang et al., 2016, Belkin et al., 2018b, Ma et al., 2017, Liang and Rakhlin, 2018, Rakhlin and Zhai, 2018, Belkin et al., 2018a]. Here we are interested in when and how $f_{t}(x_{i})$ interpolates $y_{i}$ , for $1\leq i\leq n$ .

Finally, we remark that extending the gradient flow results to (1) GD with positive step size and (2) mini-batch stochastic GD are interesting standalone research topics. The reasons are that the optimization is non-smooth for the ReLU activation and that the interplay between the batch size and step size is less transparent in non-convex problems.
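As a rough illustration of the ERM setting above, the following minimal sketch discretizes the gradient flow on the weights $(w_{j},u_{j})$ with explicit Euler steps under the empirical distribution and the squared loss $\ell(y,f)=(y-f)^{2}/2$. The step size `dt`, the synthetic data, the random initialization, and all function names are assumptions made for illustration and are not taken from the paper; the discretization only approximates the continuous-time flow that the results are stated for.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def predict(X, w, U):
    # f(x) = sum_j w_j * relu(x^T u_j); X is (n, d), U is (m, d), w is (m,)
    return relu(X @ U.T) @ w

def empirical_loss(X, y, w, U):
    # (1 / 2n) * sum_i (y_i - f(x_i))^2
    return 0.5 * np.mean((y - predict(X, w, U)) ** 2)

def euler_step(X, y, w, U, dt):
    # One explicit Euler step of the gradient flow under the empirical distribution.
    n = X.shape[0]
    pre = X @ U.T                          # (n, m) pre-activations x_i^T u_j
    dl_df = relu(pre) @ w - y              # (n,) derivative of the loss in f: f(x_i) - y_i
    grad_w = relu(pre).T @ dl_df / n       # E[ dl/df * relu(x^T u_j) ]
    active = (pre >= 0).astype(float)      # indicator 1{x^T u_j >= 0}
    grad_U = (active * dl_df[:, None] * w[None, :]).T @ X / n  # E[ dl/df * w_j * 1{.} * x ]
    return w - dt * grad_w, U - dt * grad_U

# Synthetic data and initialization (assumptions, for illustration only).
rng = np.random.default_rng(0)
n, d, m = 50, 3, 20
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0])
U = rng.normal(size=(m, d)) / np.sqrt(m)
w = rng.normal(size=m) / np.sqrt(m)

loss_start = empirical_loss(X, y, w, U)
for _ in range(2000):
    w, U = euler_step(X, y, w, U, dt=0.01)
loss_end = empirical_loss(X, y, w, U)
```

With a sufficiently small `dt`, the empirical loss is expected to decrease along the iterations, mirroring the flow; how the positive step size interacts with the non-smooth ReLU is exactly the kind of question the remark above leaves open.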

2 Preliminaries and Summary

2.1 Notations

We use the boldface lower case $\mathbf{x}$ to denote a random variable or vector. The normal letter $x$ can either be a scalar or a vector when there is no confusion. The transpose of a matrix $\mathbf{A}$, resp. vector $u$, is denoted by $\mathbf{A}^{T}$, resp. $u^{T}$. $\mathbf{A}^{+}$ denotes the Moore–Penrose inverse. For $n\in\mathbb{N}$, let $[n]:=\{1,\dots,n\}$. We use $\mathbf{A}[i,j]$ to denote the $i,j$-th entry of a matrix. We denote $\mathbbm{1}_{\mathcal{D}}$ as the indicator function of set $\mathcal{D}$. We call symmetric positive semidefinite functions $K(\cdot,\cdot),H(\cdot,\cdot):\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ kernels, and use calligraphic letters $\mathcal{K},\mathcal{H}$ to denote Hilbert spaces. We use $\langle f,g\rangle_{\mu}=\int f(x)g(x)\mu(dx)$ to denote the inner product in $L^{2}_{\mu}$ (or $L^{2}(P_{x})$). $\hat{\mu}$ denotes the empirical distribution for $\mu$. Notation $\mathbf{E}_{\mathbf{x}}$ is the expectation w.r.t. the random variable $\mathbf{x}$, and $\mathop{\mathbf{E}}_{\mathbf{x},\mathbf{\tilde{x}}}h(\mathbf{x},\mathbf{\tilde{x}})=\int\int h(x,\tilde{x})\mu(dx)\mu(d\tilde{x})$. For a signed measure $\rho=\rho_{+}-\rho_{-}$ with the positive and negative parts, define $|\rho|=\rho_{+}+\rho_{-}$.

2.2 Preliminaries

We use the signed measure $\rho_{t}$, defined by the neuron weights at training time $t$ collectively, to construct a dynamic RKHS. The mathematical definition of $\rho_{t}$ is deferred to Sections 5.1 and 5.2 (specifically, Eqn. (5.8)). The stationary signed measure at $t\rightarrow\infty$ is denoted as $\rho_{\infty}$. For completeness we walk through the construction of the dynamic kernel and RKHS with $\rho_{t}$. Define the linear operator $\mathcal{T}:L^{2}_{\mu}(x)\rightarrow L^{2}_{|\rho_{t}|}(\Theta)$, such that for any $f(x)\in L^{2}_{\mu}(x)$

$$(\mathcal{T}f)(\Theta):=\int f(x)\|\Theta\|\sigma(x^{T}\Theta)\,\mu(dx),\quad\forall\,\Theta\in\text{supp}(\rho_{t}).$$

One can define the adjoint operator $\mathcal{T}^{\star}:L^{2}_{|\rho_{t}|}(\Theta)\rightarrow L^{2}_{\mu}(x)$, such that for $p(\Theta)\in L^{2}_{|\rho_{t}|}(\Theta)$,

$$(\mathcal{T}^{\star}p)(x):=\int p(\Theta)\|\Theta\|\sigma(x^{T}\Theta)\,|\rho_{t}|(d\Theta).$$

Note that both $\mathcal{T}$ and $\mathcal{T}^{\star}$ are compact operators under the finite total variation and compact support assumptions. For the finite neurons case (1.2), the operator is of finite rank. We define the compact integral operator $\mathcal{T}^{\star}\mathcal{T}$ with the corresponding kernel

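$$H_{t}(x,\tilde{x})=\int\|\Theta\|^{2}\sigma(x^{T}\Theta)\sigma(\tilde{x}^{T}\Theta)\,|\rho_{t}|(d\Theta),\quad\text{and}\quad(\mathcal{T}^{\star}\mathcal{T}f)(x):=\int H_{t}(x,\tilde{x})f(\tilde{x})\,\mu(d\tilde{x}).\tag{2.1}$$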

The dynamic RKHS $\mathcal{H}_{t}$ can be readily constructed via $H_{t}$. Let the eigendecomposition of $\mathcal{T}^{\star}\mathcal{T}$ be the countable sum $\mathcal{T}^{\star}\mathcal{T}=\sum_{i=1}^{E}\lambda_{i}e_{i}e_{i}^{*}$. Here $E$ can be a nonnegative integer or $\infty$, and $\lambda_{i}>0$. When there is no confusion, $e_{i}$ can represent either an eigenfunction or a linear functional. Similarly, we have the singular value decomposition $\mathcal{T}=\sum_{i=1}^{E}\sqrt{\lambda_{i}}t_{i}e_{i}^{*}$, and likewise for $\mathcal{T}^{\star}$. For a detailed discussion, see e.g. Casselman [2014]. Again, $t_{i}$ is a function in $L^{2}_{|\rho_{t}|}(\Theta)$ or a linear functional. The RKHS can be specified as follows.

$$\mathcal{H}_{t}=\Big\{h~\Big|~h(x)=\sum_{i}h_{i}e_{i}(x),~\sum_{i}\frac{h_{i}^{2}}{\lambda_{i}}<\infty\Big\}.$$

We refer to $H_{\infty}$ as the stationary RKHS kernel, and $\mathcal{H}_{\infty}$ as the stationary RKHS. One can view the gradient flow training dynamics on the parameters of NN as inducing a sequence of functions $\{f_{t}:t\geq 0\}$ and a dynamic RKHS $\{\mathcal{H}_{t}:t\geq 0\}$, indexed by the time $t$.
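To make the construction concrete in the finite neurons case, here is a minimal sketch that evaluates the Gram matrix of a kernel of the form $H(x,\tilde{x})=\int\|\Theta\|^{2}\sigma(x^{T}\Theta)\sigma(\tilde{x}^{T}\Theta)\,|\rho|(d\Theta)$ when $|\rho|$ is an atomic measure. The atoms `Theta`, the masses `mass`, and the function name are hypothetical stand-ins: the actual $\rho_{t}$ is the rescaled measure of Eqn. (5.8), which is not reproduced here.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def kernel_gram(X, Theta, mass):
    # Gram matrix of H(x, x~) = sum_j mass_j * ||Theta_j||^2 * relu(x^T Theta_j) * relu(x~^T Theta_j)
    norms_sq = np.sum(Theta ** 2, axis=1)            # (m,) values ||Theta_j||^2
    feats = relu(X @ Theta.T)                        # (n, m) values relu(x_i^T Theta_j)
    return (feats * (mass * norms_sq)) @ feats.T     # (n, n)

rng = np.random.default_rng(1)
n, d, m = 30, 4, 5
X = rng.normal(size=(n, d))
Theta = rng.normal(size=(m, d))      # hypothetical atoms of |rho|
mass = np.full(m, 1.0 / m)           # hypothetical nonnegative masses of |rho|

H = kernel_gram(X, Theta, mass)
eigvals = np.linalg.eigvalsh(H)
# Explicit tolerance so that rounding noise is not counted as rank.
rank = np.linalg.matrix_rank(H, tol=1e-8 * eigvals.max())
```

Under these assumptions the Gram matrix is symmetric positive semidefinite with rank at most `m`, which matches the remark that the operator is of finite rank when there are finitely many neurons.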

2.3 Organization and Summary

We will prove three results, which are summarized informally in this section (see also Table 1). We remark that Theorems 3.1 and 3.2 are stated for the approximation problem. However, as done in Corollary 3.1, one can easily state the analog for the ERM problem by replacing $\mathcal{P},\mu$ with their empirical counterparts. Recall $f_{*}(x)=\mathop{\mathbf{E}}[\mathbf{y}|\mathbf{x}=x]$.

Gradient flow on NN converges to projection onto data-adaptive RKHS. Theorem 3.1 shows that training NN with simple gradient flow, as done in practice, learns the adaptive representation and performs the global least-squares projection simultaneously, in the limit of any local stationarity. Define $f_{\infty}=\lim_{t\rightarrow\infty}f_{t}$ as the function computed by ReLU networks (defined in (1.2), or more generally in (5.9)) at any stationarity of the gradient flow dynamics (defined in (1.3), with the squared loss) for the population distribution $(\mathbf{x},\mathbf{y})\sim P$. Define the corresponding stationary RKHS $\mathcal{H}_{\infty}=\lim_{t\rightarrow\infty}\mathcal{H}_{t}$ (defined in (2.1)).

[Informal version of Thm. 3.1] Consider $f_{*}\in L^{2}_{\mu}$. For any local stationarity of the gradient flow dynamics (1.3) on the weights of neural networks (1.2), the function $f_{\infty}$ computed by NN at stationarity satisfies

$$f_{\infty}\in\operatorname*{arg\,min}_{g\in\mathcal{H}_{\infty}}\|f_{*}-g\|^{2}_{L^{2}_{\mu}}.$$

Representation benefits of data-adaptive RKHS. Theorem 3.2 illustrates the provable benefits of the learned data-adaptive representation/basis $\mathcal{H}_{\infty}$ . We emphasize that $\mathcal{H}_{\infty}$ , as obtained by training neural networks on the data $(\mathbf{x},\mathbf{y})\sim P$ , depends on the data in an implicit way such that there are advantages of representing and approximating $f_{*}$ .

[Informal version of Thm. 3.2] Consider $f_{*}\in L^{2}_{\mu}$ and the same setup as Theorem 3.1. Decompose $f_{*}$ into the function $f_{\infty}$ computed by the neural network and the residual $\Delta_{\infty}$

$$f_{*}=f_{\infty}+\Delta_{\infty}.$$

Then there is another RKHS (defined in (3.4)) $\mathcal{K}_{\infty}\supset\mathcal{H}_{\infty}$ , such that

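$$f_{\infty}\in\mathcal{H}_{\infty},\quad\Delta_{\infty}\in\text{Ker}(\mathcal{K}_{\infty})\subset\text{Ker}(\mathcal{H}_{\infty}),$$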

with a gap in the spaces $\mathcal{H}_{\infty}\oplus\text{Ker}(\mathcal{K}_{\infty})\neq L^{2}_{\mu}$ .

Convergence to Ridgeless regression with adaptive kernels. Proposition 4.1 establishes that in the vanishing regularization $\lambda\rightarrow 0$ limit, the neural network function computed by gradient flow converges to the kernel ridgeless regression with an adaptive kernel (denoted as $\widehat{f}^{\rm rkhs}_{\infty}(x)$ ). Consider using the gradient flow on the weights of the neural network function $f_{t}(x)=\sum_{j=1}^{m}w_{j}(t)\sigma(x^{T}u_{j}(t))$ , to solve the $\ell_{2}$ -regularized ERM

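$$\frac{1}{2n}\sum_{i=1}^{n}(y_{i}-f_{t}(x_{i}))^{2}+\frac{\lambda}{2m}\sum_{j=1}^{m}\big[w_{j}(t)^{2}+\|u_{j}(t)\|^{2}\big].$$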

Denote the function computed by NN at any local stationarity of ERM as $\widehat{f}^{{\rm nn},\lambda}(x)$. We answer the extrapolation question at a new point $x$, with the generalization error discussed in Prop. 4.2. The result is extendable to the infinite neurons case.

[Informal version of Prop. 4.1] Assume only the boundedness condition on the initialization that $|w_{j}(0)^{2}-\|u_{j}(0)\|^{2}|<\infty$ for all $1\leq j\leq m$. At stationarity, denote the corresponding adaptive kernel as $\widehat{H}_{\infty}^{\lambda}$. The neural network function $\widehat{f}^{{\rm nn},\lambda}_{\infty}(x)$ has the following expression,

$$\lim_{\lambda\rightarrow 0}\widehat{f}^{{\rm nn},\lambda}_{\infty}(x)=\widehat{H}_{\infty}(x,X)\widehat{H}_{\infty}(X,X)^{+}Y=:\widehat{f}^{\rm rkhs}_{\infty}(x)\quad\text{(ridgeless regression with kernel }\widehat{H}_{\infty}\text{)}.$$
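The vanishing-regularization limit can be illustrated numerically for a generic finite rank kernel. The sketch below compares kernel ridge regression of the form $H(x,X)[\frac{n}{m}\lambda\cdot I_{n}+H(X,X)]^{-1}Y$ at a small $\lambda$ with the pseudoinverse formula $H(x,X)H(X,X)^{+}Y$. The ReLU feature kernel with fixed random directions `Theta` is an assumption standing in for the adaptive kernel, which in the paper comes from the trained weights; the data and function names are likewise hypothetical.

```python
import numpy as np

def ridge_predict(H_xX, H_XX, Y, lam, m):
    # H(x, X) [ (n / m) * lam * I_n + H(X, X) ]^{-1} Y
    n = H_XX.shape[0]
    return H_xX @ np.linalg.solve((n / m) * lam * np.eye(n) + H_XX, Y)

def ridgeless_predict(H_xX, H_XX, Y):
    # H(x, X) H(X, X)^+ Y with the Moore-Penrose pseudoinverse.
    # An explicit rcond keeps rounding-level singular values from being inverted.
    return H_xX @ (np.linalg.pinv(H_XX, rcond=1e-10) @ Y)

rng = np.random.default_rng(2)
n, d, m = 40, 3, 6
X = rng.normal(size=(n, d))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
X_new = rng.normal(size=(5, d))
Theta = rng.normal(size=(m, d))      # hypothetical fixed directions, not trained weights

def features(Z):
    return np.maximum(Z @ Theta.T, 0.0)

H_XX = features(X) @ features(X).T           # (n, n), rank at most m < n
H_xX = features(X_new) @ features(X).T       # (5, n)

pred_ridge = ridge_predict(H_xX, H_XX, Y, lam=1e-6, m=m)
pred_ridgeless = ridgeless_predict(H_xX, H_XX, Y)
gap = np.max(np.abs(pred_ridge - pred_ridgeless))
```

Since the kernel here has rank at most `m < n`, the matrix $H(X,X)$ is singular and the pseudoinverse is genuinely needed; the ridge predictions approach the ridgeless ones as `lam` shrinks.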

3 Main Results: Benefits of Adaptive Representation

We formally state two main results of the paper, Theorem 3.1 and Theorem 3.2 below.

3.1 Gradient Flow, Projection and Adaptive RKHS

We study how the function $f_{t}$ computed from gradient flow on NN represents $f_{*}$ when reaching any stationarity, under the squared loss. Consider the gradient flow dynamics (5.3) reaching any stationarity. Assume that the corresponding signed measure in (5.8) satisfies $\text{TV}(\rho_{\infty})<\infty$ with a compact support. The mathematical details about $\rho_{\infty}$ are postponed to Section 5.2. We employ the notation $\rho_{\infty}$ since reaching stationarity can be viewed as $t\rightarrow\infty$ .

We would like to emphasize that this stationary signed measure $\rho_{\infty}$ is task adaptive: it implicitly depends on the regression task $f_{*}$ and the data distribution $P$ , rather than being pre-specified by the researcher as in Bach [2017], Daniely et al. [2016], Cho and Saul [2009]. With the RKHS established in Section 2.2, we are ready to state the following theorem.

Theorem 3.1 (Approximation).

For any conditional mean $f_{*}(x)=\mathop{\mathbf{E}}[\mathbf{y}|\mathbf{x}=x]\in L^{2}_{\mu}$, consider solving the approximation problem (1.5), with the ReLU NN function $f_{t}$ defined in (1.2) where $w_{j}(t)$ and $u_{j}(t)$ are the weights for $t\geq 0,1\leq j\leq m$. For any signed measure $\rho_{0}$ with ${\rm TV}(\rho_{0})<\infty$, consider the infinitesimal initialization weights $u_{j}(0)=\Theta_{j}/\sqrt{m}$ and $w_{j}(0)={\rm sgn}(\rho_{0}(\Theta_{j}))\|\Theta_{j}\|/\sqrt{m}$, with $\Theta_{j}\sim\rho_{0}$ sampled independently. When the training dynamics (1.3) reaches any stationarity, it defines a stationary signed measure $\rho^{(m)}_{\infty}$ (on the collective weights) with ${\rm TV}(\rho^{(m)}_{\infty})<\infty$, and a corresponding stationary RKHS $\mathcal{H}_{\infty}$ with the kernel defined in Eqn. (2.1), such that:

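1. the function computed by neural networks at stationarity has the form

$$f_{\infty}(x)=\int\|\Theta\|\sigma(x^{T}\Theta)\,\rho^{(m)}_{\infty}(d\Theta);$$

2. $f_{\infty}$ is a global minimizer of approximating $f_{*}$ within the RKHS $\mathcal{H}_{\infty}$

$$f_{\infty}\in\operatorname*{arg\,min}_{g\in\mathcal{H}_{\infty}}\|f_{*}-g\|^{2}_{L^{2}_{\mu}}.$$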

In addition, the same results extend to the infinite neurons case with $m\rightarrow\infty$ where the limit for $\rho^{(m)}_{\infty}$ can be defined in the weak sense.

Remark 3.1.

The above theorem shows that $\lim_{t\rightarrow\infty}f_{t}$ obtained by training on two-layer weights over time until any stationarity, is the same as projecting $f_{*}$ onto the stationary RKHS $\mathcal{H}_{\infty}$ . The projection is the solution to the classic nonparametric least squares, had one known the adaptive representation $\mathcal{H}_{\infty}$ beforehand. Conceptually, this is distinct from the theoretical framework in the current statistics and learning theory literature: we do not require the structural knowledge about $f_{*}$ (say, smoothness, sparsity, reflected in $\mathcal{F}$ ). Instead, we run gradient descent on neural networks to learn an adaptive representation for $f_{*}$ , and show how the computed function represents $f_{*}$ in this adaptive RKHS $\mathcal{H}_{\infty}$ .

In other words, training NN with simple gradient flow, as done in practice, learns the adaptive representation and performs the global least-squares projection simultaneously, in the limit of any local stationarity. Training NN learns a dynamic representation (quantified by $\mathcal{H}_{t}$) while at the same time updating the predicted function $f_{t}$, as shown in Fig. 1.

A final note on the infinite neuron case: for any fixed time $t$, with the proper random initialization, setting $m\rightarrow\infty$ defines a proper distribution dynamics on the weak limit $\rho_{t}$ shown in Lemma 5.3. Then set $t\rightarrow\infty$ to obtain the stationary RKHS $\mathcal{H}_{\infty}$.

From the above, we have the following natural decomposition,

$$\Delta_{\infty}(x)=f_{*}(x)-f_{\infty}(x)\in\text{Ker}(\mathcal{H}_{\infty}).$$

Surprisingly, as we show in the next section, $\Delta_{\infty}$ actually lies in a smaller subspace of $\text{Ker}(\mathcal{H}_{\infty})$ , characterized by $\text{Ker}(\mathcal{K}_{\infty})$ . We call this the representation and approximation benefits of the data-adaptive RKHS learned by training neural networks.
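The statement that the residual of a least-squares projection lies in the null space of the kernel integral operator can be checked on a toy example under the empirical measure. The sketch below is only an illustration of that linear-algebra fact: the directions `Theta` are hypothetical (not obtained by training), the atoms carry unit weights, and the kernel is the plain feature kernel built from ReLU features on the sample.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 40, 3, 6
X = rng.normal(size=(n, d))
Theta = rng.normal(size=(m, d))               # hypothetical atoms, not trained weights
Phi = np.maximum(X @ Theta.T, 0.0)            # (n, m) features relu(x_i^T Theta_j)
H = Phi @ Phi.T / n                           # kernel integral operator under the empirical measure
f_star = np.sin(X[:, 0]) + 0.5 * X[:, 1]      # target values at the sample points

# Least-squares projection of f_star onto the span of the features.
coef, *_ = np.linalg.lstsq(Phi, f_star, rcond=None)
f_proj = Phi @ coef
delta = f_star - f_proj                       # residual of the projection

# The residual is orthogonal to every feature, hence annihilated by H.
H_delta = H @ delta
```

Here `H_delta` vanishes up to rounding error because the normal equations force `Phi.T @ delta` to be zero; reweighting the atoms by positive masses would leave this null space unchanged.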

Before moving on, we briefly discuss the above theorem when applied to the empirical measure, to solve the ERM problem. First, as a direct corollary, the following holds.

Corollary 3.1 (ERM).

Consider the ERM problem (1.6), with the other settings the same as in Theorem 3.1. One can define the finite dimensional RKHS $\widehat{\mathcal{H}}_{\infty}$ (at most rank $n$ ) as in (2.1) with $\widehat{\mu}=\frac{1}{n}\sum_{i=1}^{n}\delta_{x_{i}}$ substituting for $\mu$ . When gradient flow reaches any stationarity, the solution satisfies

[TABLE]

More importantly, we will show in Proposition 4.1 that the function $\widehat{f}_{\infty}(x)$ computed by training neural networks with gradient descent on the empirical risk objective until any stationarity (with vanishing $\ell_{2}$ regularization) is the kernel ridgeless regression with the data-adaptive RKHS $\widehat{\mathcal{H}}_{\infty}$ . Hence, studying the out-of-sample performance of GD on NN reduces to studying the generalization of kernel ridgeless regression with adaptive kernels.

3.2 Representation Benefits of Adaptive RKHS

We now define another adaptive RKHS $\mathcal{K}_{\infty}$ , named the GD kernel, which turns out to be different from $\mathcal{H}_{\infty}$ in (2.1). Interestingly, the difference between these two kernels sheds light on the representation benefits of the adaptive RKHS. The new RKHS $\mathcal{K}_{\infty}$ is motivated by the gradient training dynamics. Recall the associated signed measure $\rho_{\infty}$ at stationarity. The GD kernel is defined as

[TABLE]

which is different from the stationary RKHS kernel $H_{\infty}$ in (2.1). We use $\mathcal{K}_{t}\colon L^{2}_{\mu}(x)\rightarrow L^{2}_{\mu}(x)$ to denote the integral operator associated with $K_{t}$ ,

[TABLE]

With a slight abuse of notation, we denote the corresponding RKHS to be $\mathcal{K}_{t}$ as well. Now we are ready to state the main theorem on the representation benefits.

Theorem 3.2 (Representation Benefits).

Consider $f_{*}\in L^{2}_{\mu}$ and the same setting as in Theorem 3.1. Consider the approximation problem (1.5) with either finite or infinite neurons, and the gradient flow dynamics (5.3) (equivalently (1.3)) with data pair $(\mathbf{x},\mathbf{y})\sim P$ drawn from the population distribution. When reaching any stationary signed measure $\rho_{\infty}$ , $f_{*}$ is decomposed into the function $f_{\infty}$ computed by the neural network and the residual $\Delta_{\infty}$

[TABLE]

Recall the RKHS $\mathcal{H}_{\infty}$ in (2.1) and the GD RKHS $\mathcal{K}_{\infty}$ in (3.4), all learned from the data $(\mathbf{x},\mathbf{y})\sim P$ and $f_{*}$ adaptively. The following holds,

[TABLE]

with $\mathcal{H}_{\infty}\oplus\text{Ker}(\mathcal{K}_{\infty})\neq L^{2}_{\mu}$ . In other words, GD on NN decomposes $f_{*}$ into two parts, and each lies in a space that is NOT the orthogonal complement to the other.

Remark 3.2.

As we can see $\text{Ker}(K_{\infty})$ and $\text{Ker}(H_{\infty})$ are not the same. Therefore, the decomposition $f_{\infty}+\Delta_{\infty}$ is not a trivial orthogonal decomposition to the RKHS $\mathcal{H}_{\infty}$ and its complement.

Recall Theorem 3.1, projecting $f_{*}$ to the RKHS $\mathcal{H}_{\infty}$ with the data-adaptive kernel

[TABLE]

associated with $|\rho_{\infty}|$ is the same as the function constructed by neural networks (GD limit as $t\rightarrow\infty$ ). However, the residual lies in a possibly much smaller space due to Theorem 3.2, which is the null space of the RKHS $\mathcal{K}_{\infty}$

[TABLE]

In other words, as the learned adaptive basis $\mathcal{H}_{\infty}$ (from GD) depends on the data distribution and the task $f_{*}$ implicitly, it has the advantage of representing $f_{*}$ by squeezing the residual into a smaller subspace in the null space of $\mathcal{H}_{\infty}$ . A pictorial illustration can be found in Fig. 2. This representation and approximation benefit helps explain the better interpolation results obtained by neural networks [Zhang et al., 2016, Belkin et al., 2018b, Liang and Rakhlin, 2018, Belkin et al., 2018a]: (1) the adaptive basis is tailored for the task $f_{*}$ , thus the residual/interpolation error lies in a smaller space; (2) in view of the ODE in Corollary 5.2, the second layer of the NN adds implicit regularization to the smallest eigenvalues of $K_{t}$ , thus improving the convergence speed of $\Delta_{t}$ to zero.

Before concluding this section, we remark that a similar result holds for the ERM problem (1.6). As we shall discuss in the next section, the gap between $\mathcal{H}_{\infty}$ and $\mathcal{K}_{\infty}$ can be large, even for the ERM problem.

4 Implications of the Adaptive Theory

In this section, we will discuss some direct implications of the adaptive kernel theory for neural networks established in this paper.

Example: Gap in Spaces $\mathcal{H}_{\infty}$ and $\mathcal{K}_{\infty}$ .

In Theorem 3.2, it is established that $\text{Ker}(\mathcal{K}_{\infty})\subset\text{Ker}(\mathcal{H}_{\infty})$ . We now construct a concrete case to illustrate the potentially significant gap between these two spaces. Consider only one neuron ( $m=1$ ), solving the ERM problem (1.6) with $n$ samples and $\mathbf{x}$ of dimension $d$ . In this case, $\rho_{\infty}$ is supported on only one point, denoted by $\Theta_{\infty}\in\mathbb{R}^{d}$ . Denoting the data matrix by $X\in\mathbb{R}^{n\times d}$ , one can show that

[TABLE]

has rank $1$ . In contrast,

[TABLE]

can be of rank $d\wedge|\{i\mathrel{\mathop{\mathchar 58\relax}}x_{i}^{T}\Theta_{\infty}\geq 0\}|$ . Hence the null space of $K_{\infty}$ is much smaller than that of $H_{\infty}$ . The gap can be large for many other settings of $(n,m,d)$ .
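The rank gap in this one-neuron example can be checked numerically. The sketch below is illustrative only: the two displayed kernel matrices are not reproduced in this text, so it assumes the forms $H=\sigma(X\Theta)\sigma(X\Theta)^{T}$ and $K=H+w^{2}\,DXX^{T}D$ with $D=\text{diag}(1\{x_{i}^{T}\Theta>0\})$ , obtained by differentiating $w\sigma(x^{T}\Theta)$ with respect to both layers, up to normalization constants that do not affect the rank. The neuron `theta` is a random stand-in, not a trained stationary point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.standard_normal((n, d))      # data matrix, rows are x_i^T
theta = rng.standard_normal(d)       # hypothetical stand-in for Theta_inf
w_sq = theta @ theta                 # balanced condition: w^2 = ||Theta||^2

pre = X @ theta                      # pre-activations x_i^T Theta
act = np.maximum(pre, 0.0)           # ReLU features sigma(X Theta)
gate = (pre > 0).astype(float)       # activation indicators

# Assumed kernel matrices (normalization constants omitted)
H = np.outer(act, act)                               # outer product, rank one
K = H + w_sq * np.outer(gate, gate) * (X @ X.T)      # adds the first-layer gradient term

n_active = int(gate.sum())
rank_H = np.linalg.matrix_rank(H)
rank_K = np.linalg.matrix_rank(K)
print(rank_H, rank_K, min(d, n_active))
```

Under these assumptions `rank_K` equals $d\wedge|\{i\colon x_{i}^{T}\Theta>0\}|$ for generic data, since $K=DX(\Theta\Theta^{T}+w^{2}I)X^{T}D$ and the middle factor is positive definite.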

Connections to Min-norm Interpolation.

The following result establishes the connections between the solution of gradient descent on neural networks (at local stationarity), and the kernel ridgeless regression [Belkin et al., 2018b, Liang and Rakhlin, 2018, Hastie et al., 2019] with an adaptive kernel $\widehat{H}^{\lambda}_{\infty}$ . Empirical evidence on the similarity between the interpolation with kernels and neural networks was discovered in Belkin et al. [2018b]. The following proposition provides a novel way of studying the generalization property of neural networks via adaptive kernels.

Proposition 4.1 (Interpolation: Connection to Kernel Ridgeless Regression).

Consider the gradient flow dynamics on all the weights of the neural network function $f_{t}(x)=\sum_{j=1}^{m}w_{j}(t)\sigma(x^{T}u_{j}(t))$ , to solve the $\ell_{2}$ -regularized ERM

[TABLE]

Assume only that the initialization is bounded, namely $|w_{j}^{2}(0)-\|u_{j}(0)\|^{2}|<\infty$ for all $1\leq j\leq m$ . At stationarity, denote the signed measure as $\widehat{\rho}^{\lambda}_{\infty}$ and the corresponding adaptive kernel as $\widehat{H}_{\infty}^{\lambda}$ . Then the neural network function at stationarity $\widehat{f}^{{\rm nn},\lambda}_{\infty}(x)$ satisfies,

[TABLE]

In the vanishing regularization $\lambda\rightarrow 0$ limit, the neural network function converges to the kernel ridgeless regression with the adaptive kernel, when $\widehat{H}_{\infty}(X,X)\mathrel{\mathop{\mathchar 58\relax}}=\lim_{\lambda\rightarrow 0}\widehat{H}_{\infty}^{\lambda}$ exists,

[TABLE]

Note that the generalization theory for the kernel ridgeless regression has been established in Liang and Rakhlin [2018], Hastie et al. [2019]. Here the kernel $\widehat{H}_{\infty}(X,X)$ is data-adaptive (it adapts to $f_{*}$ ) and is learned along training, instead of being fixed and pre-specified.
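As a minimal sketch of the ridgeless form $\widehat{H}_{\infty}(x,X)\widehat{H}_{\infty}(X,X)^{+}Y$ , the following code builds a kernel from a finite set of neurons via $\frac{1}{m}\sigma(x^{T}\Xi)\sigma(\tilde{x}^{T}\Xi)^{T}$ and evaluates the min-norm interpolant. The neurons `Theta` are random stand-ins (an assumption of this sketch, since no training is run here), and the helper names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def adaptive_kernel(A, B, Theta):
    """H(a, b) = (1/m) * sum_j relu(a^T Theta_j) * relu(b^T Theta_j)."""
    m = Theta.shape[0]
    return relu(A @ Theta.T) @ relu(B @ Theta.T).T / m

rng = np.random.default_rng(1)
n, d, m = 10, 5, 500
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)
Theta = rng.standard_normal((m, d))   # stand-in for the neurons at stationarity

H_XX = adaptive_kernel(X, X, Theta)
H_XX_pinv = np.linalg.pinv(H_XX)      # Moore-Penrose pseudo-inverse

def ridgeless_predict(x_new):
    # f(x) = H(x, X) H(X, X)^+ Y
    return adaptive_kernel(x_new, X, Theta) @ H_XX_pinv @ Y

train_fit = ridgeless_predict(X)
print(np.abs(train_fit - Y).max())    # interpolation error on the training set
```

When $m\geq n$ and the kernel matrix is invertible, the predictor interpolates the labels; what the proposition adds is that the kernel itself comes from the trained measure rather than from a fixed prior.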

Connections to Random Kitchen Sinks.

Let us introduce two function spaces, with the base measure $\rho_{0}$ (fixed representation)

[TABLE]

In random kitchen sinks studied in Rahimi and Recht [2008, 2009], by assuming $f_{*}\in\Gamma_{2}(\rho_{0})$ that lies in the RKHS, the approximation error can be controlled by the existence of the following function with $\theta_{j},j\in[m]$ i.i.d. sampled from $\rho_{0}$

[TABLE]

Note that $\widehat{f}$ lies in a possibly much larger space $\Gamma_{1}(\rho_{0})$ , even though the target $f_{*}$ is assumed to lie in $\Gamma_{2}(\rho_{0})$ . Similarly, for the two-layer neural network function $f_{t}(x)$ considered in [Bach, 2017, Section 2.3], the RKHS $\Gamma_{2}(\rho_{0})$ can be more restrictive compared to $f_{t}\in\Gamma_{1}(\rho_{0})$ .

In contrast, with the adaptive RKHS representation $\mathcal{H}_{\infty}$ , we have shown that

[TABLE]

The extreme case of fully adaptive function space $\Gamma_{2}(|\rho_{*}|)$ is defined with $\rho_{*}$ tailored for $f_{*}$ , $f_{*}=\int\sigma(x^{T}\Theta)\rho_{*}(d\Theta)$ . The adaptive representation learned by neural networks can be viewed as in between the fixed and the fully adaptive representation.

Adaptive Generalization Theory.

Now we attempt to provide a new decomposition to study the generalization of NN via adaptive kernels. Recall we have shown that $\widehat{f}^{\rm rkhs}_{\infty}(x)=\lim_{\lambda\rightarrow 0}\widehat{f}^{{\rm nn},\lambda}_{\infty}(x)=\widehat{H}_{\infty}(x,X)\widehat{H}_{\infty}(X,X)^{+}Y,$ where $\widehat{H}_{\infty}(x,\tilde{x})\mathrel{\mathop{\mathchar 58\relax}}=\int\sigma(x^{T}\Theta)\sigma(\tilde{x}^{T}\Theta)\widehat{\rho}^{(n,m)}_{\infty}(d\Theta).$ Define the population limit $\rho^{(m)}_{\infty}(d\Theta)\mathrel{\mathop{\mathchar 58\relax}}=\lim_{n\rightarrow\infty}\widehat{\rho}^{(n,m)}_{\infty}$ and $H_{\infty}(x,\tilde{x})\mathrel{\mathop{\mathchar 58\relax}}=\int\sigma(x^{T}\Theta)\sigma(\tilde{x}^{T}\Theta)\rho^{(m)}_{\infty}(d\Theta)$ . Denote the ridgeless regression with the population adaptive kernel $H_{\infty}$ ,

[TABLE]

Assume $(\mathbf{y}-f_{*}(\mathbf{x}))^{2}\leq\sigma^{2}$ a.s. (can be relaxed). One can derive the following decomposition for generalization.

Proposition 4.2 (Adaptive Generalization).

[TABLE]

Note this result holds without requiring global optimization guarantees. The first term is the representation error, which corresponds to the closeness of the adaptive RKHS $\widehat{\mathcal{H}}_{\infty}$ (using the empirical distribution) and $\mathcal{H}_{\infty}$ (using the population distribution). The second term is the adaptive approximation error studied in the current paper. The third and fourth terms are the variance and bias expressions studied in Liang and Rakhlin [2018], Hastie et al. [2019], Rakhlin and Zhai [2018], as if the actual function lay in $\mathcal{H}_{\infty}$ . This decomposition suggests the possibility of studying generalization without explicit global understanding of the optimization, and of providing rates that adapt to $f_{*}$ without structural assumptions.

5 Time-varying Kernels and Evolution

In this section, we lay out the mathematical details on the time-varying kernels and the evolution of the signed measure $\rho_{t}$ supporting the main results. In the meantime, we will discuss in depth the relevant literature motivating our proof ideas.

First, we describe the motivation behind the dynamic RKHS $\mathcal{K}_{t}$ , and the GD kernel induced by the gradient descent dynamics. Extensions to multi-layer perceptrons are in Sec. A.2.

Lemma 5.1 (Dynamic kernel of finite neurons GD).

Consider the approximation problem (1.1) with a neural network function (1.2), and the training process (1.3) with population distribution. Let $\Delta_{t}(x)=f_{*}(x)-f_{t}(x)$ be the residual. Define the time-varying kernel $K_{t}(\cdot,\cdot)\mathrel{\mathop{\mathchar 58\relax}}\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ ,

[TABLE]

Then the residual $\Delta_{t}$ driven by the GD dynamics satisfies,

[TABLE]

When running GD to solve the empirical risk minimization (ERM), the dynamics of the finite-dimensional sample residual $\|\Delta_{t}\|_{\hat{\mu}}^{2}$ has been established in Jacot et al. [2018], Du et al. [2018]. Here we generalize the result to optimize the weights of both layers, and to solve the infinite-dimensional population approximation problem rather than the empirical risk minimization problem. For a general loss function $\ell(y,f)$ with curvature (say, logistic loss), similar results hold under slightly stronger conditions.
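The residual dynamics can be sanity-checked on the empirical measure with a single small gradient step. This is a sketch under stated assumptions: the kernel is taken to be $K_{t}(x,\tilde{x})=\sum_{j}\big[\sigma(x^{T}u_{j})\sigma(\tilde{x}^{T}u_{j})+w_{j}^{2}1\{x^{T}u_{j}>0\}1\{\tilde{x}^{T}u_{j}>0\}x^{T}\tilde{x}\big]$ , which is what the chain rule gives for $f(x)=\sum_{j}w_{j}\sigma(x^{T}u_{j})$ when both layers are trained, and the finite-difference step `eta` stands in for the continuous-time derivative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 20, 5, 50
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
U = rng.standard_normal((m, d)) / np.sqrt(m)          # first-layer weights u_j
w = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)      # second-layer weights w_j

def forward(U, w):
    return np.maximum(X @ U.T, 0.0) @ w               # f(x_i) = sum_j w_j relu(x_i^T u_j)

def half_mse(U, w):
    return 0.5 * np.mean((y - forward(U, w)) ** 2)

pre = X @ U.T
feat = np.maximum(pre, 0.0)                           # df/dw_j at each x_i
gate_w = (pre > 0) * w                                # df/du_j = gate_w[i, j] * x_i
K = feat @ feat.T + (gate_w @ gate_w.T) * (X @ X.T)   # assumed GD kernel matrix

delta = y - forward(U, w)                             # residual Delta_t on the sample
rate_kernel = -(delta @ K @ delta) / n**2             # -E_{x, x~}[Delta K Delta]

# Gradient of the empirical objective w.r.t. both layers
grad_w = -(feat.T @ delta) / n
grad_U = -(((pre > 0) * delta[:, None]).T @ X) * w[:, None] / n

eta = 1e-6                                            # small Euler step (assumption)
rate_fd = (half_mse(U - eta * grad_U, w - eta * grad_w) - half_mse(U, w)) / eta
print(rate_kernel, rate_fd)
```

The two rates agree up to discretization error because the squared gradient norm of the objective equals $\frac{1}{n^{2}}\Delta^{T}K\Delta$ .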

Corollary 5.1.

Consider a general loss function $\ell(y,f)$ that is $\alpha$ -strongly convex in the second argument $f$ , with $K_{t}$ defined in (5.1). Assume in addition $\frac{1}{n}K_{t}(X,X)\in\mathbb{R}^{n\times n}$ has smallest eigenvalue $\lambda_{t}>0$ . Define $\Delta_{t}(x_{i})\mathrel{\mathop{\mathchar 58\relax}}=\frac{\partial\ell(y_{i},f_{t}(x_{i}))}{\partial f}$ , then we have for all $f_{*}\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{d}\rightarrow\mathbb{R}$ ,

[TABLE]

5.1 Initialization, Rescaling and $K_{0}$

Now we describe the initialization and rescaling schemes used in the main theorems. Rewrite (1.1) according to the signs of the second layer weights

[TABLE]

Initialization.

We consider the “infinitesimal” initialization drawn $i.i.d.$ from two probability measures $\rho_{+,0}$ and $\rho_{-,0}$ that do not depend on $m$ :

[TABLE]

Here $m=m_{+}+m_{-}$ with $m_{+}\asymp m_{-}$ . The $1/\sqrt{m}$ rescaling factor turns out to be crucial when defining the infinite neurons limit for the evolution of signed measures. Remark that such initialization is w.l.o.g., and accounts for the infinitesimal nature used in practice when the number of neurons grows. For the second layer weights, we impose the “balanced condition” motivated by Maennel et al. [2018],

[TABLE]

It turns out that with such initialization, the balanced condition holds throughout the training process induced by gradient flow, which is useful for the main theorems. Interestingly, in the proof of Proposition 4.1, we show that this balanced condition always holds at stationarity when training neural networks with $\ell_{2}$ regularization, even for unbalanced initialization.

Proposition 5.1 (Balanced condition).

For $u_{+,j}(t)$ , $u_{-,j}(t)$ , $w_{+,j}(t)$ and $w_{-,j}(t)$ , and the initialization specified above, at any time $t$ , we have

[TABLE]
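The conservation behind the balanced condition can be illustrated numerically. The sketch below assumes the balanced initialization takes the form $w_{j}=\pm\|u_{j}\|$ and approximates gradient flow by small-step gradient descent on an empirical squared loss, so the imbalance $w_{j}^{2}-\|u_{j}\|^{2}$ is only preserved up to a discretization error of order `eta**2` per step. Step size and step count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 30, 5, 40
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
U = rng.standard_normal((m, d)) / np.sqrt(m)
signs = rng.choice([-1.0, 1.0], size=m)
w = signs * np.linalg.norm(U, axis=1)      # assumed balanced init: w_j^2 = ||u_j||^2

eta, steps = 5e-4, 200                     # small steps to mimic gradient flow
for _ in range(steps):
    pre = X @ U.T
    feat = np.maximum(pre, 0.0)
    delta = y - feat @ w                   # residual on the sample
    grad_w = -(feat.T @ delta) / n
    grad_U = -(((pre > 0) * delta[:, None]).T @ X) * w[:, None] / n
    w, U = w - eta * grad_w, U - eta * grad_U

imbalance = np.abs(w**2 - np.sum(U**2, axis=1)).max()
print(imbalance)                           # stays close to zero
```

The first-order terms cancel because $w_{j}\,\partial_{w_{j}}f=u_{j}^{T}\partial_{u_{j}}f$ for ReLU, which is the positive homogeneity used throughout this section.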

Rescaling.

To prepare for the distribution dynamic theory in the next section, we introduce a parameter rescaling with the $\sqrt{m}$ factor. Let $\theta_{+,j}(t)=\sqrt{m}w_{+,j}(t)$ and $\theta_{-,j}(t)=\sqrt{m}w_{-,j}(t)$ , also define $\Theta_{+,j}(t)=\sqrt{m}u_{+,j}(t)$ and $\Theta_{-,j}(t)=\sqrt{m}u_{-,j}(t)$ sampled from $\rho_{+,0}$ and $\rho_{-,0}$ at $t=0$ . Under this representation,

[TABLE]

By the positive homogeneity of ReLU, we have the corresponding dynamics on the rescaled parameters,

[TABLE]
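The effect of the rescaling on the function itself can be checked directly. Assuming the rescaled representation reads $f(x)=\frac{1}{m}\sum_{j}\theta_{j}\sigma(x^{T}\Theta_{j})$ (the display is not reproduced here; this form follows from positive homogeneity of ReLU), the two parameterizations compute the same value:

```python
import numpy as np

rng = np.random.default_rng(5)
d, m = 5, 64
U = rng.standard_normal((m, d)) / np.sqrt(m)        # original first-layer weights u_j
w = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)    # original second-layer weights w_j
x = rng.standard_normal(d)

theta = np.sqrt(m) * w                              # rescaled second layer
Theta = np.sqrt(m) * U                              # rescaled first layer

f_original = w @ np.maximum(U @ x, 0.0)                       # sum_j w_j relu(x^T u_j)
f_rescaled = np.mean(theta * np.maximum(Theta @ x, 0.0))      # (1/m) sum_j theta_j relu(x^T Theta_j)
print(f_original, f_rescaled)
```

In this form the network output is an average over neurons, which is what makes the empirical measure over $\Theta$ a natural object as $m\rightarrow\infty$ .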

Define at time $t$

[TABLE]

as the empirical distribution over neurons on the parameter space $\Theta$ . The $\rho_{+,t}$ and $\rho_{-,t}$ converge weakly to proper distributions in the infinite neurons limit $m\rightarrow\infty$ , see e.g. Bach [2017], Mei et al. [2018]. Through the balanced condition in Proposition 5.1 and Proposition A.1, we know (by substituting $\theta_{j}$ by $\|\Theta_{j}\|$ )

[TABLE]

The above motivates the study of the RKHS $\mathcal{H}_{t}$ as in Theorem 3.1, with the kernel

[TABLE]

To conclude this section, we provide the explicit formula for the initial kernel matrix $K_{0}$ under such infinitesimal random initialization. Specifically, consider the initialization with $w_{j}$ being $\pm 1/\sqrt{m}$ with equal chance and $u_{j}\sim N(\mathbf{0},1/m\cdot\mathbf{I}_{d})$ sampled $i.i.d.$ The initial kernel $K_{0}$ has the following expression in the infinite neurons limit.

Lemma 5.2 (Fixed Kernel).

With the initialization specified above, consider w.l.o.g. $\|x\|=\|\tilde{x}\|=1$ , and denote $\Theta\sim\pi$ as the isotropic Gaussian $N(\mathbf{0},\mathbf{I}_{d})$ . By the strong law of large numbers, we have almost surely,

[TABLE]

Many known results [Bengio et al., 2006, Rahimi and Recht, 2008, Bach, 2017, Cho and Saul, 2009, Daniely et al., 2016] on the connection between RKHS and two-layer NN focus on some fixed kernel, such as $K_{0}$ . To instantiate useful statistical rates, one requires $f_{*}$ to lie in the corresponding pre-specified RKHS $\mathcal{K}_{0}$ , which is non-verifiable in practice. In contrast, the dynamic kernel is less studied. We will establish a dynamic and adaptive kernel theory defined by GD, without making any structural assumptions on $f_{*}$ other than $f_{*}\in L^{2}_{\mu}$ .
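The fixed kernel $K_{0}$ can be approximated by Monte Carlo and compared with a closed form. The sketch below assumes the standard arc-cosine identities for unit vectors at angle $\theta$ (Cho and Saul, 2009): $\mathbf{E}[\sigma(x^{T}\Theta)\sigma(\tilde{x}^{T}\Theta)]=\frac{\sin\theta+(\pi-\theta)\cos\theta}{2\pi}$ and $\mathbf{E}[1\{x^{T}\Theta\geq 0\}1\{\tilde{x}^{T}\Theta\geq 0\}]=\frac{\pi-\theta}{2\pi}$ . The displayed formula of Lemma 5.2 is not reproduced in this text, so the closed form in the code is this sketch's own combination of the two identities.

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 5, 400_000
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
x_tilde = rng.standard_normal(d)
x_tilde /= np.linalg.norm(x_tilde)

Theta = rng.standard_normal((m, d))     # Theta_j ~ N(0, I_d), so u_j = Theta_j / sqrt(m)
a, b = Theta @ x, Theta @ x_tilde
t = x @ x_tilde                         # cosine of the angle between x and x_tilde

# Finite-m kernel with w_j^2 = 1/m, written in terms of the rescaled Theta_j
K0_mc = np.mean(np.maximum(a, 0.0) * np.maximum(b, 0.0)) + t * np.mean((a > 0) & (b > 0))

ang = np.arccos(np.clip(t, -1.0, 1.0))
K0_closed = (np.sin(ang) + (np.pi - ang) * np.cos(ang)) / (2 * np.pi) \
    + np.cos(ang) * (np.pi - ang) / (2 * np.pi)
print(K0_mc, K0_closed)
```

The first summand comes from the second-layer gradients and the second from the first-layer gradients; both depend on the inputs only through the angle, which is why $K_{0}$ is a fixed, data-independent kernel.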

5.2 Evolution of $\rho_{t}$

In this section, we derive the evolution of the signed measure $\rho_{t}$ defined by the neurons at training time $t$ , which in turn determines the dynamic kernel $K_{t}$ defined in (5.1). To generalize the result to the case of infinite neurons, we follow and borrow tools from the mean-field characterization [Mei et al., 2018, Rotskoff and Vanden-Eijnden, 2018, Jordan et al., 1998]. The rescaling described in the previous section proves handy when defining such an infinite neurons limit. We define the velocity field driven by the regression task and the interaction among neurons,

[TABLE]

The following lemma casts the training process as distribution dynamics on $\rho_{+,t},\rho_{-,t}$ .

Lemma 5.3 (Dynamic Kernel and Evolution).

Consider the approximation problem (1.1), and the gradient flow as the training dynamic (1.3). For $\rho_{+,t}$ , $\rho_{-,t}$ and $\rho_{t}$ defined in (5.8) with possibly infinite neurons, we have the following PDE characterization on distribution dynamics of $\rho_{+,t},\rho_{-,t}$

[TABLE]

Moreover, the GD kernel $K_{t}$ is defined as

[TABLE]

Remark 5.1.

As in Mei et al. [2018], Rotskoff and Vanden-Eijnden [2018], let’s first show that in the infinite neurons limit $m\rightarrow\infty$ , $\rho_{+,t},\rho_{-,t}$ are properly defined, with Eqn. (5.3) characterizing the distribution dynamics. For simplicity, we assume the initialization $\rho_{+,0},\rho_{-,0}$ has bounded support. Add the superscript $m$ , $\rho_{+,t}^{(m)},\rho_{-,t}^{(m)},\rho_{t}^{(m)}$ to (5.8) to indicate their dependence on $m$ . Assume that $\nabla_{\Theta}V(\Theta)$ , $\nabla_{\Theta}U(\Theta,\tilde{\Theta})$ in (5.11) are bounded and uniformly Lipschitz continuous as in [Mei et al., 2018, A3]. With the same proof as in [Mei et al., 2018, Theorem 3], one can show that with $m\rightarrow\infty$ , the initial distribution $\rho^{(m)}_{0}\xrightarrow{d}\rho_{0}=\rho_{+,0}-\rho_{-,0}$ by the law of large numbers. By the solution’s continuity w.r.t. the initial value, $\rho_{t}^{(m)}\xrightarrow{d}\rho_{t}$ as $m\rightarrow\infty$ is then well defined for any fixed $t$ .

Note that our problem setting is slightly different from that in Mei et al. [2018], where the authors consider the NN with fixed second layer weights to be $1/m$ . We reiterate that the re-parameterization via $\theta$ and $\Theta$ is crucial: (1) weights on both layers are optimized following the gradient flow; (2) infinitesimal random initialization is employed in practice. In the setting of [Mei et al., 2018, Eqn. (3)], the training process is slightly different from the vanilla GD on weights, with an additional $m$ factor in the velocity term. This subtlety is also mentioned in Rotskoff and Vanden-Eijnden [2018]. In short, the rescaling looks at the dynamics where $\Theta$ ’s are on the invariant scale as $m\rightarrow\infty$ for any fixed effective time $t$ (that does not depend on $m$ ). Here we analyze the exact gradient flow on the two-layer weights, with infinitesimal random initialization as in practice, resulting in a different velocity field (5.11) compared to that in Mei et al. [2018].

The proof of Theorem 3.1 makes use of (5.9)-(5.10) and the stationary condition implied by Lemma 5.3. The balanced condition is crucial in both Theorem 3.1 and Proposition 4.1. The details of the proof are deferred to Section 7.

5.3 Two RKHS: $\mathcal{K}_{\infty}$ and $\mathcal{H}_{\infty}$

In this section we compare the two adaptive RKHSs that have appeared: $\mathcal{K}_{\infty}$ in (5.13) and $\mathcal{H}_{\infty}$ in (5.10). The comparison will lead to the proof of Theorem 3.2. We start by generalizing Lemma 5.1 to the possibly infinite neurons case via the distribution dynamics in (5.3).

Corollary 5.2.

Consider the same setting as in Lemma 5.1 with a possibly infinite neurons NN (5.9), and the training process (5.3). Define the time-varying kernel $K_{t}(\cdot,\cdot)\colon\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ , with the signed measure $\rho_{t}$ following (5.3),

[TABLE]

Then we still have $d\mathbf{E}_{\mathbf{x}}\left[\frac{1}{2}\Delta_{t}(\mathbf{x})^{2}\right]/dt=-\mathbf{E}_{\mathbf{x},\mathbf{\tilde{x}}}\left[\Delta_{t}(\mathbf{x})K_{t}(\mathbf{x},\mathbf{\tilde{x}})\Delta_{t}(\mathbf{\tilde{x}})\right].$

It turns out that the kernels $K_{\infty}$ and $H_{\infty}$ , defined in (3.4) and (2.1) respectively, satisfy the following inclusion property.

Proposition 5.2.

Consider the training process reaches any stationarity $\rho_{\infty}=\rho_{+,\infty}-\rho_{-,\infty}$ with compact support within radius $D$ and finite total variation. We have

[TABLE]

with $K_{\infty}^{(0)},K_{\infty}^{(1)}$ defined in (5.14). Combining this with the fact that $H_{\infty}\neq K_{\infty}$ yields

[TABLE]

The proof of Theorem 3.2 uses the following fact: when reaching stationarity, due to the ODE defined by GD in Lemma 5.1, the residual must satisfy

[TABLE]

The proofs of Proposition 5.2 and Theorem 3.2 are deferred to Section 7.

6 Experiments

We run experiments to illustrate the spectral decay of the dynamic kernels $K_{t}$ over time $t$ . The exercise is to quantitatively showcase that during neural network training, one does learn a data-adaptive representation, which is task-specific and depends on the true complexity of $f_{*}$ . The training process is the same as the one we theoretically analyze: vanilla gradient descent on a two-layer NN of $m$ neurons, with infinitesimal random initialization that scales as $1/\sqrt{m}$ .

The first experiment is a synthetic exercise with well-specified models. We generate $\{x_{i}\}_{i=1}^{50}$ from an isotropic Gaussian in $\mathbb{R}^{5}$ , and $y_{i}=f_{*}(x_{i})=\sum_{j=1}^{J}w^{*}_{j}\sigma(x_{i}^{T}u^{*}_{j})$ with different $J$ . In other words, we choose different targets $f_{*}$ (task complexity) by varying $J$ . We select $m=500$ in our experiment. The top $80\%$ of the sorted eigenvalues of the kernel matrix $K_{t}$ along the GD training process are shown in Fig. 3. The $x$ -axis is the index of the eigenvalues in descending order, and the $y$ -axis is the logarithm of the corresponding eigenvalues. Different colors indicate the spectral decay of $K_{t}$ at different training times $t$ . Stabilization of the eigenvalue decay over time $t$ means that the training process approaches stationarity. As we can see, when $f_{*}$ belongs to the NN family, the eigenvalues of the kernel matrix, in general, become larger during the training process. For a more complicated target function, it takes longer to reach stationarity.
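A minimal sketch of how such a spectrum-tracking experiment could be set up is given below. It is not the code used for Fig. 3: the values of `J`, `eta`, `steps` and `record_every` are hypothetical choices, and the kernel matrix is assumed to take the finite-neuron form $\sum_{j}\big[\sigma(x^{T}u_{j})\sigma(\tilde{x}^{T}u_{j})+w_{j}^{2}1\{x^{T}u_{j}>0\}1\{\tilde{x}^{T}u_{j}>0\}x^{T}\tilde{x}\big]$ .

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, m, J = 50, 5, 500, 3                         # J is a hypothetical task complexity
X = rng.standard_normal((n, d))
U_star = rng.standard_normal((J, d))
w_star = rng.standard_normal(J)
y = np.maximum(X @ U_star.T, 0.0) @ w_star         # well-specified target f_*

U = rng.standard_normal((m, d)) / np.sqrt(m)       # 1/sqrt(m) initialization scale
w = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

def gd_kernel_spectrum(U, w):
    """Eigenvalues of the assumed GD kernel matrix K_t(X, X), in descending order."""
    pre = X @ U.T
    feat = np.maximum(pre, 0.0)
    gate_w = (pre > 0) * w
    K = feat @ feat.T + (gate_w @ gate_w.T) * (X @ X.T)
    return np.linalg.eigvalsh(K)[::-1]

def loss(U, w):
    return 0.5 * np.mean((y - np.maximum(X @ U.T, 0.0) @ w) ** 2)

eta, steps, record_every = 1e-2, 300, 100          # hypothetical training schedule
initial_loss = loss(U, w)
spectra = {0: gd_kernel_spectrum(U, w)}
for step in range(1, steps + 1):
    pre = X @ U.T
    feat = np.maximum(pre, 0.0)
    delta = y - feat @ w
    grad_w = -(feat.T @ delta) / n
    grad_U = -(((pre > 0) * delta[:, None]).T @ X) * w[:, None] / n
    w, U = w - eta * grad_w, U - eta * grad_U
    if step % record_every == 0:
        spectra[step] = gd_kernel_spectrum(U, w)
final_loss = loss(U, w)

top = int(0.8 * n)                                 # keep the top 80% of eigenvalues
for step, eig in spectra.items():
    log_eig = np.log10(np.clip(eig[:top], 1e-12, None))
    print(step, log_eig[0], log_eig[-1])
```

Plotting `log_eig` against the eigenvalue index for each recorded step gives curves of the kind described above; comparing runs with different `J` is one way to probe how the spectrum depends on task complexity.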

The second experiment is another synthetic test, on fitting random labels. We generate $\{x_{i}\}_{i=1}^{n}$ from an isotropic Gaussian in $\mathbb{R}^{5}$ , and $y_{i}$ takes $\pm 1$ with equal chance. We select $m=200,500$ and $n=50,200$ to investigate the influence of these parameters on the kernel $K_{t}$ . We want to point out two observations. First, with $n$ fixed, we investigate over-parametrized models ( $m=200,500$ large). As shown in Fig. 4 along the row, the kernels for different $m$ behave much alike. This is consistent with the kernel stabilizing in the infinite neurons limit. Second, with $m$ fixed, we vary the number of samples $n$ to simulate different interpolation hardness. As seen from Fig. 4 along the column, the kernels and the convergence over time are distinct, reflecting the different difficulty of the interpolation.

The third experiment (Fig. 5) is regression on the MNIST dataset with different sample sizes $n=50,200$ . We investigate the influence of the sample size on the kernel matrix along the training process. For a larger sample size $n$ , it takes longer to reach stationarity.

7 Main Proofs

Proof of Theorem 3.1.

From the definition, we have $\mathcal{T}^{*}p\in\mathcal{H}_{\infty}$ for any $p\in L^{2}_{|\rho_{\infty}|}$ , and $\mathcal{T}^{*}$ is a surjective mapping. Suppose that $\widehat{g}\in\mathcal{H}_{\infty}$ is a minimizer of (3.2), then we claim that for any $p\in L^{2}_{|\rho_{\infty}|}$ , one must have

[TABLE]

This claim can be seen from the following argument. Suppose not, then for $p$ that violates the above, construct

[TABLE]

we know

[TABLE]

For $\epsilon$ small enough and with the same sign as $\langle f_{*}-\widehat{g},\mathcal{T}^{*}p\rangle_{\mu}\neq 0$ , one can see that $\|f_{*}-\widehat{g}_{\epsilon}\|_{\mu}^{2}<\|f_{*}-\widehat{g}\|_{\mu}^{2}$ , which contradicts the assumption that $\widehat{g}$ is a minimizer. From the same argument, one can see that $\widehat{g}$ is a minimizer if and only if (7.1) holds, in other words,

[TABLE]

From PDE characterization (5.3) with ReLU activation, one knows that

[TABLE]

and the expression for the velocity field

[TABLE]

We know that any stationary point $\left(\rho_{+,\infty},\rho_{-,\infty}\right)$ has the following property [Mei et al., 2018]:

[TABLE]

Multiplying both sides by $\|\Theta\|\Theta^{T}$ and recalling the property of ReLU, the above condition implies that for all $\Theta\in\text{supp}(\rho_{\infty})$ , we have

[TABLE]

One can see the stationary condition on $\rho_{\infty}$ (fixed points of the dynamics) (7.5) translates to

[TABLE]

Here the function $\frac{d\rho_{\infty}}{d|\rho_{\infty}|}$ is the Radon-Nikodym derivative. In addition, one can easily verify that, as $\rho_{\infty}$ has bounded total variation

[TABLE]

Therefore, combining all the above, one knows that

[TABLE]

and that for any $p\in L^{2}_{|\rho_{\infty}|}$

[TABLE]

We have proved that $f_{\infty}=\mathcal{T}^{*}\frac{d\rho_{\infty}}{d|\rho_{\infty}|}$ satisfies the normal condition for being a minimizer of (3.2). ∎

Proof of Proposition 5.2.

The first inequality in (5.16) is trivial. For the second inequality, it suffices to show for any $c=(c_{1},\dots,c_{p})^{T}$ , $x_{1},\dots,x_{p}$ , $\Theta$ , we have

[TABLE]

The RHS equals

[TABLE]

For the last inequality, with compactness condition on $\rho_{\infty}$ , we have

[TABLE]

Therefore, $D^{2}K_{\infty}^{(1)}\succeq H_{\infty}$ .

∎

Proof of Theorem 3.2.

Let us rewrite Corollary 5.2 into

[TABLE]

here $\mathcal{K}_{t}\mathrel{\mathop{\mathchar 58\relax}}L^{2}_{\mu}(x)\rightarrow L^{2}_{\mu}(x)$ denotes the integral operator associated with $K_{t}$ ,

[TABLE]

From (7.14)

[TABLE]

we know that the RHS equals zero implies

[TABLE]

This further implies $\Delta_{\infty}$ lies in the kernel of RKHS $\mathcal{K}_{\infty}$ as $\mathcal{K}_{\infty}=\{\mathcal{K}_{\infty}^{1/2}g\mathrel{\mathop{\mathchar 58\relax}}g\in L^{2}_{\mu}\}$ . ∎

Proof of Proposition 4.1.

The gradients on the original parameters are,

[TABLE]

Clearly, on the rescaled parameter, the following holds

[TABLE]

Multiplying the first equation by $\theta_{j}$ and the second equation by $\Theta_{j}^{T}$ , and taking the difference, we can verify that

[TABLE]

Therefore the balanced condition still holds at stationarity for arbitrary bounded initialization,

[TABLE]

Now the optimality condition for the velocity field reads as follows, for any $\Theta_{j}(\infty)\in{\rm supp}(\widehat{\rho}_{\infty}^{\lambda})$ (we abbreviate the $\infty$ in the following display; note that $\tilde{\theta}(\infty)$ corresponds to the second layer weights w.r.t. $\tilde{\Theta}(\infty)$ )

[TABLE]

where the last step uses the condition $\theta_{j}^{2}(\infty)=\|\Theta_{j}(\infty)\|^{2}$ , and the fact that $|\widehat{\rho}_{\infty}^{\lambda}|=\frac{1}{m}\sum_{j=1}^{m}\delta_{\Theta_{j}}$ and

[TABLE]

In the matrix form, where $\widehat{\rho}^{\lambda}_{\infty}=\frac{1}{m}\sum_{l\in[m]}{\rm sgn}(\theta_{l})\delta_{\Theta_{l}}$

[TABLE]

Therefore, define $\sigma(x^{T}\Xi)\mathrel{\mathop{\mathchar 58\relax}}=[\sigma(x^{T}\Theta_{1})\ldots,\sigma(x^{T}\Theta_{m})]\in\mathbb{R}^{1\times m}$ , and $\sigma(X\Xi)\mathrel{\mathop{\mathchar 58\relax}}=[\sigma(x_{1}^{T}\Xi)^{T},\ldots,\sigma(x_{n}^{T}\Xi)^{T}]\in\mathbb{R}^{m\times n}$ , we have

[TABLE]

The last line follows as $\widehat{H}^{\lambda}(x,\tilde{x})\mathrel{\mathop{\mathchar 58\relax}}=\int\sigma(x^{T}\Theta)\sigma(\tilde{x}^{T}\Theta)|\widehat{\rho}^{\lambda}_{\infty}|(d\Theta)=1/m\cdot\sigma(x^{T}\Xi)\sigma(\tilde{x}^{T}\Xi)^{T}$ .

∎

Proof of Proposition 4.2.

[TABLE]

For the first term, we can upper bound by $\sigma^{2}\mathop{\mathbf{E}}_{\mathbf{x}\sim\mu}\|H_{\infty}(X,X)^{-1}H_{\infty}(X,\mathbf{x})\|^{2}$ . The second term can be upper bounded by

[TABLE]

The proof is complete. ∎

Acknowledgement

We thank Maxim Raginsky for pointing out relevant references, and for providing helpful discussion.

Appendix A Appendix

A.1 Supporting Results

Proof of Lemma 5.3.

Let’s first show that in the infinite neuron limit $m\rightarrow\infty$ , $\rho_{+,t},\rho_{-,t}$ are properly defined, so that Eqn. (5.3) in the lemma also characterizes the distribution dynamics for the infinite neurons NN induced by gradient flow training. For simplicity, we assume the initialization $\rho_{+,0},\rho_{-,0}$ has bounded support. We add the superscript $m$ , $\rho_{+,t}^{m},\rho_{-,t}^{m},\rho_{t}^{m}$ to (5.8) to indicate their dependence on $m$ . Assume that $\nabla_{\Theta}V$ , $\nabla_{\Theta}U(\Theta,\tilde{\Theta})$ in (5.11) are bounded and uniformly Lipschitz continuous as in [Mei et al., 2018, A3]. With the same proof as in [Mei et al., 2018, Theorem 3], one can show that with $m\rightarrow\infty$ , the initial distribution $\rho^{m}_{0}\xrightarrow{d}\tilde{\rho}_{0}=\rho_{+,0}-\rho_{-,0}$ by the law of large numbers. By the solution’s continuity with respect to the initial value, $\rho_{t}^{m}\xrightarrow{d}\rho_{t}$ as $m\rightarrow\infty$ is therefore well defined.

The velocity of a particle $\Theta$ in the positive part as a rewrite of (5.6)-(5.7) is

[TABLE]

resp. for the negative part and (5.7), we have

[TABLE]

Given the velocity of particle, we have the transport equation for gradient flow,

[TABLE]

To see this, recall the definition of weak derivative $\partial_{t}\rho_{t}$ : for any bounded smooth function $g$ , $\partial_{t}\rho_{t}$ is defined in the following sense

[TABLE]

We take any bounded smooth function $g(\Theta)$ , given the velocity of $\Theta$ ’s , then we have

[TABLE]

and $\rho_{-,t}$ correspondingly. By the weak derivative, we get the above PDE. We use the above dynamic description as the training process for the infinite neuron NN. Plugging the above equation into $\rho_{t}=\rho_{+,t}-\rho_{-,t}$ and $|\rho_{t}|=\rho_{+,t}+\rho_{-,t}$ , we get

[TABLE]

∎

Proof of Proposition 5.1.

It suffices to show $\theta^{2}_{+,i}(t)=\|\Theta_{+,i}(t)\|_{2}^{2}$ and resp. $\theta^{2}_{-,i}(t)=\|\Theta_{-,i}(t)\|_{2}^{2}$ . By our path dynamics, we have

[TABLE]

Thus, by the initialization, we have $\theta_{+,i}(t)=\|\Theta_{+,i}(t)\|$ , and resp. $\theta_{-,i}(t)=-\|\Theta_{-,i}(t)\|$ . ∎

Proposition A.1 (No sign change).

For the training process (1.3) for problem (1.1) with NN (1.2), once $w_{j}(t)$ and $u_{j}(t)$ hit zero at $t_{0}$ , for $t>t_{0}$ at least there exists a solution that can be viewed as training without the $j$ -th neuron.

Proof of Proposition A.1.

Use $w_{j}(t_{0})$ , $u_{j}(t_{0})$ , for $j\neq i$ , as the initial value for ODE (1.3) without the $i$ -th node. By assumption, this $2\cdot(2m-1)$ -dimensional initial value problem has a solution. Padding this solution with $u_{i}\equiv 0$ and $w_{i}\equiv 0$ then gives a solution of ODE (1.3) with the $i$ -th neuron included. ∎

Proof of Lemma 5.1.

First we write down the dynamic of prediction $f(\tilde{x})$ at each point $\tilde{x}$ based on Eqn. (1.3). For notational simplicity, let $u_{j},w_{j}$ be $u_{j}(t),w_{j}(t)$ , and let $o^{1}_{j}(\tilde{x})=\sigma(u_{j}^{T}\tilde{x})$ , and with the square loss $\ell(y,f)=\frac{1}{2}(y-f)^{2}$ , we have

[TABLE]

Therefore, we have

[TABLE]

∎
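The chain-rule computation behind this proof can be checked numerically. The sketch below assumes a two-layer ReLU network $f(x)=\sum_{j}w_{j}\sigma(u_{j}^{T}x)$ trained on the empirical square loss with no extra scaling factor, which may differ from the normalization used in (1.2) and (1.3). Under that assumption the kernel is $K(x,\tilde{x})=\sum_{j}\left[\sigma(u_{j}^{T}x)\sigma(u_{j}^{T}\tilde{x})+w_{j}^{2}\sigma'(u_{j}^{T}x)\sigma'(u_{j}^{T}\tilde{x})\,x^{T}\tilde{x}\right]$. The function names are hypothetical.

```python
import numpy as np

def predict(X, U, w):
    """f(x_i) = sum_j w_j * relu(u_j^T x_i) for every row x_i of X."""
    return np.maximum(X @ U.T, 0.0) @ w

def dynamic_kernel(X, U, w):
    """Kernel matrix on the sample, under the assumed parametrization."""
    pre = X @ U.T
    act = np.maximum(pre, 0.0)                 # sigma(u_j^T x)
    dact = (pre > 0).astype(float)             # sigma'(u_j^T x) for ReLU
    K_out = act @ act.T                        # from the output weights w_j
    K_in = ((dact * w**2) @ dact.T) * (X @ X.T)  # from the input weights u_j
    return K_out + K_in

def check_kernel_dynamics(X, y, U, w, lr=1e-8):
    """Compare a finite-difference df/dt with -(1/n) * K @ (f - y)."""
    n = X.shape[0]
    pre = X @ U.T
    act = np.maximum(pre, 0.0)
    dact = (pre > 0).astype(float)
    f = act @ w
    resid = f - y
    grad_w = act.T @ resid / n
    grad_U = ((resid[:, None] * dact) * w[None, :]).T @ X / n
    f_new = predict(X, U - lr * grad_U, w - lr * grad_w)  # one tiny descent step
    finite_diff = (f_new - f) / lr
    predicted = -dynamic_kernel(X, U, w) @ resid / n
    return finite_diff, predicted
```

For a small enough step the two returned vectors should agree closely, and `dynamic_kernel` returns a symmetric positive semidefinite matrix because each of its two summands is a Gram matrix or an entrywise product of Gram matrices.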

Proof of Corollary 5.1.

The first equality follows from the proof of Lemma 5.1. Recall the following property of strongly convex functions:

[TABLE]

Therefore $-\mathbf{E}_{\mathbf{x},\mathbf{\tilde{x}}}\left[\Delta_{t}(\mathbf{x})K_{t}(\mathbf{x},\mathbf{\tilde{x}})\Delta_{t}(\mathbf{\tilde{x}})\right]\leq-\frac{\lambda_{t}}{n}\sum_{i=1}^{n}\Delta_{t}(x_{i})^{2}\leq-2\alpha\lambda_{t}\cdot\widehat{\mathop{\mathbf{E}}}\left[\ell(\mathbf{y},f_{t}(\mathbf{x}))-\ell(\mathbf{y},f_{*}(\mathbf{x}))\right].$ ∎

Proof of Lemma 5.2.

We know

[TABLE]

Consider the coordinate system $e_{1},e_{2},\ldots e_{d}$ such that $e_{1},e_{2}$ span the space of $x,\tilde{x}$, with

[TABLE]

where $\theta=\arccos(x^{T}\tilde{x})$ . Note $\mathbf{u}=[v_{1},v_{2},\ldots v_{d}]$ is still an isotropic Gaussian under this coordinate system. The constraint reads

[TABLE]

and one can see that $v_{2},\ldots v_{d}$ integrate out.

We focus on the polar coordinates $v_{1}=r\cos\phi,v_{2}=r\sin\phi$, so that $r^{2}\sim\chi^{2}(2)$ and $\phi\sim U[-\pi,\pi]$. W.l.o.g., we can consider the case when $\theta\in[0,\pi]$.

[TABLE]

Therefore, we get

[TABLE]

Similarly, we have

[TABLE]

Summing them up, we get the result. ∎
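The two Gaussian integrals of this type have well-known closed forms for unit-norm inputs and the ReLU activation: $\mathbf{E}[\mathbf{1}\{u^{T}x>0\}\mathbf{1}\{u^{T}\tilde{x}>0\}]=\frac{\pi-\theta}{2\pi}$ and $\mathbf{E}[\sigma(u^{T}x)\sigma(u^{T}\tilde{x})]=\frac{\sin\theta+(\pi-\theta)\cos\theta}{2\pi}$, with $u$ standard Gaussian and $\theta=\arccos(x^{T}\tilde{x})$. The sketch below checks these by Monte Carlo. It assumes $\|x\|=\|\tilde{x}\|=1$ and does not reproduce the exact normalization or combination of terms in Lemma 5.2, which is not visible in this excerpt. The function names are hypothetical.

```python
import numpy as np

def arccos_terms(theta):
    """Closed forms for unit-norm inputs at angle theta (ReLU, standard Gaussian u)."""
    k0 = (np.pi - theta) / (2.0 * np.pi)       # E[1{u.x>0} 1{u.xt>0}]
    k1 = (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2.0 * np.pi)  # E[relu relu]
    return k0, k1

def monte_carlo_terms(theta, d=5, samples=400_000, seed=0):
    """Monte Carlo estimates of the same two expectations."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    x[0] = 1.0                                 # x = e_1
    xt = np.zeros(d)
    xt[0], xt[1] = np.cos(theta), np.sin(theta)  # x~ in span(e_1, e_2)
    U = rng.normal(size=(samples, d))          # isotropic Gaussian draws of u
    a, b = U @ x, U @ xt
    k0 = np.mean((a > 0) & (b > 0))
    k1 = np.mean(np.maximum(a, 0.0) * np.maximum(b, 0.0))
    return k0, k1
```

Only the first two coordinates of $u$ affect the estimates, which mirrors the step in the proof where $v_{3},\ldots,v_{d}$ integrate out.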

Proof of Corollary 5.2.

Our proof essentially follows the same steps as for (5.1). First, we write down the dynamics of $f_{t}(x)$,

[TABLE]

Plugging in the training dynamics (A.4), we get

[TABLE]

Therefore, we have

[TABLE]

∎

A.2 Extensions

In this section, we extend the definition of the dynamic kernel in Section 5 to multi-layer neural networks, and construct a recursive expression for the kernel defined by the multi-layer perceptron (MLP). Let $\Theta_{i,j}^{l}$, $l=0,\cdots,h-1$, denote the coefficient from the $i$-th node on the $l$-th layer to the $j$-th node on the $(l+1)$-th layer. Let the input (before activation) of the $i$-th node on the $l$-th layer be $v_{i}^{l}(x)=\sum_{j}\Theta^{l-1}_{j,i}o^{l-1}_{j}(x)$, and let the output at that node be $o_{i}^{l}=\sigma(v_{i}^{l})$ for $l\notin\{0,h\}$ and $o_{i}^{l}=x_{i}$ for $l=0$. The final output is $g(x)=(v^{h}_{1}(x),v^{h}_{2}(x),\cdots,v^{h}_{L_{h}}(x))^{T}$. Let $L_{0}=d$ and let $L_{i}$ be the number of nodes on the $i$-th layer. Denote by $K_{t}^{h}(x,\tilde{x};\{\Theta^{l}\}_{l=0,\dots,h})$ the kernel of an $h$-layer NN. The training dynamics are still given by gradient flow: for all $\Theta$,

[TABLE]

Proposition A.2.

For an $(h+1)$-layer NN function denoted by $g(x)$, for simplicity, let

[TABLE]

Under the gradient flow training process, the corresponding kernel matrix has the following recursive representation:

[TABLE]

Here the kernel matrix is always positive semidefinite.

Proof of Proposition A.2.

For notational simplicity, let $K_{t}^{h+1}(x,\tilde{x})=K_{t}^{h+1}(x,\tilde{x};\{\Theta^{l}\}_{l=0,\dots,h+1})$ , and

[TABLE]

For the proof, we calculate the dynamics of the prediction $g(x)$. By elementary calculus, we have

[TABLE]

With the same calculation for the dynamics of $\Delta_{t}$ as in (A.7), we get

[TABLE]

By induction, we get

[TABLE]

Now we prove the positive semi-definiteness of the kernel. By induction, we only need to prove that the second term above is non-negative. We construct a canonical mapping $\phi_{h+1}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{L_{0}\times L_{1}}$, $\phi_{h+1}(x):=v(x)$, where the $(i,j)$-th coordinate is $v(x)_{i,j}=\frac{\partial g(x)}{\partial\Theta^{0}_{i,j}}$. Then the second term can be written as an inner product $\langle\phi_{h+1}(x),\phi_{h+1}(\tilde{x})\rangle$, so it is itself a positive semidefinite kernel, which gives the required non-negativity. ∎
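The feature-map argument can be illustrated on a small MLP. The sketch below assumes a bias-free network with ReLU hidden layers, a linear scalar output ($L_{h}=1$), and weight matrices `thetas[l]` of shape $(L_{l},L_{l+1})$ as in the notation above. It computes $\partial g(x)/\partial\Theta^{l}$ by manual backpropagation and forms one Gram matrix per layer. It does not reproduce the recursion of Proposition A.2 itself, and the function names are hypothetical.

```python
import numpy as np

def forward_backward(x, thetas):
    """Return g(x) and the list of gradients dg/dTheta^l for a scalar-output MLP."""
    outs, pres = [x], []                       # outs[l] = o^l, pres[l] = v^{l+1}
    for l, T in enumerate(thetas):
        v = T.T @ outs[-1]                     # v^{l+1}_i = sum_j Theta^l_{j,i} o^l_j
        pres.append(v)
        if l < len(thetas) - 1:
            outs.append(np.maximum(v, 0.0))    # ReLU on hidden layers only
    delta = np.ones(1)                         # dg/dv^h for the linear output
    grads = [None] * len(thetas)
    for l in reversed(range(len(thetas))):
        grads[l] = np.outer(outs[l], delta)    # dg/dTheta^l_{j,i} = o^l_j * delta_i
        if l > 0:
            # propagate to dg/dv^l through Theta^l and the ReLU derivative
            delta = (thetas[l] @ delta) * (pres[l - 1] > 0)
    return float(pres[-1][0]), grads

def layer_kernels(X, thetas):
    """One Gram matrix per layer: K_l[a, b] = <dg(x_a)/dTheta^l, dg(x_b)/dTheta^l>."""
    all_grads = [forward_backward(x, thetas)[1] for x in X]
    kernels = []
    for l in range(len(thetas)):
        feats = np.stack([g[l].ravel() for g in all_grads])  # (n, L_l * L_{l+1})
        kernels.append(feats @ feats.T)
    return kernels
```

Each entry of `layer_kernels(X, thetas)` is a Gram matrix of explicit feature vectors, so it is positive semidefinite, and so is their sum. The entry for `l = 0` corresponds to the inner product $\langle\phi_{h+1}(x),\phi_{h+1}(\tilde{x})\rangle$ used in the proof.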

