Learning time-scales in two-layers neural networks

Raphaël Berthier; Andrea Montanari; Kangjie Zhou

arXiv:2303.00055·cs.LG·March 25, 2025


TL;DR

This paper investigates the multi-scale and intermittent learning dynamics of two-layer neural networks in high-dimensional settings, revealing how different phases of training occur on distinct time scales.

Contribution

It provides a new theoretical framework for understanding the separation of time scales and intermittency in neural network training dynamics.

Findings

1. Identification of multiple learning time scales.

2. Demonstration of intermittency in gradient flow.

3. Validation through numerical simulations.



Equations

$$y_{i}=\varphi(\langle u_{*},x_{i}\rangle),\qquad x_{i}\sim\mathsf{N}(0,I_{d}),\qquad u_{*}\in\mathbb{S}^{d-1},$$

$$f(x;a,u)=\frac{1}{m}\sum_{i=1}^{m}a_{i}\sigma(\langle u_{i},x\rangle),\qquad a_{1},\dots,a_{m}\in\mathbb{R},\quad u_{1},\dots,u_{m}\in\mathbb{S}^{d-1},$$

$$\mathscrsfs{R}(a,u)=\frac{1}{2}\mathbb{E}\Big\{\Big(\varphi(\langle u_{*},x\rangle)-\frac{1}{m}\sum_{i=1}^{m}a_{i}\sigma(\langle u_{i},x\rangle)\Big)^{2}\Big\}\,.$$

$$(a_{i,\rm init},u_{i,\rm init})\sim{\rm P}_{A}\otimes\mathrm{Unif}(\mathbb{S}^{d-1}),$$

$$\varphi(z)=\sum_{k=0}^{\infty}\varphi_{k}\mathrm{He}_{k}(z),\qquad\sigma(z)=\sum_{k=0}^{\infty}\sigma_{k}\mathrm{He}_{k}(z)\,.$$

$$\mathscrsfs{R}_{\rm init}:=\lim_{m\to\infty}\lim_{d\to\infty}\mathscrsfs{R}(a_{\rm init},u_{\rm init})=\frac{1}{2}\Big(\varphi_{0}-\sigma_{0}\int a\,{\rm P}_{A}(\mathrm{d}a)\Big)^{2}+\frac{1}{2}\sum_{k\geqslant 1}\varphi_{k}^{2}\,.$$

$$\mathscrsfs{R}_{\infty}(t,\varepsilon)=\lim_{m\to\infty}\lim_{d\to\infty}\mathscrsfs{R}(a(t),u(t))\,.$$

$$\mathscrsfs{R}_{\infty}(t,\varepsilon)\xrightarrow[\varepsilon\to 0,\,t\to 0]{}\begin{cases}
\mathscrsfs{R}_{\rm init} & \text{if } t=o(\varepsilon),\\
\frac{1}{2}\sum_{k\geqslant 1}\varphi_{k}^{2} & \text{if } t=\omega(\varepsilon)\text{ and } t=\frac{1}{4|\sigma_{1}\varphi_{1}|}\varepsilon^{1/2}\log\frac{1}{\varepsilon}-\omega(\varepsilon^{1/2}),\\
\frac{1}{2}\sum_{k\geqslant 2}\varphi_{k}^{2} & \text{if } t=\frac{1}{4|\sigma_{1}\varphi_{1}|}\varepsilon^{1/2}\log\frac{1}{\varepsilon}+\omega(\varepsilon^{1/2})\text{ and } t=c_{2}\varepsilon^{1/4}-\omega(\varepsilon^{1/3}),\\
\frac{1}{2}\sum_{k\geqslant l}\varphi_{k}^{2} & \text{if } t=c_{l-1}\varepsilon^{1/(2(l-1))}+\omega(\varepsilon^{1/l})\text{ and } t=c_{l}\varepsilon^{1/(2l)}-\omega(\varepsilon^{1/(l+1)}),\ \text{for all } 3\leqslant l\leqslant L+1.
\end{cases}$$

$$y_{i}=\varphi(U_{*}^{\top}x_{i})+\varepsilon_{i},\qquad U_{*}\in\mathbb{R}^{k\times d},\qquad\varphi:\mathbb{R}^{k}\to\mathbb{R},$$

$$\sup\big\{\|\sigma^{\prime}\|_{L^{2}},\|\sigma^{\prime\prime}\|_{L^{2}}\big\}\leq M_{2},\qquad\sup\big\{\|\varphi\|_{L^{2}},\|\varphi^{\prime}\|_{L^{2}},\|\varphi^{\prime\prime}\|_{L^{2}}\big\}\leq M_{2}\,.$$

$$\sup_{s\in[-1,1]}|V^{\prime}(s)|\stackrel{(a)}{=}\sup_{s\in[-1,1]}\big|\mathbb{E}\{\varphi^{\prime}(G)\sigma^{\prime}(G_{s})\}\big|\stackrel{(b)}{\leq}\|\varphi^{\prime}\|_{L^{2}}\|\sigma^{\prime}\|_{L^{2}}\leq M_{2}^{2},$$

$$\mathscrsfs{R}(a,u)=\mathscrsfs{R}_{\mbox{\tiny\rm red}}(a,s,R):=\frac{1}{2}\|\varphi\|_{L^{2}}^{2}-\frac{1}{m}\sum_{i=1}^{m}a_{i}V(s_{i})+\frac{1}{2m^{2}}\sum_{i,j=1}^{m}a_{i}a_{j}U(r_{ij})\,.$$

$$\partial_{t}r_{ij}=\dots+a_{j}\Big(V^{\prime}(s_{j})(s_{i}-s_{j}r_{ij})-\frac{1}{m}\sum_{p=1}^{m}a_{p}U^{\prime}(r_{jp})(r_{ip}-r_{jp}r_{ij})\Big)\,.$$

$$\sup_{t\in[0,T]}\big|\mathscrsfs{R}(a(t),u(t))-\mathscrsfs{R}_{\mbox{\tiny\rm red}}(a^{0}(t),s^{0}(t),R^{0}(t))\big|\leq\frac{CM}{\sqrt{d}}\exp\left(MT(1+T)^{2}/\varepsilon^{2}\right)\,,$$

$$\max\Big(\sup_{t\in[0,T]}\frac{1}{\sqrt{m}}\|a(t)-a^{0}(t)\|_{2},\ \sup_{t\in[0,T]}\frac{1}{\sqrt{m}}\|s(t)-s^{0}(t)\|_{2}\Big)\leq\frac{C}{\sqrt{d}}\exp\left(MT(1+T)^{2}/\varepsilon^{2}\right),$$

$$\sup_{t\in[0,T]}\frac{1}{m}\|R(t)-R^{0}(t)\|_{F}\leq\frac{C}{\sqrt{d}}\exp\left(MT(1+T)^{2}/\varepsilon^{2}\right)\,.$$

$$\mathscrsfs{R}_{\mbox{\tiny\rm mf}}(a,s):=\mathscrsfs{R}_{\mbox{\tiny\rm red}}(a,s,R=ss^{\top})=\frac{1}{2}\|\varphi\|_{L^{2}}^{2}-\frac{1}{m}\sum_{i=1}^{m}a_{i}V(s_{i})+\frac{1}{2m^{2}}\sum_{i,j=1}^{m}a_{i}a_{j}U(s_{i}s_{j})\,.$$

$$\varepsilon\,\partial_{t}a_{i}=V(s_{i})-\frac{1}{m}\sum_{j=1}^{m}a_{j}U(s_{i}s_{j}),\qquad\partial_{t}s_{i}=a_{i}(1-s_{i}^{2})\Big(V^{\prime}(s_{i})-\frac{1}{m}\sum_{j=1}^{m}a_{j}U^{\prime}(s_{i}s_{j})s_{j}\Big)\,.$$

$$C(T)=M\exp\left(MT(1+T)^{2}/\varepsilon^{2}\right)$$

$$\sup_{t\in[0,T]}\frac{1}{m}\sum_{i=1}^{m}\big\|(a_{i}(t),s_{i}(t))-(a^{\mbox{\tiny\rm mf}}_{i}(t),s^{\mbox{\tiny\rm mf}}_{i}(t))\big\|_{2}^{2}\leq\frac{C(T)}{m}\,.$$

$$\sup_{t\in[0,T]}\big|\mathscrsfs{R}_{\mbox{\tiny\rm red}}(a(t),s(t),R(t))-\mathscrsfs{R}_{\mbox{\tiny\rm mf}}(a^{\mbox{\tiny\rm mf}}(t),s^{\mbox{\tiny\rm mf}}(t))\big|\leq\frac{C(T)}{m}\,.$$

$$\sup_{t\in[0,T]}\big|\mathscrsfs{R}(a(t),u(t))-\mathscrsfs{R}_{\mbox{\tiny\rm mf}}(a^{\mbox{\tiny\rm mf}}(t),s^{\mbox{\tiny\rm mf}}(t))\big|\leq\Big(\frac{1}{\sqrt{d}}+\frac{1}{m}\Big)CM\exp\left(MT(1+T)^{2}/\varepsilon^{2}\right)\,.$$

$$\partial_{t}\rho_{t}(a,s)=-\big(\partial_{a}(\rho_{t}\Psi_{a}(a,s;\rho_{t}))+\partial_{s}(\rho_{t}\Psi_{s}(a,s;\rho_{t}))\big),$$


Taxonomy

Topics: Neural Networks and Applications · Model Reduction and Neural Networks · Stochastic Gradient Optimization Techniques

Full text

Raphaël Berthier (EPFL), Andrea Montanari (Department of Electrical Engineering and Department of Statistics, Stanford University), Kangjie Zhou (Department of Statistics, Stanford University)

Abstract

Gradient-based learning in multi-layer neural networks displays a number of striking features. In particular, the decrease rate of empirical risk is non-monotone even after averaging over large batches. Long plateaus in which one observes barely any progress alternate with intervals of rapid decrease. These successive phases of learning often take place on very different time scales. Finally, models learnt in an early phase are typically ‘simpler’ or ‘easier to learn’ although in a way that is difficult to formalize.

Although theoretical explanations of these phenomena have been put forward, each of them captures at best certain specific regimes. In this paper, we study the gradient flow dynamics of a wide two-layer neural network in high-dimension, when data are distributed according to a single-index model (i.e., the target function depends on a one-dimensional projection of the covariates). Based on a mixture of new rigorous results, non-rigorous mathematical derivations, and numerical simulations, we propose a scenario for the learning dynamics in this setting. In particular, the proposed evolution exhibits separation of timescales and intermittency. These behaviors arise naturally because the population gradient flow can be recast as a singularly perturbed dynamical system.

1 Introduction
2 Setting and standard learning scenario
3 Further related work
4 The large-network, high-dimensional limit
4.1 Connection with mean field theory
4.2 A general formulation
5 Numerical solution
6 Timescales hierarchy in the gradient flow dynamics
6.1 First time scale: constant component
6.2 Second time scale: linear component I
6.3 Third time scale: linear component II
6.4 Conjectured behavior for larger time scales
7 Stochastic gradient descent and finite sample size
A Appendix to Section 4
A.1 Proof of Proposition 1
A.2 Proof of Corollary 1
A.3 Proof of Proposition 2
A.4 Derivation of the mean field dynamics (28)
A.5 Details of the alternative mean field approach
B Calculations for the analysis of mean-field gradient flow
B.1 Solution of Eq. (83)
B.2 Induced approximation of the risk
B.3 Proof of Theorem 1
C Proofs of Theorem 2 and 3: learning with projected SGD
C.1 Difference between GF and GD
C.2 Difference between GD and SGD
C.3 Difference between SGD and projected SGD
C.4 Proof of Theorem 3
D Counterexamples to the standard learning scenario
D.1 Case 1: $\sigma_{k}=0$ for some $k\in\mathbb{N}$
D.2 Case 2: $\varphi_{0}=\cdots=\varphi_{k}=0$ for some $k\geq 1$
D.3 Case 3: $\varphi_{k}=0$ for some $k\geq 1$

1 Introduction

It is a recurring empirical observation that the training dynamics of neural networks exhibits a whole range of surprising behaviors:

1. Plateaus. Plotting the training and test error as a function of SGD steps, using either small stepsize or large batches to average out stochasticity, reveals striking patterns. These error curves display long plateaus where barely anything seems to be happening, which are followed by rapid drops (Saad and Solla, 1995; Yoshida and Okada, 2019; Power et al., 2022).

2. Time-scales separation. The time window for this rapid descent is much shorter than the time spent in the plateaus. Additionally, subsequent phases of learning take increasingly longer times (Ghorbani et al., 2020a; Barak et al., 2022).

3. Incremental learning. Models learnt in the first phases of learning appear to be simpler than in later phases. Among others, Arpit et al. (2017) demonstrated that easier examples in a dataset are learned earlier; Kalimeris et al. (2019) showed that models learnt in the first phase of training correlate well with linear models; Gissin et al. (2019) showed that, in many simplified models, the dynamics of gradient descent explores the solution space in an incremental order of complexity; Power et al. (2022) demonstrated that, in certain settings, a function that approximates well the target is only learnt past the point of overfitting.

Understanding these phenomena is not merely a matter of intellectual curiosity. In particular, incremental learning plays a key role in our understanding of generalization in deep learning. Indeed, in this scenario, stopping the learning at a certain time $t$ amounts to controlling the complexity of the model learnt. The notion of complexity corresponds to the order in which the space of models is explored.

While a number of groups have developed models to explain these phenomena, it is fair to say that a complete picture is still lacking. An exhaustive overview of these works is out of place here. We will outline three possible explanations that have been developed in the past, and provide more pointers in Section 3.

Theory $\#1$ : Dynamics near singular points.

Several early works (Saad and Solla, 1995; Fukumizu and Amari, 2000; Wei et al., 2008) pointed out that the parametrization of multi-layer neural networks presents symmetries and degeneracies. For instance, the function represented by a multilayer perceptron is invariant under permutations of the neurons in the same layer. As a consequence, the population risk has multiple local minima connected through saddles or other singular sub-manifolds. Dynamics near these sub-manifolds naturally exhibits plateaus. Further, random or agnostic initializations typically place the network close to such submanifolds.

Theory $\#2$ : Linear networks.

Following the pioneering work of Baldi and Hornik (1989), a number of authors, most notably Saxe et al. (2013); Li et al. (2020), studied the behavior of deep neural networks with linear activations. While such networks can only represent linear functions, the training dynamics is highly non-linear. As demonstrated in Saxe et al. (2013), learning happens through stages that correspond to the singular value decomposition of the input-output covariance. Time scales are determined by the singular values.

Theory $\#3$ : Kernel regime.

Following an initial insight of Jacot et al. (2018), a number of groups proved that, for certain initializations, the training dynamics and the model learnt by overparametrized neural networks are well approximated by certain linearly parametrized models. In the limit of very wide networks, the training dynamics of these models converges in turn to the training dynamics of kernel ridge(less) regression (KRR) with respect to a deterministic kernel (independent of the random initialization). We refer to Bartlett et al. (2021) for an overview and pointers to this literature. Recently, Ghosh et al. (2021) showed that, in high dimension, the learning dynamics of KRR also exhibits plateaus and waterfalls, and learns functions of increasing complexity over a diverging sequence of timescales.

While each of these theories offers useful insights, it is important to realize that they do not agree on the basic mechanism that explains plateaus, time-scales separation, and incremental learning. In theory $\#1$, plateaus are associated with singular manifolds and high-dimensional saddles, while in theories $\#2$ and $\#3$ they are related to a hierarchy of singular values of a certain matrix. In $\#2$, the relevant singular values are the ones of the input-output covariance, and the fact that these singular values are well separated is postulated to be a property of the data distribution. In contrast, in $\#3$ the relevant singular values are the eigenvalues of the kernel operator, and hence completely independent of the output (the target function). In this case, eigenvalues which are very different are proved to exist under natural high-dimensional distributions.

Not only do these theories propose different explanations, but they are also motivated by very different simplified models. Theory $\#1$ has been developed only for networks with a small number of hidden units. Theory $\#2$ only applies to networks with multiple output units, because otherwise the input-output covariance is a $d\times 1$ matrix and hence has only one non-trivial singular value. Finally, theory $\#3$ applies under the conditions of the linear (a.k.a. lazy) regime, namely large overparametrization and suitable initialization (see, e.g., Bartlett et al. (2021)).

In order to better understand the origin of plateaus, time-scales separation, and incremental learning, we attempt a detailed analysis of gradient flow for two-layer neural networks. We consider a simple data-generation model, and propose a precise scenario for the behavior of learning dynamics. We do not assume any of the simplifying features of the theories described above: activations are non-linear; the number of hidden neurons is large; we place ourselves outside the linear (lazy) regime.

Our analysis is based on methods from dynamical systems theory: singular perturbation theory and matched asymptotic expansions. Unfortunately, we fall short of providing a general rigorous proof of the proposed scenario, but we can nevertheless prove it in several special cases and provide a heuristic argument supporting its generality.

The rest of the paper is organized as follows. Section 2 describes our data distribution, learning model, and the proposed scenario for the learning dynamics. We review further related work in Section 3. Section 4 describes the reduction of the gradient flow to a ‘mean field’ dynamics that will be the starting point of our analysis. Section 5 presents numerical evidence of the proposed learning scenario. Finally, Sections 6 and 7 present our analysis of the learning dynamics.

Notations.

In this paper, we use the classical asymptotic notations. The notations $f(\varepsilon)=o(g(\varepsilon))$ or $g(\varepsilon)=\omega(f(\varepsilon))$ as $\varepsilon\to 0$ both denote that $|f(\varepsilon)|/|g(\varepsilon)|\to 0$ in the limit $\varepsilon\to 0$. The notations $f(\varepsilon)=O(g(\varepsilon))$ or $g(\varepsilon)=\Omega(f(\varepsilon))$ both denote that the ratio $|f(\varepsilon)|/|g(\varepsilon)|$ remains bounded in the limit. The notations $f(\varepsilon)=\Theta(g(\varepsilon))$ or $f(\varepsilon)\asymp g(\varepsilon)$ denote that $f(\varepsilon)=O(g(\varepsilon))$ and $g(\varepsilon)=O(f(\varepsilon))$ both hold. Finally, $f(\varepsilon)\sim g(\varepsilon)$ denotes that $f(\varepsilon)/g(\varepsilon)\to 1$ in the limit.

2 Setting and standard learning scenario

We are given pairs $\{(x_{i},y_{i})\}_{i\leq n}$, where $x_{i}\in\mathbb{R}^{d}$ is a feature vector and $y_{i}\in\mathbb{R}$ is a response variable. We are interested in cases in which the feature vector is high-dimensional and does not contain strong structure, while the response depends on a low-dimensional projection of the data. We assume the simplest model of this type, the so-called single-index model:

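$$y_{i}=\varphi(\langle u_{*},x_{i}\rangle),\qquad x_{i}\sim\mathsf{N}(0,I_{d}),\qquad u_{*}\in\mathbb{S}^{d-1},\tag{1}$$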

where $\varphi:\mathbb{R}\to\mathbb{R}$ is a link function, $\mathsf{N}(0,I_{d})$ denotes the standard multivariate Gaussian distribution in dimension $d$ , and $\mathbb{S}^{d-1}:=\{v\in\mathbb{R}^{d}:\,\|v\|_{2}=1\}$ . We study the ability to learn model (1) using a two-layers neural network with $m$ hidden neurons:

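$$f(x;a,u)=\frac{1}{m}\sum_{i=1}^{m}a_{i}\sigma(\langle u_{i},x\rangle),\qquad a_{1},\dots,a_{m}\in\mathbb{R},\quad u_{1},\dots,u_{m}\in\mathbb{S}^{d-1},\tag{2}$$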

where $(a,u):=(a_{1},\cdots,a_{m},u_{1},\cdots,u_{m})$ collectively denotes all the model’s parameters. The factor $1/m$ in the definition is relevant for the initialization and learning rate. We anticipate that we will initialize the $a_{i}$’s to be of order one, which results in second-layer coefficients $a_{i}/m=\Theta(1/m)$. This is often referred to as the ‘mean-field initialization’ and is known to drive the learning process out of the linear or kernel regime, see e.g. (Mei et al., 2018b; Chizat and Bach, 2018; Ghorbani et al., 2020b; Yang and Hu, 2020; Abbe et al., 2022).
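As a concrete illustration of the two-layer network defined above, the following minimal Python sketch evaluates $f(x;a,u)$ with the $1/m$ mean-field scaling. The activation `tanh`, the constant initialization `a_i = 1`, the sizes `d` and `m`, and the helper names are illustrative assumptions, not choices made in the paper.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a vector uniformly from the unit sphere S^{d-1}."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def two_layer_net(x, a, u, sigma=math.tanh):
    """f(x; a, u) = (1/m) * sum_i a_i * sigma(<u_i, x>)."""
    m = len(a)
    total = 0.0
    for a_i, u_i in zip(a, u):
        pre_activation = sum(w * x_j for w, x_j in zip(u_i, x))
        total += a_i * sigma(pre_activation)
    return total / m


rng = random.Random(0)
d, m = 20, 50                              # hypothetical sizes
u = [random_unit_vector(d, rng) for _ in range(m)]
a = [1.0] * m                              # hypothetical order-one second-layer weights
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
y_hat = two_layer_net(x, a, u)
```

With order-one $a_{i}$ and a bounded activation, the output stays of order one however large $m$ is, which is the point of the $1/m$ normalization.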

The bulk of our work will be devoted to the analysis of projected gradient flow in $(a_{i},u_{i})_{1\leqslant i\leqslant m}$ on the population risk

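$$\mathscrsfs{R}(a,u)=\frac{1}{2}\mathbb{E}\Big\{\Big(\varphi(\langle u_{*},x\rangle)-\frac{1}{m}\sum_{i=1}^{m}a_{i}\sigma(\langle u_{i},x\rangle)\Big)^{2}\Big\}\,.\tag{3}$$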

In Section 7, we will bound the distance between stochastic gradient descent (SGD) and gradient flow in population risk. As a consequence, we will establish finite sample generalization guarantees for SGD learning.

Projected gradient flow with respect to the risk $\mathscrsfs{R}(a,u)$ is defined by the following ordinary differential equations (ODEs):

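$$\partial_{t}(\varepsilon a_{i})=-m\,\partial_{a_{i}}\mathscrsfs{R}(a,u),\tag{4}$$

$$\partial_{t}u_{i}=-m\,(I_{d}-u_{i}u_{i}^{\top})\nabla_{u_{i}}\mathscrsfs{R}(a,u).\tag{5}$$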

It is useful to make a few remarks about the definition of gradient flow:

• The projection $I_{d}-u_{i}u_{i}^{\top}$ ensures that $u_{i}$ remains on the unit sphere $\mathbb{S}^{d-1}$.

• The overall scaling of time is arbitrary, and the matching to SGD steps will be carried out in Section 7. The factors $m$ on the right-hand side are introduced for mathematical convenience, since the partial derivatives are of order $1/m$.

• The factor $\varepsilon$ introduced in the flow of the $a_{i}$’s reflects the fact that usually SGD is run with respect to the overall second-layer weights $(a_{i}/m)_{1\leq i\leq m}$. This would correspond to taking $\varepsilon=1/m$. However, we will keep $\varepsilon$ as a free parameter independent of $m$, and study the evolution for small $\varepsilon$.
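The remarks above can be made concrete with a rough explicit Euler discretization. The sketch assumes the flow takes the form $\varepsilon\,\partial_{t}a_{i}=-m\,\partial_{a_{i}}\mathscrsfs{R}$ and $\partial_{t}u_{i}=-m(I_{d}-u_{i}u_{i}^{\top})\nabla_{u_{i}}\mathscrsfs{R}$, replaces the population expectation by an average over a fixed batch, and renormalizes $u_{i}$ after each step to absorb discretization error. The `tanh` activation and link function, the step size, and all names are illustrative assumptions.

```python
import math
import random


def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))


def euler_step(a, u, xs, ys, eps, dt, sigma=math.tanh,
               dsigma=lambda z: 1.0 - math.tanh(z) ** 2):
    """One explicit Euler step of the projected flow, batch average for E{.}."""
    m, n, d = len(a), len(xs), len(xs[0])
    drift_a = [0.0] * m
    drift_u = [[0.0] * d for _ in range(m)]
    for x, y in zip(xs, ys):
        pre = [dot(u_i, x) for u_i in u]
        resid = y - sum(a_i * sigma(p) for a_i, p in zip(a, pre)) / m
        for i in range(m):
            # -m * dR/da_i and -m * dR/du_i for the squared loss
            drift_a[i] += resid * sigma(pre[i]) / n
            coef = resid * a[i] * dsigma(pre[i]) / n
            for j in range(d):
                drift_u[i][j] += coef * x[j]
    new_a = [a_i + (dt / eps) * g for a_i, g in zip(a, drift_a)]
    new_u = []
    for u_i, g in zip(u, drift_u):
        radial = dot(u_i, g)
        tangent = [g_j - radial * u_j for g_j, u_j in zip(g, u_i)]  # (I - u u^T) g
        moved = [u_j + dt * t_j for u_j, t_j in zip(u_i, tangent)]
        norm = math.sqrt(dot(moved, moved))
        new_u.append([v / norm for v in moved])  # renormalize (discretization fix)
    return new_a, new_u


rng = random.Random(0)
d, m, n = 5, 8, 100                                  # hypothetical sizes
u_star = [1.0] + [0.0] * (d - 1)
xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
ys = [math.tanh(dot(u_star, x)) for x in xs]         # hypothetical link function
a = [rng.uniform(-1.0, 1.0) for _ in range(m)]
u = []
for _ in range(m):
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    g_norm = math.sqrt(dot(g, g))
    u.append([v / g_norm for v in g])
for _ in range(10):
    a, u = euler_step(a, u, xs, ys, eps=0.1, dt=0.01)
```

Because the $a_{i}$ update is scaled by `dt / eps`, a small $\varepsilon$ makes the second-layer weights move much faster than the directions $u_{i}$, which is the source of the time-scale separation studied in the paper.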

We assume the initialization to be random with i.i.d. components $(a_{i,\rm{init}},u_{i,\rm{init}})$ :

$$(a_{i,\rm init},u_{i,\rm init})\sim{\rm P}_{A}\otimes\mathrm{Unif}(\mathbb{S}^{d-1}),\tag{6}$$

where ${\rm P}_{A}$ is a probability measure on $\mathbb{R}$ . The unique solution of the gradient flow ODEs with this initialization will be denoted by $(a(t),u(t))$ . We will be interested in the case of large networks ( $m\to\infty$ ) in high dimension ( $d\to\infty$ ). As shown below, the two limits commute (over fixed time horizons).

Our main finding is that, in a number of cases, $\varphi$ is learnt incrementally. Namely, the function $f(x;a(t),u(t))$ evolves over time according to a sequence of polynomial approximations of $\varphi(\langle u_{*},x\rangle)$ . These polynomial approximations are given by the decomposition of $\varphi$ in $L^{2}(\mathbb{R},\phi(x)\mathrm{d}x)$ , where $\phi(x)$ is the standard normal density: $\phi(x)=\exp(-x^{2}/2)/\sqrt{2\pi}$ . (For notational simplicity, we will use the shorthand $L^{2}$ instead of $L^{2}(\mathbb{R},\phi(x)\mathrm{d}x)$ in the sequel.)

In order to describe the polynomial approximations learnt during the training more explicitly, we decompose $\varphi$ and $\sigma$ into normalized Hermite polynomials:

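$$\varphi(z)=\sum_{k=0}^{\infty}\varphi_{k}\mathrm{He}_{k}(z),\qquad\sigma(z)=\sum_{k=0}^{\infty}\sigma_{k}\mathrm{He}_{k}(z)\,.\tag{7}$$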

Here, $\mathrm{He}_{k}$ denotes the $k$ -th Hermite polynomial, normalized so that $\left\|{\mathrm{He}_{k}}\right\|_{L^{2}(\mathbb{R},\phi(x)\mathrm{d}x)}=1$ .
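To make the normalization concrete, here is a small Python sketch that approximates the coefficients $\varphi_{k}=\mathbb{E}\{\varphi(G)\mathrm{He}_{k}(G)\}$, $G\sim\mathsf{N}(0,1)$, by a trapezoid rule. It uses the three-term recurrence $\mathrm{He}_{k+1}(z)=(z\,\mathrm{He}_{k}(z)-\sqrt{k}\,\mathrm{He}_{k-1}(z))/\sqrt{k+1}$ for the unit-norm Hermite polynomials. The test function $z^{2}$, the integration window, and the function names are illustrative assumptions.

```python
import math


def normalized_hermite(k_max, z):
    """Values He_0(z), ..., He_{k_max}(z), each with unit L^2(Gaussian) norm."""
    values = [1.0, z]
    for k in range(1, k_max):
        values.append((z * values[k] - math.sqrt(k) * values[k - 1]) / math.sqrt(k + 1))
    return values[: k_max + 1]


def hermite_coefficients(func, k_max, half_width=10.0, n=2001):
    """Approximate E{func(G) He_k(G)} for k = 0..k_max by a trapezoid rule."""
    h = 2.0 * half_width / (n - 1)
    coeffs = [0.0] * (k_max + 1)
    for i in range(n):
        z = -half_width + i * h
        # Gaussian density times the grid spacing (endpoint terms are negligible)
        weight = h * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        fz = func(z)
        for k, he in enumerate(normalized_hermite(k_max, z)):
            coeffs[k] += weight * fz * he
    return coeffs


# z^2 = He_0(z) + sqrt(2) * He_2(z) in the normalized basis
phi_coeffs = hermite_coefficients(lambda z: z * z, k_max=4)
```

For this test function only the coefficients of degree 0 and 2 are non-zero, since $z^{2}=1+\sqrt{2}\cdot(z^{2}-1)/\sqrt{2}$.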

As we will see, the incremental learning behavior arises for small $\varepsilon$ . By the law of large numbers (see below), the following almost sure limit exists (provided ${\rm P}_{A}$ is square integrable)

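$$\mathscrsfs{R}_{\rm init}:=\lim_{m\to\infty}\lim_{d\to\infty}\mathscrsfs{R}(a_{\rm init},u_{\rm init})=\frac{1}{2}\Big(\varphi_{0}-\sigma_{0}\int a\,{\rm P}_{A}(\mathrm{d}a)\Big)^{2}+\frac{1}{2}\sum_{k\geqslant 1}\varphi_{k}^{2}\,.$$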

We are now in position to describe the scenario that we will study in the rest of the paper.

Definition 1.

We say that the standard learning scenario holds up to level $L$ for a certain target function $\varphi$, activation $\sigma$, and distribution ${\rm P}_{A}$, if the following hold:

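1. The limit below exists:

$$\mathscrsfs{R}_{\infty}(t,\varepsilon)=\lim_{m\to\infty}\lim_{d\to\infty}\mathscrsfs{R}(a(t),u(t))\,.$$

2. There exist constants $c_{2},\dots,c_{L+1}>0$ such that the following asymptotic holds as $\varepsilon\to 0$, $t\to 0$:

$$\mathscrsfs{R}_{\infty}(t,\varepsilon)\xrightarrow[\varepsilon\to 0,\,t\to 0]{}\begin{cases}
\mathscrsfs{R}_{\rm init} & \text{if } t=o(\varepsilon),\\
\frac{1}{2}\sum_{k\geqslant 1}\varphi_{k}^{2} & \text{if } t=\omega(\varepsilon)\text{ and } t=\frac{1}{4|\sigma_{1}\varphi_{1}|}\varepsilon^{1/2}\log\frac{1}{\varepsilon}-\omega(\varepsilon^{1/2}),\\
\frac{1}{2}\sum_{k\geqslant 2}\varphi_{k}^{2} & \text{if } t=\frac{1}{4|\sigma_{1}\varphi_{1}|}\varepsilon^{1/2}\log\frac{1}{\varepsilon}+\omega(\varepsilon^{1/2})\text{ and } t=c_{2}\varepsilon^{1/4}-\omega(\varepsilon^{1/3}),\\
\frac{1}{2}\sum_{k\geqslant l}\varphi_{k}^{2} & \text{if } t=c_{l-1}\varepsilon^{1/(2(l-1))}+\omega(\varepsilon^{1/l})\text{ and } t=c_{l}\varepsilon^{1/(2l)}-\omega(\varepsilon^{1/(l+1)}),\ \text{for all } 3\leqslant l\leqslant L+1.
\end{cases}$$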

Figure 1 provides a cartoon illustration of the standard learning scenario.
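The plateau heights in Definition 1 depend only on the Hermite coefficients of $\varphi$: the risk starts at $\mathscrsfs{R}_{\rm init}=\frac{1}{2}(\varphi_{0}-\sigma_{0}\int a\,{\rm P}_{A}(\mathrm{d}a))^{2}+\frac{1}{2}\sum_{k\geq 1}\varphi_{k}^{2}$ and then steps through $\frac{1}{2}\sum_{k\geq l}\varphi_{k}^{2}$ for $l=1,2,\dots$. The sketch below tabulates these levels, together with the leading-order time $\frac{1}{4|\sigma_{1}\varphi_{1}|}\varepsilon^{1/2}\log(1/\varepsilon)$ of the second transition. The coefficient values, $\sigma_{0}$, $\sigma_{1}$, $\varepsilon$, and the function names are hypothetical; the constants $c_{l}$ are not computed.

```python
import math


def plateau_levels(phi, sigma0, mean_a):
    """Risk values along the scenario: R_init, then (1/2) * sum_{k>=l} phi_k^2."""
    tail = [0.5 * sum(c * c for c in phi[l:]) for l in range(1, len(phi) + 1)]
    r_init = 0.5 * (phi[0] - sigma0 * mean_a) ** 2 + tail[0]
    return [r_init] + tail


def second_takeoff_time(eps, sigma1, phi1):
    """Leading-order time at which the linear component is learnt."""
    return math.sqrt(eps) * math.log(1.0 / eps) / (4.0 * abs(sigma1 * phi1))


phi = [0.5, 1.0, 0.5, 0.25]      # hypothetical coefficients of a degree-3 target
levels = plateau_levels(phi, sigma0=0.3, mean_a=0.0)
t2 = second_takeoff_time(eps=1e-4, sigma1=0.8, phi1=phi[1])
```

For these hypothetical coefficients the levels are `[0.78125, 0.65625, 0.15625, 0.03125, 0.0]`: each plateau drops by $\frac{1}{2}\varphi_{l}^{2}$ once the degree-$l$ component is learnt.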

A specific realization of our general setup is determined by the triple $(\sigma,\varphi,{\rm P}_{A})$. In the rest of the paper, we will provide evidence showing that the standard learning scenario holds in a number of cases. Nevertheless, we can also construct examples in which it does not hold:

• If one or more of the Hermite coefficients of the activation vanish, then the standard scenario does not hold for general $\varphi$. Specifically, if $\sigma_{k}=0$, then for any $t$ the function $f(x;a(t),u(t))$ remains orthogonal to $\mathrm{He}_{k}(\langle u_{*},x\rangle)$. In particular, if $\varphi_{k}\neq 0$ then the risk remains bounded away from zero for every $t$. We refer to Appendix D.1 for a formal statement.

• If the first $k+1$ Hermite coefficients of $\varphi$ vanish, $\varphi_{0}=\dots=\varphi_{k}=0$, $k\geq 1$, then the standard scenario does not hold. (See Appendix D.2 for the proof.)

• In fact, we expect the standard scenario might fail every time one or more of the coefficients $\varphi_{k}$ vanish, for $k\geq 1$. Appendix D.3 provides some heuristic justification for this failure.

Remark 2.1.

We can compare the standard learning scenario described here to the ones in earlier literature, described as theories $\#1$, $\#2$, $\#3$ in the introduction. There are points of contact, but also important differences, with both theory $\#1$ and theory $\#3$:

• As in theory $\#1$, the plateaus and separation of time scales arise because the trajectory of gradient flow is approximated by a sequence of motions along submanifolds in the space of parameters $(a,u)$. Along the $l$-th such submanifold $f(x;a,u)$ is well-approximated by a degree-$l$ polynomial. Escaping each submanifold takes an increasingly longer time.

This is reminiscent of the motion between saddles investigated in earlier work (Saad and Solla, 1995; Fukumizu and Amari, 2000; Wei et al., 2008). However, unlike in earlier work, we will see that this applies to networks with a large (possibly diverging) number of hidden neurons. Also, we identify the subsequent phases of learning with the polynomial decomposition of Eq. (7).

• As in theory $\#3$, subsequent phases of learning correspond to increasingly accurate polynomial approximations of the target function $\varphi(\langle u_{*},x\rangle)$. However, the underlying mechanism and time scales are completely different. In the linear regime, the different time scales emerge because of increasingly small eigenvalues of the neural tangent kernel. In that case, the time required to learn degree-$l$ polynomials is of order $d^{l}$ (Ghosh et al., 2021).

In contrast, in the standard learning scenario, polynomials of degree $l$ are learnt on a time scale of order one in $d$ (and only depending on the learning rate $\varepsilon$). This of course has important implications when approximating gradient flow by SGD. Within the linear regime, the sample size required to learn polynomials of degree $l$ scales like $d^{l}$ (Ghosh et al., 2021), while in the standard scenario, it is only of order $d$ (see Section 7).

3 Further related work

As we mentioned in the introduction, plateaus and time scales in the learning dynamics of kernel models were analyzed by Ghosh et al. (2021). A sharp analysis for the related random features model was developed by Bodin and Macris (2021).

Our analysis builds upon the mean-field description of learning in two-layer neural networks, which was developed in a sequence of works, see, e.g., (Mei et al., 2018b; Rotskoff and Vanden-Eijnden, 2018; Chizat and Bach, 2018; Mei et al., 2019). In particular, we leverage the fact that, for the data distribution (1), the population risk function is invariant under rotations around the axis $u_{*}$, and this allows for a dimensionality reduction in the mean field description. Similar symmetry arguments were used by Mei et al. (2018b) and, more recently, by Abbe et al. (2022).

The single-index model can be learnt using simpler methods than large two-layer networks. Limiting ourselves to the case of gradient descent algorithms, Mei et al. (2018a) proved that gradient descent with respect to the non-convex empirical risk $\widehat{R}_{n}(u):=n^{-1}\sum_{i=1}^{n}(y_{i}-\varphi(u^{\top}x_{i}))^{2}$ converges to a near global optimum, provided $\varphi$ is strictly increasing. Ben Arous et al. (2021) considered online SGD under more challenging learning scenarios and characterized the time (sample size) for $|\langle u,u_{*}\rangle|$ to become significantly larger than for a random unit vector $u$ .

Learning in overparametrized two-layer networks under model (1) (or its variations) has been studied recently by several groups. In particular, Ba et al. (2022) consider a training procedure which runs a single step of gradient descent, then freezes the first layer and performs ridge regression with respect to the second layer. This scheme is amenable to a precise characterization of the generalization error. Bietti et al. (2022) consider a similar scheme in which a first phase of gradient descent is run to achieve positive correlation with the unknown direction $u_{*}$. Damian et al. (2022) also consider a two-phase scheme, and prove consistency and excess risk bounds for a more general class of target functions whereby the first equation in (1) is replaced by

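$$y_{i}=\varphi(U_{*}^{\top}x_{i})+\varepsilon_{i},\qquad U_{*}\in\mathbb{R}^{k\times d},\qquad\varphi:\mathbb{R}^{k}\to\mathbb{R},$$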

with $k\ll d$ . In particular, near optimal error bounds are obtained under a non-degeneracy condition on $\nabla^{2}\varphi$ .

Abbe et al. (2022) consider a similar model whereby $x\sim\mathrm{Unif}(\{+1,-1\}^{d})$, and $y=\varphi(x_{S})$ where $S\subseteq[d]$, and $x_{S}=(x_{i})_{i\in S}$ (i.e., $x_{S}$ contains the coordinates of $x$ indexed by entries of $S$). Under a structural assumption on $\varphi$ (the ‘merged staircase property’), and for $|S|$ fixed, they prove that a two-stage algorithm learns the target function with sample complexity of order $d$. This paper is technically related to ours in that it uses mean-field theory to obtain a characterization of learning in terms of a PDE in a reduced $(k+2)$-dimensional space.

A similar model was studied by Barak et al. (2022), who bound the sample complexity by $d^{O(k)}$ for learning parities on $k$ bits using gradient descent with large batches (if $k=O(1)$, Barak et al. (2022) require $O(1)$ steps with batch size $d^{O(k)}$).

Let us emphasize that our objective is quite different from these works. We do not allow ourselves deviations from standard SGD and try to derive a precise picture of the successive phases of learning (in particular, we do not consider two-stage schemes or layer-by-layer learning). On the other hand, we focus on a relatively simple model.

To clarify the difference, it is perhaps useful to rephrase our claims in terms of sample complexity. While previous works show that the target function can be learnt with $O(d)$ samples, we claim that it is learnt by online SGD with test error $r$ from about $C(r,\varepsilon)d$ samples and characterize the dependence of $C(r,\varepsilon)$ on $r$ for small $\varepsilon$ . (Falling short of a proof in the general case.)

After posting an initial version of this paper, we became aware that Arnaboldi et al. (2023) independently derived equations similar to (14)-(18), (25), (119). There are technical differences, and hence we cannot apply their results directly. However, Section 4.1 and Appendix A.4 are analogous to their work.

4 The large-network, high-dimensional limit

The first step of our analysis is a reduction of the system of ODEs (4), (5), with dimension $m(d+1)$ to a system of ODEs in $2m$ dimensions. We will achieve this reduction in two steps:

$(i)$

First we reduce to a system in $m(m+3)/2$ dimensions for the variables $a_{i}$ , $\langle u_{i},u_{j}\rangle$ , $\langle u_{i},u_{*}\rangle$ . This reduction is exact and is quite standard.

$(ii)$

We then show that the products $\langle u_{i},u_{j}\rangle$ can be eliminated, with an error $O(1/m)$ . As further discussed below, the resulting dynamics could also be derived from the mean field theory of Mei et al. (2018b); Rotskoff and Vanden-Eijnden (2018); Chizat and Bach (2018); Mei et al. (2019) (with the required modifications for the constraints $\|u_{i}\|=1$ ).
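The bookkeeping behind the two reduction steps can be checked with a few lines of code. The sketch below is only an illustration: the function name `state_dimensions` and the values of `m` and `d` are hypothetical choices, not taken from the paper. It counts the scalar unknowns of the full system, of the overlap system of step $(i)$, and of the final system of step $(ii)$.

```python
def state_dimensions(m: int, d: int) -> dict:
    """Count the scalar unknowns at each stage of the reduction."""
    # Full gradient flow: each neuron carries a_i in R and u_i in R^d.
    full = m * (d + 1)
    # Step (i): a_i, s_i = <u_i, u_*>, and r_ij = <u_i, u_j> for i < j.
    overlaps = m + m + m * (m - 1) // 2
    # Step (ii): only (a_i, s_i) remain once the r_ij are eliminated.
    mean_field = 2 * m
    return {"full": full, "overlaps": overlaps, "mean_field": mean_field}


# Illustrative sizes (assumed): 50 neurons in dimension 1000.
dims = state_dimensions(m=50, d=1000)
```

The count for step $(i)$ agrees with the closed form $m(m+3)/2$, and the input dimension $d$ only enters the first count.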

In order to define formally the reduced system, we define the functions $U,V:[-1,1]\to\mathbb{R}$ via:

[TABLE]

Note that the above identities follow from (O’Donnell, 2014, Proposition 11.31). Throughout this section, we will make the following assumptions.

A1.

The distribution of weights at initialization, ${\rm P}_{A}$ is supported on $[-M_{1},M_{1}]$ .

A2.

The activation function is bounded: $\left\|{\sigma}\right\|_{\infty}\leq M_{2}$ . Additionally, the functions $V$ and $U$ are bounded and of class $C^{2}$ , with uniformly bounded first and second derivatives over $s\in[-1,1]$ . A sufficient condition for this is

[TABLE]

A3.

Responses are bounded, i.e., $\|\varphi\|_{\infty}\leq M_{3}$ .

Remark 4.1.

We hereby briefly explain the sufficiency of $L^{2}$ -boundedness of derivatives of $\sigma$ and $\varphi$ as claimed in Assumption A2. Suppose for example that $\left\|{\sigma^{\prime}}\right\|_{L^{2}}\leq M_{2}$ and $\left\|{\varphi^{\prime}}\right\|_{L^{2}}\leq M_{2}$ , then we have

[TABLE]

where $(a)$ follows from Gaussian integration by parts and $(b)$ follows from Cauchy-Schwarz inequality.

Our first statement establishes reduction $(i)$ mentioned above. The proof of this fact is presented in Appendix A.1.

Proposition 1 (Reduction to $d$ -independent flow).

Define $s_{i}=\langle u_{i},u_{*}\rangle$ , $r_{ij}=\langle u_{i},u_{j}\rangle$ for $i,j=1,\dots,m$ . Then, letting $R=(r_{ij})_{i,j\leq m}$ , we have

[TABLE]

If $(a(t),u(t))$ solve the gradient flow ODEs (4)-(5) then $(a(t),s(t),R(t))$ are the unique solution of the following set of ODEs (note that $r_{ii}=1$ identically)

[TABLE]

The input dimension $d$ does not appear in the reduced ODEs, Eqs. (15) to (18), and only plays a role in the initialization of the $s_{i}$ ’s and the $r_{ij}$ ’s. Namely, since $u_{i,\rm{init}}\sim\mathrm{Unif}(\mathbb{S}^{d-1})$ , we can represent $u_{i,\rm{init}}=g_{i}/\|g_{i}\|_{2}$ with $g_{i}\sim\mathsf{N}(0,I_{d}/d)$ . By concentration of $\|g_{i}\|_{2}$ , this implies that, for $1\leq i<j\leq m$ , $s_{i}$ , $r_{ij}$ are approximately $\mathsf{N}(0,1/d)$ .
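The scaling of the initial overlaps can be checked by simulation. The following is a minimal sketch, assuming (without loss of generality, by rotational invariance) that $u_{*}$ is the first basis vector; the helper names and the sample sizes are hypothetical. It draws uniform points on the sphere by normalizing Gaussian vectors and estimates $d\cdot\mathrm{Var}(s_{i})$, which should be close to $1$. Since the vector is normalized, the variance of the Gaussian entries is irrelevant, so standard normals are used instead of $\mathsf{N}(0,1/d)$ entries.

```python
import math
import random


def sample_unit_vector(d, rng):
    """Uniform point on S^{d-1}, obtained by normalizing a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def overlap_variance(d, n_samples, seed=0):
    """Empirical variance of s = <u, u_*> with u uniform and u_* = e_1 (assumed)."""
    rng = random.Random(seed)
    overlaps = [sample_unit_vector(d, rng)[0] for _ in range(n_samples)]
    mean = sum(overlaps) / n_samples
    return sum((s - mean) ** 2 for s in overlaps) / n_samples


d = 100
scaled_var = d * overlap_variance(d, n_samples=2000)  # expected to be close to 1
```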

This discussion immediately yields the following consequence.

Corollary 1.

Let $(a(t),u(t))$ be the solution of the gradient flow ODEs (4), (5) with initialization (6), and let $(a^{0}(t),s^{0}(t),R^{0}(t))$ be the unique solution of Eqs. (15) to (18), with initialization $a^{0}_{i}(0)=a_{i}(0)$ , $s^{0}_{i}(0)=0$ , $r^{0}_{ij}(0)=0$ for $i\neq j$ . Then, for any fixed $T$ (possibly dependent on $m$ but not on $d$ ), the following holds with probability at least $1-\exp(-C^{\prime}m)$ over the i.i.d. initialization $(a_{i}(0),u_{i}(0))_{i\in[m]}$ :

[TABLE]

Here $C,C^{\prime}$ are absolute constants and $M$ only depends on the $M_{i}$ ’s in Assumptions A1-A3.

The proof of Corollary 1 is deferred to Appendix A.2. From now on, we will assume the initialization $s^{0}_{i}(0)=0$ , $r^{0}_{ij}(0)=0$ for $i\neq j$ , but drop the superscript $0$ for notational simplicity. We notice in passing that the right-hand sides of Eqs. (19) to (21) are independent of $m$ : this approximation step holds uniformly over $m$ . (Note that the left-hand sides are normalized by $m$ so as to yield the root mean square error per entry.)

In order to state the reduction $(ii)$ outlined above, we define the mean field risk as

[TABLE]

Further, we denote by $\{a^{\mbox{\tiny\rm mf}}_{i}(t),s^{\mbox{\tiny\rm mf}}_{i}(t)\}_{i=1}^{m}$ the solution to the following ODEs:

[TABLE]

Note that (23) would be identical to (15)-(16) if we had $r_{ij}=s_{i}s_{j}$ . A priori, this is not the case. However, the two systems of equations are close to each other for large $m$ as made precise by our next proposition, which formalizes reduction $(ii)$ .

Proposition 2 (Reduction to flow in $\mathbb{R}^{2m}$ ).

Let $(a_{i}(t),s_{i}(t),r_{ij}(t))_{1\leq i<j\leq m}$ be the unique solution of the ODEs (15)-(18) with initialization $s_{i}(0)=0$ , $r_{ij}(0)=0$ for all $1\leq i\neq j\leq m$ . Let $(a^{\mbox{\tiny\rm mf}}_{i}(t),s^{\mbox{\tiny\rm mf}}_{i}(t))_{i\leq m}$ be the unique solution of the ODEs (23) with initialization $s^{\mbox{\tiny\rm mf}}_{i}(0)=0$ , $a^{\mbox{\tiny\rm mf}}_{i}(0)=a_{i}(0)$ for all $i\leq m$ .

If assumptions A1-A3 hold, then for any $T<\infty$ there exists a constant

[TABLE]

(with $M$ depending on the constants $\{M_{i}\}_{1\leq i\leq 3}$ appearing in Assumptions A1-A3 only) such that:

[TABLE]

Consequently,

[TABLE]

The proof of this proposition is deferred to Appendix A.3. Now, combining the propositions and corollaries in this section, we deduce that with high probability over the i.i.d. initialization,

[TABLE]

4.1 Connection with mean field theory

Consider the empirical distributions of the neurons:

[TABLE]

with $(a_{i}(t),s_{i}(t))_{i\leq m}$ , $(a^{\mbox{\tiny\rm mf}}_{i}(t),s^{\mbox{\tiny\rm mf}}_{i}(t))_{i\leq m}$ as in the statement of Proposition 2, i.e., solving (respectively) Eqs. (15)-(18) and Eq. (23) with initial conditions as given there.

Then, it is immediate to show that $\rho_{t}$ solves (in the weak sense) the following continuity partial differential equation (PDE) (we refer to Ambrosio et al. (2005); Santambrogio (2015) for the definition of weak solutions and basic properties, and to Appendix A.4 for a short derivation):

[TABLE]

where $\Psi=(\Psi_{a},\Psi_{s})$ is given by

[TABLE]

This equation can be extended to a flow in the whole space $(\mathscr{P}(\mathbb{R}^{2}),W_{2})$ (all probability measures on $\mathbb{R}^{2}$ equipped with the second Wasserstein distance), and interpreted as gradient flow with respect to this metric in the following risk:

[TABLE]

which is the obvious extension of $\mathscrsfs{R}_{\mbox{\tiny\rm mf}}(a,s)$ of Eq. (22) to general probability distributions. Proposition 2 implies that for any $T<\infty$ , and under the above initial conditions,

[TABLE]

If we further denote by $\rho_{t}^{d}$ the empirical distribution of $(a_{i}(t),s_{i}(t))$ , $i\leq m$ , when $s_{i}(0)=\langle u_{i}(0),u_{*}\rangle$ , $u_{i}(0)\sim\mathrm{Unif}(\mathbb{S}^{d-1})$ , a further application of Corollary 1 yields

[TABLE]

Starting with Mei et al. (2018b); Chizat and Bach (2018); Rotskoff and Vanden-Eijnden (2018), several authors used continuity PDEs of the form (28) to study the learning dynamics of two-layer neural networks. Following the physics tradition, this is referred to as the ‘mean-field theory’ of two-layer neural networks. Appendix A.5 sketches an alternative approach to prove bounds of the form (25), (34) using the results of Mei et al. (2018b, 2019). The present derivation has the advantages of yielding a sharper bound and of being self-contained.

4.2 A general formulation

As mentioned above, the system of ODEs in Eq. (23) is a special case of the Wasserstein gradient flow of Eq. (28) whereby we set $\rho_{0}=m^{-1}\sum_{i=1}^{m}\delta_{(a_{i}^{\mbox{\tiny\rm mf}}(0),s_{i}^{\mbox{\tiny\rm mf}}(0))}$ . In order to study the solutions of Eq. (28) (hence Eq. (23)) we adopt the following framework. Let $(\Omega,\rho)$ denote a probability space. Let $a=a(\omega,t)$ and $s=s(\omega,t)$ ( $\omega\in\Omega$ , $t\geqslant 0$ ) be two measurable functions satisfying (dropping dependencies in $t$ below)

[TABLE]

If $\omega=i\in\Omega=\{1,\dots,m\}$ endowed with the uniform measure, we obtain the equations (23). In general, the push-forward $\rho_{t}$ of the measure $\rho$ through the map $\omega\in\Omega\mapsto(a(\omega,t),s(\omega,t))\in\mathbb{R}^{2}$ satisfies the mean-field equation (28). As a consequence, the dynamics (35) can be viewed as a gradient flow on the risk

[TABLE]

5 Numerical solution

In Figure 2, we present the result of an Euler discretization of Eqs. (23) where $\varphi$ is a degree- $2$ polynomial and $\sigma$ is the ReLU activation: $\sigma(s)=\max(s,0)$ ,

[TABLE]

These plots clearly display two of the features emphasized in the introduction: $(i)$ plateaus separated by periods of rapid improvement of the risk; $(ii)$ increasingly long timescales (notice the logarithmic time axis in the second and third rows).

In order to examine the incremental learning structure, we rewrite the risk $\mathscrsfs{R}_{\mbox{\tiny\rm mf}}$ of Eq. (22) by decomposing $\varphi$ and $\sigma$ in the basis of Hermite polynomials

[TABLE]

We observe that, for small $\varepsilon$ , the Hermite coefficients of $\varphi$ are learned sequentially, in the order of their degree. When $\varepsilon$ is sufficiently small (right plots), this incremental learning happens in well separated phases. The plateaus and waterfalls in the plots of $\mathscrsfs{R}_{\mbox{\tiny\rm mf}}$ correspond to the network learning increasingly higher degree polynomials.
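To make the Hermite decomposition concrete, the sketch below computes the first few Hermite coefficients of the ReLU activation by numerical quadrature. It assumes the convention $\sigma_{k}=\mathbb{E}[\sigma(G)\mathrm{He}_{k}(G)]$ with $G\sim\mathsf{N}(0,1)$ and probabilists' Hermite polynomials; the normalization used in the paper's display may differ by a factor depending on $k$. The helper names are hypothetical.

```python
import math


def hermite_he(k, x):
    """Probabilists' Hermite polynomial He_k(x), via He_{n+1} = x He_n - n He_{n-1}."""
    h_prev, h = 1.0, x
    if k == 0:
        return h_prev
    for n in range(1, k):
        h_prev, h = h, x * h - n * h_prev
    return h


def gaussian_expectation(f, lo=-10.0, hi=10.0, n=20000):
    """Trapezoid-rule approximation of E[f(G)] for G ~ N(0, 1)."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * step
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * f(x) * math.exp(-0.5 * x * x)
    return total * step / math.sqrt(2.0 * math.pi)


def relu(x):
    return max(x, 0.0)


# Coefficients sigma_0, ..., sigma_3 of ReLU under the assumed convention.
relu_coeffs = [
    gaussian_expectation(lambda x, k=k: relu(x) * hermite_he(k, x))
    for k in range(4)
]
```

Under this convention the constant, linear and quadratic coefficients of ReLU are all non-zero, while the cubic one vanishes, so a degree-$2$ target exercises exactly the components that the activation can represent.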

In Figure 3 we plot the evolution of the values of the $a_{i}$ and $s_{i}$ , for $i\in\{1,\dots,m\}$ . We observe that the order of magnitude of the $a_{i}$ ’s and the $s_{i}$ ’s increases when passing through the different phases of the incremental learning process.

Altogether, the results of Figures 2 and 3 are consistent with the standard learning scenario up to level $L=2$ as per Definition 1. While we conjecture that incremental learning also occurs for higher-order polynomials, we found this hard to observe in numerical simulations.

First, as predicted in Definition 1, the times at which the components are learned are closer on a logarithmic scale as the degree increases. It is therefore increasingly difficult to observe time scales corresponding to higher degrees.

Second, we expect there to be a choice of the initialization $(a_{i,\rm{init}},u_{i,\rm{init}})_{i\in[m]}$ , activation and target function, for which not all the components of $\varphi$ are actually learnt. We observed empirically that this happens easily for small $m$ .

6 Timescales hierarchy in the gradient flow dynamics

We are interested in the behavior of the solution of the ODEs (35), initialized from $s(\omega,0)=0$ for all $\omega$ (as per Proposition 2). The standard learning scenario of Definition 1 concerns the behavior of solutions for $\varepsilon\to 0$ . This type of question can be addressed within the theory of dynamical systems using singular perturbation theory (Holmes, 2013) (‘singular’ refers to the fact that $\varepsilon$ multiplies one of the highest-order derivatives).

As a side remark, we note that the system (35) can be seen as a slow-fast dynamical system, where the $a(\omega)$ ’s are the fast variables and the $s(\omega)$ ’s are the slow variables (Berglund, 2001). Formally, the time derivative of the $a(\omega)$ ’s is multiplied by a factor $(1/\varepsilon)$ . From a dynamical systems perspective, the present case is made complicated because of a bifurcation when the $s(\omega)$ ’s become non-zero.

The standard learning scenario provides a detailed description of this bifurcation. We will motivate this scenario using a classical, but non-rigorous, technique of singular perturbation theory, called the matched asymptotic expansion (Holmes, 2013, Chapter 2). This technique decomposes the approximation of the solution in several time scales on which a regular approximation holds. These time scales are traditionally called layers in the literature; however, we avoid this terminology due to the potential confusion with the layers of the neural network.

We will work mainly using the Hermite representation of the dynamical ODEs (35), which we write down for the reader’s convenience:

[TABLE]

Sections 6.1-6.3 respectively describe the first three time scales of the matched asymptotic expansion of (39). This gives, for each time scale, an approximation of the $a(\omega)$ , $s(\omega)$ . In Appendix B.2, we detail how these sections induce an evolution of the risk alternating plateaus and rapid decreases, and support the standard learning scenario of Definition 1. Finally, in Section 6.4, we conjecture the behavior on longer time scales.

Notations.

We denote by $\mathds{1}$ the constant function $\mathds{1}:\omega\in\Omega\mapsto 1\in\mathbb{R}$ . We denote by $\langle.,.\rangle_{L^{2}(\rho)}$ the dot product on $L^{2}(\rho)$ and by $\|.\|_{L^{2}(\rho)}$ the associated norm. For $x\in L^{2}(\rho)$ , we denote by $x_{\perp}$ the orthogonal projection of $x$ on the hyperplane $\mathds{1}^{\perp}$ of $L^{2}(\rho)$ of functions orthogonal to $\mathds{1}$ :

$x_{\perp}=x-\langle x,\mathds{1}\rangle_{L^{2}(\rho)}\,\mathds{1}\,.$

We denote $a_{\text{init}}(\omega)=a(\omega,0)$ and thus $a_{\perp,\text{\rm{init}}}$ is the orthogonal projection of $a_{\text{init}}$ on $\mathds{1}^{\perp}$ .
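For a discrete measure $\rho$ these notations reduce to elementary weighted sums. The sketch below is an illustration under the assumption that $\Omega=\{1,\dots,m\}$ with the uniform measure; the helper names and the sample values of $a_{\text{init}}$ are hypothetical. Because $\rho$ is a probability measure, $\langle\mathds{1},\mathds{1}\rangle_{L^{2}(\rho)}=1$, so the projection simply subtracts the mean.

```python
def inner(x, y, weights):
    """Inner product in L^2(rho) for a discrete measure rho with the given weights."""
    return sum(w * xi * yi for w, xi, yi in zip(weights, x, y))


def project_perp(x, weights):
    """Orthogonal projection of x on the hyperplane of functions orthogonal to 1."""
    ones = [1.0] * len(x)
    mean = inner(x, ones, weights)  # <x, 1>; note <1, 1> = 1 for a probability measure
    return [xi - mean for xi in x]


m = 4
weights = [1.0 / m] * m              # uniform measure on {1, ..., m} (assumed)
a_init = [1.0, -2.0, 0.5, 2.5]       # illustrative values
a_perp_init = project_perp(a_init, weights)
```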

6.1 First time scale: constant component

We define a “fast” time variable $t_{1}=t/\varepsilon$ and replace it in Eq. (39). We expand the solutions $a(\omega)$ and $s(\omega)$ in powers of $\varepsilon$ :

[TABLE]

where $a^{(0)}(\omega),a^{(1)}(\omega),a^{(2)}(\omega),\dots,s^{(0)}(\omega),s^{(1)}(\omega),s^{(2)}(\omega),\dots$ are implicitly functions of $t_{1}$ . They are initialized at

[TABLE]

to be consistent with the initial condition $a(\omega,t_{1}=0)=a(\omega,t=0)=a_{\rm{init}}(\omega)$ and $s(\omega,t_{1}=0)=s(\omega,t=0)=0$ .

We substitute the expansion in (39):

[TABLE]

The basic assumption of matched asymptotic expansions is that terms of the same order in $\varepsilon$ can be identified (with some limitations that we develop below). For now, let us identify terms of order $1=\varepsilon^{0}$ :

[TABLE]

From (51) and (43), we have $s^{(0)}(\omega)=0$ : time $t_{1}=O(1)\Leftrightarrow t=O(\varepsilon)$ is too short for the $s(\omega)$ to be of order $1$ .

Substituting $s^{(0)}(\omega)=0$ in (50), we obtain

[TABLE]

Recall that $\langle.,.\rangle_{L^{2}(\rho)}$ is the dot product on $L^{2}(\rho)$ , $\mathds{1}$ denotes the constant function $\mathds{1}:\omega\in\Omega\mapsto 1\in\mathbb{R}$ and $a_{\perp}$ is the orthogonal projection of $a$ on $\mathds{1}^{\perp}$ . Equation (52) can be rewritten as

[TABLE]

which gives after integration (using (42)):

[TABLE]

At this point, we have determined $a^{(0)}(\omega)$ and $s^{(0)}(\omega)$ , and thus $a(\omega)=a^{(0)}(\omega)+O(\varepsilon)$ and $s(\omega)=s^{(0)}(\omega)+O(\varepsilon)$ up to an $O(\varepsilon)$ precision, which is sufficient to obtain an $o(1)$ -approximation of the risk $\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}$ (see Appendix B.2). However, note that we could obtain more precise estimates by identifying higher-order terms in (44)-(49). For instance, identifying the $O(\varepsilon)$ terms in (47)-(49), we obtain $\partial_{t_{1}}s^{(1)}(\omega)=a^{(0)}(\omega)\sigma_{1}\varphi_{1}$ . This shows that the $s(\omega)$ become non-zero, though only of order $\varepsilon$ on the time scale $t_{1}\asymp 1$ ; the inner-layer weights develop an infinitesimal correlation with the true direction $u_{*}$ thanks to the linear component of $\sigma$ and $\varphi$ .

The approximation constructed above should be considered as valid on the time scale $t_{1}\asymp 1\Leftrightarrow t\asymp\varepsilon$ . The approximation breaks down when we reach a new time scale, at which the $s(\omega)$ are large enough for the $a(\omega)$ to be affected (at leading order) by the linear part of the functions. We detail the new time scale and its resolution in the next section.
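The behavior on the first time scale can be illustrated numerically. The sketch below assumes that the leading-order equation takes the form $\partial_{t_{1}}a^{(0)}(\omega)=\sigma_{0}\big(\varphi_{0}-\sigma_{0}\int a^{(0)}(\nu)\mathrm{d}\rho(\nu)\big)$ on a uniform discrete measure; this is a reading of the surrounding discussion rather than a transcription of the displayed equations, and the function names and parameter values are hypothetical. Under this assumption, the mean of $a^{(0)}$ relaxes exponentially to $\varphi_{0}/\sigma_{0}$ at rate $\sigma_{0}^{2}$ while $a_{\perp}$ stays frozen at $a_{\perp,\rm{init}}$.

```python
import math


def first_scale_closed_form(a_init, sigma0, phi0, t1):
    """Closed form under the assumed ODE: the mean relaxes, a_perp is constant."""
    mean0 = sum(a_init) / len(a_init)
    target = phi0 / sigma0
    mean_t = target + (mean0 - target) * math.exp(-sigma0 ** 2 * t1)
    return [mean_t + (ai - mean0) for ai in a_init]


def first_scale_euler(a_init, sigma0, phi0, t1, n_steps=20000):
    """Explicit Euler integration of the same assumed ODE, as a cross-check."""
    a = list(a_init)
    dt = t1 / n_steps
    for _ in range(n_steps):
        drift = sigma0 * (phi0 - sigma0 * sum(a) / len(a))  # identical for all neurons
        a = [ai + dt * drift for ai in a]
    return a


a_init = [1.0, -2.0, 0.5, 2.5]   # illustrative initialization
sigma0, phi0 = 0.8, 1.2          # illustrative coefficients, phi0 / sigma0 = 1.5
a_closed = first_scale_closed_form(a_init, sigma0, phi0, t1=20.0)
a_euler = first_scale_euler(a_init, sigma0, phi0, t1=20.0)
```

For large $t_{1}$ both computations approach $\frac{\varphi_{0}}{\sigma_{0}}\mathds{1}+a_{\perp,\rm{init}}$, which is the state from which the next time scale starts.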

6.2 Second time scale: linear component I

In this section, we seek a second, slower time scale, for which the behavior of the asymptotic expansion is different.

Identification of the scale.

Consider $t_{2}=\frac{t}{\varepsilon^{\gamma}}$ , where $\gamma<1$ is to be determined. We rewrite the system (39) using $t_{2}$ , and expand the solutions $a(\omega)$ and $s(\omega)$ :

[TABLE]

(Since within the previous time scale we obtained $s(\omega)=O(\varepsilon)$ , it is natural to assume $s^{(0)}(\omega)=0$ .)

Let us pause to comment on our method.

Similarly to what has been done in the previous time scale, we will substitute the expansions (54)-(55) in the equations (39) in order to compute the different terms in the expansion. However, this step also allows us to compute the exponents $\gamma$ and $\delta$ , which give respectively the new time scale and the size of the $s(\omega)$ ’s.

Note that we should have proceeded similarly for the first time scale, by introducing a first time variable $t_{1}=\frac{t}{\varepsilon^{\gamma^{\prime}}}$ , expanding $a(\omega),s(\omega)$ in powers $1,\varepsilon^{\delta^{\prime}},\varepsilon^{2\delta^{\prime}},\dots$ , and determining $\gamma^{\prime}$ and $\delta^{\prime}$ a posteriori. This would have led, indeed, to $\gamma^{\prime}=1$ and $\delta^{\prime}=1$ . However, for simplicity, we preferred to fix these values that are natural a priori.

Finally, note that the expansions (40)-(41) and (54)-(55) are different, because they are valid on different time scales. In fact, the only coherence condition that we require below is that the expansions match in a joint asymptotic where $t_{1}=\frac{t}{\varepsilon}\to\infty$ and $t_{2}=\frac{t}{\varepsilon^{\gamma}}\to 0$ . We thus build different approximations for each one of the time scales, with some matching conditions; this justifies the name of matched asymptotic expansion.

We now return to our computations and substitute (54)-(55) in (39):

[TABLE]

and thus

[TABLE]

For the first time scale, we chose $\gamma=\delta=1$ , so that the terms of order $\varepsilon^{\delta}$ were negligible compared to $\varepsilon^{1-\gamma}\partial_{t_{2}}a^{(0)}(\omega)$ in (56). This means that the linear components $\sigma_{1},\varphi_{1}$ of the functions had no effect on the $a(\omega)$ at leading order. We are now interested in a new time scale where $\varepsilon^{1-\gamma}\partial_{t_{2}}a^{(0)}(\omega)$ and $\varepsilon^{\delta}\sigma_{1}\varphi_{1}s^{(1)}(\omega)$ are of the same order, i.e., $1-\gamma=\delta$ ; then the linear components play a role in the dynamics.

Further, for $s^{(1)}(\omega)$ to be non-zero, we need both sides of (58) to be of the same order, thus $\delta=\gamma$ . Putting these two conditions together gives $\gamma=\delta=1/2$ .

Derivation of the ODEs for this time scale.

Let us summarize the equations. For $t_{2}=\frac{t}{\varepsilon^{\nicefrac{{1}}{{2}}}}$ and

[TABLE]

we have from (56)-(58):

[TABLE]

First, we identify the terms of order $1=\varepsilon^{0}$ :

[TABLE]

This means that the trajectory remains in the affine hyperplane such that $\varphi_{0}={\sigma_{0}}\int a^{(0)}(\nu)\mathrm{d}\rho(\nu)$ ; intuitively, the constant part of $\varphi$ remains learned in this second time scale.

Second, we identify the terms of order $\varepsilon^{\nicefrac{{1}}{{2}}}$ in (59)-(61):

[TABLE]

In (63), the first term of the right hand side depends on the unknown higher-order terms $a^{(1)}(\nu)$ ; in fact, this is best interpreted as the Lagrange multiplier associated to the constraint (62). To eliminate this Lagrange multiplier, we use again the compact notations:

[TABLE]

and thus

[TABLE]

Matching.

The initialization of the ODEs (65)-(66) for the second time scale is determined by a classical procedure that matches with the previous time scale. In this paragraph, we denote by $\underline{a},\underline{s}$ the approximation obtained in the first time scale (Section 6.1), and by $\overline{a},\overline{s}$ the approximation in the second time scale, described above.

Consider an intermediate time scale $\widetilde{t}=\frac{t}{\varepsilon^{\alpha}}$ , $\nicefrac{{1}}{{2}}<\alpha<1$ , and assume $\widetilde{t}\asymp 1$ so that

[TABLE]

In this intermediate regime, we want the approximations provided on the first and the second time scales to match: $\underline{a}(\widetilde{t})$ and $\overline{a}(\widetilde{t})$ (resp. $\underline{s}(\widetilde{t})$ and $\overline{s}(\widetilde{t})$ ) should match to leading order.

From the first time scale approximation,

[TABLE]

From the second time scale approximation,

[TABLE]

By matching, Equations (73) and (75) should be coherent. Thus the ODE for the second time scale should be initialized from $\overline{a}^{(0)}(0)=\frac{\varphi_{0}}{\sigma_{0}}\mathds{1}+a_{\perp,\rm{init}}$ .

Similarly, the matching procedure gives that the ODE for the second time scale should be initialized from $\overline{s}^{(1)}(0)=0$ .

Solution.

As we are done with the matching procedure, we now consider the solution in the second time scale only, that we denote again by $a$ , $s$ as in (65), (66). The matching procedure motivates us to consider the solution of (67)-(68) initialized at $a_{\perp}^{(0)}(0)=a_{\perp,\rm{init}}$ , $s_{\perp}^{(1)}=0$ . This gives

[TABLE]

To conclude, we note that $\langle a^{(0)},\mathds{1}\rangle_{L^{2}(\rho)}=\frac{\varphi_{0}}{\sigma_{0}}$ is constrained by (62). Further, from (64),

[TABLE]

thus $\langle s^{(1)},\mathds{1}\rangle_{L^{2}(\rho)}=\sigma_{1}\varphi_{1}\frac{\varphi_{0}}{\sigma_{0}}t_{2}$ .

Putting together, these equations give:

[TABLE]

We observe that $a^{(0)}$ and $s^{(1)}$ diverge as $t_{2}\to\infty$ . This implies that our approximation on the second time scale must break down at a certain point. Indeed, we analyzed this time scale under the assumption that both $a^{(0)}$ and $s^{(1)}$ are of order $1$ . However, since $a^{(0)}$ and $s^{(1)}$ diverge exponentially as $t_{2}\to\infty$ , as per Eq. (76), this assumption breaks down when $t_{2}\asymp\log(1/\varepsilon)$ .

More precisely, in (59) (resp. (61)), the $O(\varepsilon)$ term includes a term of the form

[TABLE]

When $a^{(0)}$ and $s^{(1)}$ become of order $\varepsilon^{-\nicefrac{{1}}{{4}}}$ , this term becomes of order $\varepsilon^{\nicefrac{{1}}{{4}}}$ , which is then of the same order as the term $\varepsilon^{\nicefrac{{1}}{{2}}}\sigma_{1}\varphi_{1}s^{(1)}(\omega)$ in (59) (resp. the term $\varepsilon^{\nicefrac{{1}}{{2}}}\sigma_{1}\varphi_{1}a^{(0)}(\omega)$ in (61)). At this point, these terms cannot be neglected anymore. From (76), we have

[TABLE]

Therefore, $a^{(0)}$ and $s^{(1)}$ become of order $\varepsilon^{-\nicefrac{{1}}{{4}}}$ at the time $t_{2}\sim\frac{1}{4|\sigma_{1}\varphi_{1}|}\log\frac{1}{\varepsilon}$ , at which the approximation on the second time scale breaks down. We thus introduce a new time scale centered at this critical point.
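The exponential growth and the breakdown time can be reproduced with a short computation. The sketch below assumes that the components orthogonal to $\mathds{1}$ obey the linear system $\partial_{t_{2}}a_{\perp}^{(0)}=\sigma_{1}\varphi_{1}s_{\perp}^{(1)}$ , $\partial_{t_{2}}s_{\perp}^{(1)}=\sigma_{1}\varphi_{1}a_{\perp}^{(0)}$ with $s_{\perp}^{(1)}(0)=0$ ; this is inferred from the discussion above and is not a transcription of the displayed equations. It also assumes that the neglected $O(\varepsilon)$ term is cubic in the amplitudes, which is the only power compatible with the orders of magnitude quoted in the text. All names and numerical values are hypothetical.

```python
import math


def perp_closed_form(a_perp_init, c, t2):
    """Solution of da/dt = c*s, ds/dt = c*a with s(0) = 0, where c = sigma_1 * phi_1."""
    return a_perp_init * math.cosh(c * t2), a_perp_init * math.sinh(c * t2)


def perp_rk4(a_perp_init, c, t2, n_steps=2000):
    """Fourth-order Runge-Kutta integration of the same assumed system."""
    def rhs(a, s):
        return c * s, c * a

    a, s = a_perp_init, 0.0
    h = t2 / n_steps
    for _ in range(n_steps):
        k1 = rhs(a, s)
        k2 = rhs(a + 0.5 * h * k1[0], s + 0.5 * h * k1[1])
        k3 = rhs(a + 0.5 * h * k2[0], s + 0.5 * h * k2[1])
        k4 = rhs(a + h * k3[0], s + h * k3[1])
        a += h * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6
        s += h * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6
    return a, s


def breakdown_time(eps, c):
    """Time t_2 at which exp(|c| t_2) reaches eps^(-1/4)."""
    return math.log(1.0 / eps) / (4.0 * abs(c))


eps = 1e-8
c = 0.5 * (-1.2)                     # illustrative sigma_1 * phi_1 (sign is irrelevant)
t2_star = breakdown_time(eps, c)
amplitude = math.exp(abs(c) * t2_star)          # equals eps^(-1/4)
a_star, s_star = perp_closed_form(1.0, c, t2_star)
a_num, s_num = perp_rk4(1.0, c, t2_star)
# Balance of orders at breakdown: eps * amplitude^3 versus eps^(1/2) * amplitude.
cubic_term = eps * amplitude ** 3
linear_term = math.sqrt(eps) * amplitude
```

At $t_{2}=t_{2,*}$ the two terms `cubic_term` and `linear_term` coincide, which is the numerical counterpart of the statement that the neglected term can no longer be dropped.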

6.3 Third time scale: linear component II

We now introduce the time $t_{3}=t_{2}-\frac{1}{4|\varphi_{1}\sigma_{1}|}\log\frac{1}{\varepsilon}$ . As $t_{3}$ is only a translation of $t_{2}$ , the ODEs in terms of $t_{3}$ are the same as the ones in terms of $t_{2}$ . However, in this time scale, $a$ and $\varepsilon^{-\nicefrac{{1}}{{2}}}s$ have diverged. Consistently with the discussion above, we seek expansions of the form

[TABLE]

Similarly to the second time scale, we substitute (77)-(78) in (39) and obtain

[TABLE]

First, we identify the terms of order $\varepsilon^{-\nicefrac{{1}}{{4}}}$ :

[TABLE]

This means that $a$ has no component diverging in $\varepsilon$ in the direction of $\mathds{1}$ .

Second, we identify the terms of order $1=\varepsilon^{0}$ :

[TABLE]

Put together with (79), this equation ensures that the constant component of $\varphi$ remains learned on this third time scale.

Third, we identify the terms of order $\varepsilon^{\nicefrac{{1}}{{4}}}$ :

[TABLE]

Again, the term $-{\sigma_{0}^{2}}\int a^{(1)}(\nu)\mathrm{d}\rho(\nu)$ is best interpreted as the Lagrange multiplier associated to the constraints (79), (80). Using the compact notations,

[TABLE]

where in the last equality we use (79). Thus we can rewrite (81) as

[TABLE]

and thus

[TABLE]

In Appendix B.1, we solve this system of ODEs and determine the initial condition by matching with the previous time scale. The result is that

[TABLE]

where $\lambda=\lambda(t_{3})$ is the function

[TABLE]

This solution completes the description of how the linear part of the function $\varphi$ is learned.

6.4 Conjectured behavior for larger time scales

The analysis of the previous sections naturally suggests the existence of a sequence of cutoffs. At each time scale, a new polynomial component of $\varphi$ is learned within a window that is much shorter than the time elapsed before that phase started. Along this sequence, we expect $s$ and $a$ to grow to increasingly larger scales in $\varepsilon$ (but $s$ remains $o(1)$ while $a$ diverges).

More precisely, we assume that during the $l$ -th phase, the network learns the degree- $l$ component $\varphi_{l}$ , and various quantities satisfy the following scaling behavior:

[TABLE]

where $\omega_{l}>0$ is an increasing sequence and $\beta_{l},\mu_{l}>0$ are decreasing sequences. Further, while learning of this component takes place when $t=O(\varepsilon^{\mu_{l}})$ , the actual evolution of the risk (and of the neural network) takes place on much shorter scales, namely:

[TABLE]

where $\nu_{l}$ is also decreasing, with $\nu_{l}>\mu_{l}$ . The goal of this section is to provide heuristic arguments to conjecture the values of $\omega_{l}$ , $\beta_{l}$ , $\mu_{l}$ and $\nu_{l}$ . We will base this conjecture on a rigorous analysis of a simplified model.

The simplified model is motivated by the expectation (supported by the heuristics and simulations in the previous sections) that learning each component happens independently from the details of the evolution on previous time scales. In the simplified model, the activation function $\sigma(x)$ is proportional to the $l$ -th Hermite polynomial, namely $\sigma(x)=\sigma_{l}\mathrm{He}_{l}(x)$ . This is the component of $\sigma$ that we expect to be relevant on the $l$ -th time scale. The gradient flow equations (39) then read:

[TABLE]

with corresponding risk component

[TABLE]

We capture the effect of learning dynamics on the previous time scales by the overall magnitude of the $a(\omega)$ ’s and $s(\omega)$ ’s at initialization. Namely, we choose the scale of initialization of the simplified model to be given by the end of the $(l-1)$ -th time scale, i.e., $a(\omega)\asymp\varepsilon^{-\omega_{l-1}}$ and $s(\omega)\asymp\varepsilon^{\beta_{l-1}}$ . Further, in order for the $(l-1)$ -th component to be learned, namely

[TABLE]

we require $\omega_{l-1}=(l-1)\beta_{l-1}$ so that $\int a(\nu)s(\nu)^{l-1}\mathrm{d}\rho(\nu)=\Theta(1)$ . Analogously, we assume $\omega_{l}=l\beta_{l}$ .

Based on this consideration, we introduce the rescaled variables

[TABLE]

Rewriting Eq. (88) in terms of $\widetilde{a}(\omega)$ ’s and $\widetilde{s}(\omega)$ ’s, and using $\omega_{l}=l\beta_{l}$ , we get that

[TABLE]

In order for the $\widetilde{a}(\omega)$ ’s and $\widetilde{s}(\omega)$ ’s to be learned simultaneously, we need $1-2l\beta_{l}=2\beta_{l}$ , which implies $\beta_{l}=1/(2(l+1))$ . Making a further change of the time variable $t=\varepsilon^{\nu_{l}}\tau$ , where $\nu_{l}=2\beta_{l}=1/(l+1)$ , it follows that

[TABLE]

Moreover, rewriting the risk in terms of the rescaled variables $\widetilde{a},\widetilde{s}$ , $\mathscrsfs{R}_{l}(\tau)=\mathscrsfs{R}_{l}(\widetilde{a}(\tau),\widetilde{s}(\tau))$ satisfies the ODE:

[TABLE]

Note that with our choice of $\beta_{l}$ and $\omega_{l}$ , we have $\omega_{l}-\omega_{l-1}=\beta_{l-1}-\beta_{l}=1/(2l(l+1))$ . This means that the $\widetilde{a}(\omega)$ ’s and $\widetilde{s}(\omega)$ ’s are initialized at the same scale, namely

[TABLE]

The theorem below describes quantitatively the dynamics of the simplified model for small $\varepsilon$ , and determines the value of $\mu_{l}$ (recall that $\nu_{l}=1/(l+1)$ ):

Theorem 1 (Evolution of the simplified gradient flow).

Assume $l\geq 2$ and let $(\widetilde{a}(\omega,\tau),\widetilde{s}(\omega,\tau))_{\tau\geq 0}$ be the unique solution of the ODE system (91), initialized as per Eq. (93) (note in particular that $\sigma_{l}\varphi_{l}\widetilde{a}(\omega,0)\widetilde{s}(\omega,0)^{l}\asymp\varepsilon^{1/(2l)}$ ). Then the following hold:

$(a)$

Let us denote

[TABLE]

and assume $\rho(A)>0$ . For $\Delta\in(0,\varphi_{l}^{2}/2)$ , define

[TABLE]

Then, for any fixed $\Delta$ we have $\tau(\Delta)=\Theta(\varepsilon^{-(l-1)/(2l(l+1))})$ as $\varepsilon\to 0$ . Further, if $\rho$ is a discrete probability measure, then there exists $\tau_{*}(\varepsilon)=\Theta(\varepsilon^{-(l-1)/(2l(l+1))})$ and, for any $\Delta>0$ , a constant $c_{*}(\Delta)>0$ independent of $\varepsilon$ such that

[TABLE]

namely the $l$ -th component is learnt in an $O(1)$ time window around $\tau_{*}(\varepsilon)=\Theta(\varepsilon^{-(l-1)/(2l(l+1))})$ .

$(b)$

Similarly, we denote

[TABLE]

If $\rho(B)>0$ , then the same claims as in $(a)$ hold.

$(c)$

Suppose that neither of the conditions at points $(a)$ , $(b)$ holds, and that

[TABLE]

for almost every $\omega\in\Omega$ . Then, for such $\omega\in\Omega$ and each $\Delta>0$ , there exists a constant $C_{*}(\omega,\Delta)>0$ such that

[TABLE]

meaning that $\widetilde{s}(\omega,\tau)$ converges to $0$ eventually.

We further note that $\tau=\Theta(\varepsilon^{-(l-1)/(2l(l+1))})\Longleftrightarrow t=\Theta(\varepsilon^{\mu_{l}})$ with $\mu_{l}=1/(2l)$ , and $\tau=O(1)\Longleftrightarrow t=O(\varepsilon^{\nu_{l}})$ with $\nu_{l}=1/(l+1)$ .

The proof of Theorem 1 is deferred to Appendix B.3.
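The relations between the exponents $\beta_{l}$ , $\omega_{l}$ , $\nu_{l}$ and $\mu_{l}$ can be verified in exact rational arithmetic. The sketch below only re-checks identities stated in this section; the function names are hypothetical.

```python
from fractions import Fraction


def exponents(l):
    """Return (beta_l, omega_l, nu_l, mu_l) as exact fractions."""
    beta = Fraction(1, 2 * (l + 1))
    omega = l * beta              # omega_l = l * beta_l
    nu = 2 * beta                 # nu_l = 1 / (l + 1)
    mu = Fraction(1, 2 * l)       # mu_l = 1 / (2 l)
    return beta, omega, nu, mu


def check_level(l):
    """Check the identities used in the scaling argument at level l >= 2."""
    beta, omega, nu, mu = exponents(l)
    beta_prev, omega_prev, _, _ = exponents(l - 1)
    gap = Fraction(1, 2 * l * (l + 1))          # common initialization scale
    plateau = Fraction(l - 1, 2 * l * (l + 1))  # tau_* = Theta(eps^(-plateau))
    return (
        1 - 2 * l * beta == 2 * beta            # a and s evolve simultaneously
        and omega - omega_prev == gap
        and beta_prev - beta == gap
        and nu - plateau == mu                  # t = eps^nu * tau gives eps^mu
        and (l + 1) * gap == Fraction(1, 2 * l) # scale of a * s^l at initialization
        and nu > mu
    )


all_consistent = all(check_level(l) for l in range(2, 50))
```

For $l=1$ the same formulas give $\beta_{1}=\omega_{1}=1/4$ and $\mu_{1}=\nu_{1}=1/2$ , in agreement with the scalings found on the second and third time scales.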

Remark 6.1.

Under the conditions of cases $(a)$ and $(b)$ , we see that the degree- $l$ component of the target function is learnt within an $O(\varepsilon^{1/(l+1)})$ time window around $t_{*}(l,\varepsilon)\asymp\varepsilon^{1/(2l)}$ , which is consistent with the timescales conjectured in Definition 1.
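The plateau followed by a rapid drop can be observed on a toy instance. The sketch below is not the system (91) itself: it assumes a single neuron, drops the correction coming from the sphere constraint, and takes the rescaled flow to be $\partial_{\tau}\widetilde{a}=\sigma_{l}\widetilde{s}^{\,l}\,r$ , $\partial_{\tau}\widetilde{s}=l\sigma_{l}\widetilde{a}\,\widetilde{s}^{\,l-1}r$ with residual $r=\varphi_{l}-\sigma_{l}\widetilde{a}\,\widetilde{s}^{\,l}$ and risk $r^{2}/2$ , which is a reading of the scaling argument above. The initial scale `delta0` plays the role of $\varepsilon^{1/(2l(l+1))}$ , so the predicted plateau length scales as `delta0 ** -(l - 1)`. All names and values are hypothetical.

```python
def simulate(delta0, l=2, sigma_l=1.0, phi_l=1.0, dt=1e-3, tau_max=60.0):
    """Explicit Euler scheme for one neuron of the assumed simplified flow.

    Returns the initial risk, the first time the risk falls below half of its
    initial value (the end of the plateau), and the final risk.
    """
    a, s = delta0, delta0
    risk0 = 0.5 * (phi_l - sigma_l * a * s ** l) ** 2
    escape_time = None
    n_steps = int(round(tau_max / dt))
    for step in range(n_steps):
        resid = phi_l - sigma_l * a * s ** l
        if escape_time is None and 0.5 * resid ** 2 < 0.5 * risk0:
            escape_time = step * dt
        da = sigma_l * s ** l * resid
        ds = l * sigma_l * a * s ** (l - 1) * resid
        a, s = a + dt * da, s + dt * ds
    final_risk = 0.5 * (phi_l - sigma_l * a * s ** l) ** 2
    return risk0, escape_time, final_risk


risk0_a, escape_a, final_a = simulate(delta0=0.1)
risk0_b, escape_b, final_b = simulate(delta0=0.05)
# For l = 2 the plateau length is predicted to scale like 1 / delta0.
plateau_ratio = escape_b / escape_a
```

In this toy setting the risk stays close to $\varphi_{l}^{2}/2$ for a time that grows as the initialization shrinks, and then drops to zero within an $O(1)$ window in $\tau$ ; halving `delta0` roughly doubles the plateau.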

Remark 6.2.

Case $(c)$ corresponds to $s(\omega)/s(\omega,0)$ becoming close to $0$ in time $t=O(\varepsilon^{\mu_{l}})$ , and staying at $0$ . In other words, the neurons become orthogonal to the target direction and play no role in learning higher-degree components any longer.

Informally, case $(c)$ couples the learning of different polynomial components. It can happen that the learning phase $l-1$ induces an effective initialization $(\widetilde{a}(\omega,0),\ \widetilde{s}(\omega,0))$ within the domain of case $(c)$ .

We expect this not to be the case for suitable choices of initialization (or equivalently ${\rm P}_{A}$ ), $\varphi$ , and $\sigma$ . Establishing this would amount to establishing that the standard learning scenario holds.

7 Stochastic gradient descent and finite sample size

So far we focused on analyzing the projected gradient flow (GF) dynamics with respect to the population risk, as defined in Eqs. (4)-(5). In this section, we extract the implications of our analysis of GF on online projected stochastic gradient descent, which is a projected version of the SGD dynamics (151).

For simplicity of notation, we denote by $z=(y,x)\in{\mathbb{R}}\times{\mathbb{R}}^{d}$ a datapoint and by $\theta_{i}=(a_{i},u_{i})\in{\mathbb{R}}\times\mathbb{S}^{d-1}$ the parameters of neuron $i$ . For $z=(y,x)$ and $\rho^{(m)}=(1/m)\sum_{i=1}^{m}\delta_{\theta_{i}}=(1/m)\sum_{i=1}^{m}\delta_{(a_{i},u_{i})}$ , we define

[TABLE]

The projected SGD dynamics is specified as follows:

[TABLE]

where for $u\in\mathbb{R}^{d}$ and compact $S\subset\mathbb{R}^{d}$ , $\operatorname{Proj}_{S}(u):=\operatorname*{argmin}_{s\in S}\left\|{s-u}\right\|_{2}$ , and $\overline{\rho}^{(m)}:=(1/m)\sum_{i=1}^{m}\delta_{\overline{\theta}_{i}}$ . Note that the $(\overline{a}_{i},\overline{u}_{i})$ ’s here are different from the $(\overline{a},\overline{s})$ ’s in Section 6.
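For $S=\mathbb{S}^{d-1}$ and $u\neq 0$ , the projection defined above is simply the normalization $u/\|u\|_{2}$ . The sketch below illustrates this operator together with a generic projected step; the update direction is left as an arbitrary input and is not the stochastic gradient of Eq. (101). The function names are hypothetical.

```python
import math


def proj_sphere(u):
    """Euclidean projection of a non-zero vector u onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in u))
    if norm == 0.0:
        # Every point of the sphere is equidistant from the origin.
        raise ValueError("the projection of the origin onto the sphere is not unique")
    return [x / norm for x in u]


def projected_step(u, direction, eta):
    """Move u by eta * direction, then project back onto the sphere.

    `direction` stands for an arbitrary update direction (an assumption for
    illustration), not for the specific gradient used in the paper.
    """
    moved = [ui + eta * di for ui, di in zip(u, direction)]
    return proj_sphere(moved)


u = proj_sphere([3.0, 4.0])                      # [0.6, 0.8]
u_next = projected_step(u, direction=[1.0, 0.0], eta=0.1)
```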

We prove that, for small $\eta$ , the projected SGD of Eq. (101) is close to the gradient flow of Eqs. (4)-(5). Throughout this section, we make the following assumptions similar to those assumed in Section 4:

A1.

$\rho_{0}$ is supported on $[-M_{1},M_{1}]\times\mathbb{S}^{d-1}$ . Hence, $|a_{i}(0)|\leq M_{1}$ for all $i\in[m]$ .

A2.

The activation function is bounded: $\left\|{\sigma}\right\|_{\infty}\leq M_{2}$ . Additionally, define for $u,u^{\prime}\in\mathbb{R}^{d}$ :

[TABLE]

We then require the functions $V$ and $U$ to be bounded and differentiable, with uniformly bounded and Lipschitz continuous gradients for all $\left\|{u}\right\|_{2},\left\|{u^{\prime}}\right\|_{2}\leq 2$ :

[TABLE]

As in Remark 4.1, we can show that a sufficient condition for Eqs. (104) and (105) is

[TABLE]

where the constant $M_{2}^{\prime}$ depends only on $M_{2}$ .

A3.

Assume $(x,y)\sim\mathds{P}$ . We require that $y\in[-M_{3},M_{3}]$ almost surely. Moreover, we assume that for all $\left\|{u}\right\|_{2}\leq 2$ , both $\sigma(\langle u,x\rangle)$ and $\sigma^{\prime}(\langle u,x\rangle)(x-\langle u,x\rangle u)$ are $M_{3}$ -sub-Gaussian.

The following theorem upper bounds the distance between gradient flow and projected stochastic gradient descent dynamics.

Theorem 2 (Difference between GF and Projected SGD).

Let $\theta_{i}(t)=(a_{i}(t),u_{i}(t))$ be the solution of the GF ordinary differential equations (4)-(5). There exists a constant $M$ that only depends on the $M_{i}$ ’s from Assumptions A1-A3, such that for any $T,z\geq 0$ and

[TABLE]

the following holds with probability at least $1-\exp(-z^{2})$ :

[TABLE]

The proof is presented in Appendix C and follows the same scheme as the proof of Theorem 1 part (B) in (Mei et al., 2019). The main difference with respect to that theorem is that here we are interested in projected SGD (and GF) instead of plain SGD (and GF); hence an additional approximation step is required, and the $a_{i}$ ’s and $u_{i}$ ’s need to be treated separately. We next draw implications of the last result on learning by online SGD within the standard learning scenario.

Theorem 3.

Fix any $\delta>0$ . Assume $\varphi,\sigma$ and the initialization ${\rm P}_{A}$ are such that the standard learning scenario of Definition 1 holds up to level $L$ for some $L\geq 2$ , and that

[TABLE]

Then, there exist constants $\varepsilon_{*}=\varepsilon_{*}(\delta)$ , $T_{0}=T_{0}(\delta)$ , $T=T(\varepsilon,\delta)=T_{0}(\delta)\varepsilon^{1/(2L)}$ and $M=M(\varepsilon,\delta)$ that depend on $\varepsilon,\delta$ (together with $\varphi,\sigma$ and ${\rm P}_{A}$ ) such that the following happens. Assume $\varepsilon\leq\varepsilon_{*}(\delta)$ and $m,d,z$ are such that $d\geq M$ , $m\geq\max(M,z)$ , and the step size $\eta$ and number of samples (equivalently, number of steps) $n$ satisfy

[TABLE]

Then, with probability at least $1-e^{-z}$ , the projected gradient descent algorithm of Eq. (101) achieves population risk smaller than $\delta$ :

[TABLE]

The proof of Theorem 3 is deferred to Appendix C.4.

Remark 7.1.

Within the lazy or neural tangent regime, learning the projection of the target function $\varphi(\langle u_{*},x\rangle)$ onto polynomials of degree $\ell$ requires $n\gg d^{\ell}$ samples, and $m\gg d^{\ell-1}$ neurons (Ghorbani et al., 2021; Mei et al., 2022; Montanari and Zhong, 2022).

In contrast, Theorem 3 shows that, within the standard learning scenario, $O(d)$ samples and $O(1)$ neurons are sufficient. Further, as per Theorem 2, the learning dynamics are accurately described by the GF analyzed in the previous sections.

Acknowledgments

This work was supported by the NSF through award DMS-2031883, the Simons Foundation through Award 814639 for the Collaboration on the Theoretical Foundations of Deep Learning, the NSF grant CCF-2006489 and the ONR grant N00014-18-1-2729, and a grant from Eric and Wendy Schmidt at the Institute for Advanced Studies. Part of this work was carried out while Andrea Montanari was on partial leave from Stanford and a Chief Scientist at Ndata Inc dba Project N. The present research is unrelated to AM’s activity while on leave.

Appendix A Appendix to Section 4

A.1 Proof of Proposition 1

When $x\sim\mathsf{N}(0,I_{d})$ and $u,u^{\prime}\in\mathbb{S}^{d-1}$ , $\begin{pmatrix}\langle u,x\rangle\\ \langle u^{\prime},x\rangle\end{pmatrix}\sim\mathsf{N}\left(0,\begin{pmatrix}1&\langle u,u^{\prime}\rangle\\ \langle u,u^{\prime}\rangle&1\end{pmatrix}\right)$ . Thus

[TABLE]

This proves (14). Equation (15) follows directly:

[TABLE]

To obtain equations (16)-(18), we now take gradients in (113):

[TABLE]

Thus

[TABLE]

This gives (16). Finally, we perform a similar computation to compute $\partial_{t}r_{ij}=\langle\partial_{t}u_{i},u_{j}\rangle+\langle u_{i},\partial_{t}u_{j}\rangle$ . We compute only the first term, as the second term can be obtained by inverting $i$ and $j$ :

[TABLE]

Adding the symmetric term $\langle u_{i},\partial_{t}u_{j}\rangle$ , we obtain (17)-(18).
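The starting point of this proof, namely that $(\langle u,x\rangle,\langle u^{\prime},x\rangle)$ is a centered Gaussian pair with unit variances and correlation $\langle u,u^{\prime}\rangle$ , can be checked numerically. The sketch below is a simple Monte Carlo sanity check; the dimension, sample size, and helper names are arbitrary choices made for illustration.

```python
import math
import random


def unit_vector(d, rng):
    """A uniformly random point of the unit sphere (normalized Gaussian vector)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def empirical_correlation(u, v, n_samples, rng):
    """Monte Carlo estimate of E[<u, x><v, x>] for x ~ N(0, I_d)."""
    total = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(len(u))]
        total += (sum(a * b for a, b in zip(u, x))
                  * sum(a * b for a, b in zip(v, x)))
    return total / n_samples


rng = random.Random(0)
u, v = unit_vector(5, rng), unit_vector(5, rng)
inner = sum(a * b for a, b in zip(u, v))
estimate = empirical_correlation(u, v, 50_000, rng)
print(f"<u, u'> = {inner:.3f}, Monte Carlo estimate = {estimate:.3f}")
```

The estimate should agree with $\langle u,u^{\prime}\rangle$ up to Monte Carlo error of order $n^{-1/2}$ , which is why expectations of the form $\mathbb{E}[\sigma(\langle u,x\rangle)\sigma(\langle u^{\prime},x\rangle)]$ depend on $u,u^{\prime}$ only through their inner product.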

A.2 Proof of Corollary 1

First, note that in the proof of Lemma 1, we obtain the following a priori estimate on the magnitude of the $a_{i}^{0}$ ’s:

[TABLE]

where $M$ only depends on the $M_{i}$ ’s in Assumptions A1-A3. Using a similar argument as that in the proof of Proposition 2, we obtain that for any $t\in[0,T]$ and $i\in[m]$ ,

[TABLE]

and for $1\leq i\neq j\leq m$ ,

[TABLE]

Therefore, we deduce that

[TABLE]

Defining

[TABLE]

then we know that $G^{\prime}(t)\leq(M(1+t)^{2}/\varepsilon^{2})G(t)$ . Applying Grönwall’s inequality yields

[TABLE]

We have $\{\langle u_{i}(0),u_{*}\rangle\}_{i\in[m]}\sim_{\mathrm{i.i.d.}}{\mathcal{N}}(0,1/d)$ and, for any $i\in[m]$ , $\{\langle u_{i}(0),u_{j}(0)\rangle\}_{j\neq i}\sim_{\mathrm{i.i.d.}}{\mathcal{N}}(0,1/d)$ . Using standard concentration inequalities, we obtain that

[TABLE]

with probability at least $1-\exp(-C^{\prime}m)$ , where $C$ and $C^{\prime}$ are both absolute constants. Therefore,

[TABLE]

Next we upper bound the risk difference, by direct calculation,

[TABLE]

with probability at least $1-\exp(-C^{\prime}m)$ , where the constant $M$ only depends on the $M_{i}$ ’s from Assumptions A1-A3. The conclusion now follows from taking the supremum over all $t\in[0,T]$ . This completes the proof of Corollary 1.

A.3 Proof of Proposition 2

We consider $r_{ij}^{\perp}=r_{ij}-s_{i}s_{j}=\langle u_{i},u_{j}\rangle-\langle u_{i},u_{*}\rangle\langle u_{*},u_{j}\rangle$ , the inner product between the components of $u_{i}$ and $u_{j}$ orthogonal to the relevant subspace spanned by $u_{*}$ . We show that these variables satisfy the ODEs

[TABLE]

By definition of $r_{ij}^{\perp}$ , we readily see that

[TABLE]

Plugging in Eq.s (16) to (18) gives that

[TABLE]

This proves Eq. (119).

Lemma 1.

If Assumptions A1-A3 hold, then we have for any fixed $T>0$ :

[TABLE]

Proof.

To begin with, using Eq. (119), we obtain that

[TABLE]

Using the ODEs for the $a_{i}$ ’s, we obtain that

[TABLE]

where $(i)$ follows from our assumptions and the fact that $\mathscrsfs{R}(a(t),u(t))\leq\mathscrsfs{R}(a(0),u(0))$ , since $\partial_{t}\mathscrsfs{R}(a,u)\leq 0$ by gradient flow equations. Moreover, the constant $M$ only depends on the $M_{i}$ ’s. Since $|a_{i}(0)|\leq M_{1}$ for all $i\in[m]$ , we know that $|a_{i}(t)|\leq M(1+t/\varepsilon)$ for all $t\geq 0$ , thus leading to the following estimate:

[TABLE]

where the constant $M$ only depends on the $M_{i}$ ’s in our assumptions. At initialization, we know that $\sum_{i,j=1}^{m}r_{ij}^{\perp}(0)^{2}=m$ . Applying Grönwall’s inequality yields that

[TABLE]

which further implies that

[TABLE]

This completes the proof. ∎

We show that

[TABLE]

To this end, we define $S(t)=\sum_{i=1}^{m}\left(\left(a_{i}(t)-a_{i}^{\mbox{\tiny\rm mf}}(t)\right)^{2}+\left(s_{i}(t)-s_{i}^{\mbox{\tiny\rm mf}}(t)\right)^{2}\right)$ . By our assumption, $S(0)=0$ . Moreover, using the same technique as in the proof of Lemma 1, we know that $|a_{i}^{\mbox{\tiny\rm mf}}(t)|\leq M(1+t/\varepsilon)$ for all $i\in[m]$ . According to Eq.s (15)-(18) and Eq. (23), we deduce that

[TABLE]

thus leading to the following estimate:

[TABLE]

where in $(i)$ we use the Cauchy-Schwarz inequality and the inequality of arithmetic and geometric means, and $(ii)$ follows from the conclusion of Lemma 1. Similarly, we obtain that

[TABLE]

which further implies that

[TABLE]

Combining the above estimates, we finally deduce that

[TABLE]

Applying Grönwall’s inequality immediately implies

[TABLE]

which further leads to Eq. (120) and concludes the proof of Proposition 2. The “consequently” part can be shown via direct calculation, but we include it here for the sake of completeness. By definition, for any $t\in[0,T]$ we have

[TABLE]

Therefore,

[TABLE]

as desired.

A.4 Derivation of the mean field dynamics (28)

For any bounded continuous $f\in C_{b}(\mathbb{R}^{2})$ , we have

[TABLE]

where $(i)$ follows from the ODE satisfied by the $(a_{i}^{\mbox{\tiny\rm mf}}(t),s_{i}^{\mbox{\tiny\rm mf}}(t))$ ’s, and in $(ii)$ we use integration by parts. We thus obtain that

[TABLE]

which recovers Eq. (28).

A.5 Details of the alternative mean field approach

Let

[TABLE]

where $(a_{i}(t),u_{i}(t))_{1\leqslant i\leqslant m}$ is the solution of (4)–(5). $\overline{\rho}_{t}$ is a measure on $\mathbb{R}\times\mathbb{S}^{d-1}$ solving the continuity PDE

[TABLE]

where $\overline{\Psi}=(\overline{\Psi}_{a},\overline{\Psi}_{u})$ is given by

[TABLE]

A remarkable property of the equation (124) is that it preserves invariance to rotations orthogonal to $u_{*}$ . Indeed, assume that $\overline{\rho}$ is invariant to rotations orthogonal to $u_{*}$ . In this case, we show that $\overline{\Psi}_{a}\left(a,u;\overline{\rho}\right)$ and $\langle u_{*},\overline{\Psi}_{u}\left(a,u;\overline{\rho}\right)\rangle$ depend only on $s:=\langle u,u_{*}\rangle$ and $s_{1}:=\langle u_{1},u_{*}\rangle$ . Let $u^{\perp}$ (resp. $u_{1}^{\perp}$ ) denote the component of $u$ (resp. $u_{1}$ ) orthogonal to $u_{*}$ . Let $R$ denote a random uniform rotation orthogonal to $u_{*}$ . By the rotation invariance of $\overline{\rho}$ ,

[TABLE]

The random variable $B^{(d)}=\left\langle\frac{u^{\perp}}{\|u^{\perp}\|},R\frac{u_{1}^{\perp}}{\|u_{1}^{\perp}\|}\right\rangle$ is a one dimensional projection of a random variable uniform on the unit sphere of the hyperplane orthogonal to $u_{*}$ ; thus it has the density $p_{B^{(d)}}(b)\propto(1-b^{2})^{d/2-2}$ (see, e.g., [Frye and Efthimiou, 2012, Lemma 4.17]). Denote

[TABLE]

then we have

[TABLE]

Further, we compute

[TABLE]

In the equation above, we have $\langle u_{*},(I_{d}-uu^{\top})s_{1}u_{*}\rangle=s_{1}(1-s^{2})$ and as $\langle u_{*},Ru_{1}^{\perp}\rangle=0$ a.s., we have

[TABLE]

Thus we obtain

[TABLE]

Note that

[TABLE]

and thus we have

[TABLE]

Of course, a discrete measure of the form (123) cannot be invariant to rotations orthogonal to $u_{*}$ . However, if the $u_{i}$ are initialized uniformly on the unit sphere, then the measure $\overline{\rho}_{0}$ converges to a measure with the rotation invariance as $m\to\infty$ . One can then apply the results of Mei et al. [2019] to control the deviations from this limit. Let us thus assume that $\overline{\rho}_{0}$ satisfies the rotation invariance. Define the map $\varphi(a,u)=(a,\langle u,u_{*}\rangle)$ . Then, from (125), (126), the push-forward $\rho_{t}$ of $\overline{\rho}_{t}$ through the map $\varphi$ satisfies the continuity equation

[TABLE]

where $\Psi^{(d)}=(\Psi^{(d)}_{a},\Psi^{(d)}_{s})$ is given by

[TABLE]

When $d\to\infty$ , $p_{B^{(d)}}(b)\mathrm{d}b\propto(1-b^{2})^{d/2-2}\mathrm{d}b$ converges weakly to the Dirac mass $\delta_{0}(\mathrm{d}b)$ . As a consequence,

[TABLE]

Hence, in the limit $d\to\infty$ , we recover the equations (28)–(31). Moreover, if $\overline{\rho}_{0}={\rm P}_{A}\otimes\mathrm{Unif}(\mathbb{S}^{d-1})$ , then $\rho_{0}$ converges weakly to ${\rm P}_{A}\otimes\delta_{0}(\mathrm{d}s)$ as $d\to\infty$ .
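The concentration of $B^{(d)}$ at $0$ can be quantified: for the density $p_{B^{(d)}}(b)\propto(1-b^{2})^{d/2-2}$ on $[-1,1]$ , a Beta-function computation gives $\mathbb{E}[(B^{(d)})^{2}]=1/(d-1)$ . The following sketch checks this identity by midpoint quadrature; the function name and grid size are illustrative choices, and the check is restricted to $d\geq 4$ so that the integrand is bounded.

```python
def second_moment(d, n_grid=100_000):
    """E[B^2] under the density proportional to (1 - b^2)^(d/2 - 2) on [-1, 1].

    Computed with the midpoint rule; intended for d >= 4 (bounded integrand).
    """
    power = d / 2 - 2
    h = 2.0 / n_grid
    num = 0.0
    den = 0.0
    for k in range(n_grid):
        b = -1.0 + (k + 0.5) * h
        w = (1.0 - b * b) ** power
        num += b * b * w
        den += w
    return num / den


for d in (4, 10, 100, 1000):
    print(d, second_moment(d), 1.0 / (d - 1))
```

The second moment vanishes as $d\to\infty$ , consistent with the weak convergence of $p_{B^{(d)}}(b)\mathrm{d}b$ to $\delta_{0}(\mathrm{d}b)$ .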

Appendix B Calculations for the analysis of mean-field gradient flow

B.1 Solution of Eq. (83)

In order to solve the system (83), we start from an associated one-dimensional ODE.

Lemma 2.

The solution $\lambda=\lambda(t_{3})$ of the ODE

[TABLE]

with initial condition $\lambda(0)$ is

[TABLE]

Proof.

For simplicity, denote $\alpha=|\sigma_{1}|$ , $\beta=|\varphi_{1}|$ and $\gamma={|\sigma_{1}|}\left\|a_{\perp,\rm{init}}\right\|_{L^{2}(\rho)}^{2}$ . Then

[TABLE]

This is a Bernoulli differential equation (see, e.g., Encyclopedia of Mathematics). In this situation, the classical trick is to reduce the problem to a linear inhomogeneous first-order equation by considering

[TABLE]

Solving this linear inhomogeneous first-order equation gives

[TABLE]

and thus

[TABLE]

which is the claimed result. ∎
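The Bernoulli reduction used in this proof can be illustrated on a generic example. The sketch below does not reproduce Eq. (127): it takes the hypothetical equation $y^{\prime}=py-qy^{3}$ with constants $p,q>0$ , for which the substitution $v=y^{-2}$ gives the linear equation $v^{\prime}=-2pv+2q$ , and compares the resulting closed form with a Runge-Kutta integration. All names and parameter values are assumptions chosen for illustration.

```python
import math


def bernoulli_closed_form(t, y0, p, q):
    """Solution of y' = p*y - q*y**3 with y(0) = y0 > 0.

    With v = y**(-2) one gets v' = -2*p*v + 2*q, hence
    v(t) = q/p + (v(0) - q/p) * exp(-2*p*t).
    """
    v0 = y0 ** -2
    v = q / p + (v0 - q / p) * math.exp(-2.0 * p * t)
    return v ** -0.5


def rk4(f, y0, t_end, n_steps):
    """Classical fourth-order Runge-Kutta for the autonomous ODE y' = f(y)."""
    h = t_end / n_steps
    y = y0
    for _ in range(n_steps):
        k1 = f(y)
        k2 = f(y + 0.5 * h * k1)
        k3 = f(y + 0.5 * h * k2)
        k4 = f(y + h * k3)
        y += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return y


p, q, y0, t_end = 1.0, 2.0, 0.1, 3.0
numeric = rk4(lambda y: p * y - q * y ** 3, y0, t_end, 3000)
exact = bernoulli_closed_form(t_end, y0, p, q)
print(numeric, exact)
```

The two values should agree to high precision, which confirms that the change of variable linearizes the equation.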

Let $\lambda=\lambda(t_{3})$ be a solution of (127) and consider

[TABLE]

Then $a^{(-1)},s^{(1)}$ are solutions of the constrained ODE system (79), (82). Indeed,

[TABLE]

thus the constraint (79) is satisfied. Further

[TABLE]

A similar computation shows that the differential equation for $s^{(1)}$ is also satisfied. This shows that (129) is a valid candidate solution on the third time scale.

Matching.

To determine the value of the initialization $\lambda(0)$ we perform a matching procedure with the previous time scale. In this paragraph, we denote by $\underline{a},\underline{s}$ the approximation obtained in the second time scale (Section 6.2), and by $\overline{a},\overline{s}$ the approximation in the third time scale (Section 6.3 and above).

Consider an intermediate time scale $\widetilde{t}=t_{2}-c\log\frac{1}{\varepsilon}$ with $0<c<\frac{1}{4|\sigma_{1}\varphi_{1}|}$ . Assume $\widetilde{t}\asymp 1$ . Then

[TABLE]

From the approximation (76) on the second time scale,

[TABLE]

From the approximation on the third time scale,

[TABLE]

Note that as $t_{3}\to-\infty$ , from (128),

[TABLE]

Thus

[TABLE]

By matching, Equations (130) and (131) must agree. This gives

[TABLE]

and thus

[TABLE]

One could check similarly that $s^{(1)}$ also satisfies the matching conditions under the same constraint, and thus that (129) are indeed the solutions of the third time scale.

B.2 Induced approximation of the risk

In this section, we show that the behavior of $a$ and $s$ derived in Sections 6.1–6.3 leads to an evolution of the risk alternating plateaus and rapid decreases, in agreement with the standard scenario of Definition 1. For the convenience of the reader, we recall the expression (36) of the risk

[TABLE]

First time scale $t_{1}=\frac{t}{\varepsilon}$ (Section 6.1).

On this time scale, we have $a=O(1)$ and $s=O(\varepsilon)$ . Thus for all $k\geqslant 1$ , $\int a(\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)=O(\varepsilon)$ whence $\left(\varphi_{k}-\sigma_{k}\int a(\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)\right)^{2}=\varphi_{k}^{2}+O(\varepsilon)$ .

Further, using (53),

[TABLE]

Thus as $\varepsilon\to 0$ ,

[TABLE]

This describes, in a more detailed form, the first transition in Definition 1.

Second time scale $t_{2}=\frac{t}{\varepsilon^{\nicefrac{{1}}{{2}}}}$ (Section 6.2).

On this time scale, we have $a=O(1)$ and $s=O(\varepsilon^{\nicefrac{{1}}{{2}}})$ . Thus for all $k\geqslant 1$ , $\left(\varphi_{k}-\sigma_{k}\int a(\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)\right)^{2}=\varphi_{k}^{2}+O(\varepsilon^{\nicefrac{{1}}{{2}}})$ .

Further, using (62),

[TABLE]

Thus as $\varepsilon\to 0$ ,

[TABLE]

This second time scale does not induce any transition of the risk $\mathscrsfs{R}_{\mbox{\tiny\rm mf},*}$ (but was necessary to understand the divergence of $a$ and $\varepsilon^{-\nicefrac{{1}}{{2}}}s$ ).

Third time scale $t_{3}=\frac{t}{\varepsilon^{\nicefrac{{1}}{{2}}}}-\frac{1}{4|\sigma_{1}\varphi_{1}|}\log\frac{1}{\varepsilon}$ (Section 6.3).

On this time scale, we have $a=O(\varepsilon^{-\nicefrac{{1}}{{4}}})$ and $s=O(\varepsilon^{\nicefrac{{1}}{{4}}})$ . Thus for all $k\geqslant 2$ , $\left(\varphi_{k}-\sigma_{k}\int a(\omega)s(\omega)^{k}\mathrm{d}\rho(\omega)\right)^{2}=\varphi_{k}^{2}+O(\varepsilon^{\nicefrac{{1}}{{4}}})$ .

Further, using (79), (80),

[TABLE]

Finally, using (84), (85),

[TABLE]

where in $(a)$ we used (84) and in $(b)$ (85). Thus as $\varepsilon\to 0$ ,

[TABLE]

This describes, in a more detailed form, the second transition in Definition 1.

B.3 Proof of Theorem 1

Throughout the proof, we will use the shorthand $\mathscrsfs{R}_{l}(\tau)$ to represent $\mathscrsfs{R}_{l}(\widetilde{a}(\tau),\widetilde{s}(\tau))$ . First, note that according to the ODE satisfied by $\mathscrsfs{R}_{l}$ (Eq. (92)), we know that $\mathscrsfs{R}_{l}$ must be non-increasing, thus for small enough $\varepsilon>0$ ,

[TABLE]

Hence, we obtain the estimates:

[TABLE]

According to the comparison theorem for system of ODEs, we know that $|\widetilde{a}(\omega,\tau)|\leq\widehat{a}(\omega,\tau)$ , $|\widetilde{s}(\omega,\tau)|\leq\widehat{s}(\omega,\tau)$ for all $\tau\geq 0$ where

[TABLE]

and

[TABLE]

The above system of ODEs can be solved analytically via integration. First, we note that

[TABLE]

which implies that (further note $\widehat{s}(\omega,0)^{2}=l\widehat{a}(\omega,0)^{2}$ )

[TABLE]

The ODE system then reduces to $\partial_{\tau}\widehat{a}(\omega)=2l^{l/2}|\sigma_{l}||\varphi_{l}|\widehat{a}(\omega)^{l}$ , which admits the solution

[TABLE]

Since $\widehat{a}(\omega,0)=\Theta(\varepsilon^{1/2l(l+1)})$ , we know that $\widehat{a}(\omega,\tau),\widehat{s}(\omega,\tau)=o(1)$ until $\tau=\Theta(\varepsilon^{-(l-1)/2l(l+1)})-O(1)=\Theta(\varepsilon^{-(l-1)/2l(l+1)})$ , which means that $\widetilde{a}(\omega,\tau),\widetilde{s}(\omega,\tau)=o(1)$ until $\tau=\Omega(\varepsilon^{-(l-1)/2l(l+1)})$ . As a consequence,

[TABLE]

until $\tau=\Omega(\varepsilon^{-(l-1)/2l(l+1)})$ . This means that the learning of the $l$ -th component will not begin until $\tau=\Omega(\varepsilon^{-(l-1)/2l(l+1)})$ , namely $\tau(\Delta)=\Omega(\varepsilon^{-(l-1)/2l(l+1)})$ for any fixed $\Delta>0$ . Note that the above argument applies to all of the settings in the theorem statement.
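The scaling of this waiting time can be read off from the explicit solution of the comparison ODE. Writing $c$ for the constant $2l^{l/2}|\sigma_{l}||\varphi_{l}|$ , separation of variables in $\partial_{\tau}\widehat{a}=c\,\widehat{a}^{l}$ with $l\geq 2$ gives $\widehat{a}(\tau)=(\widehat{a}(0)^{-(l-1)}-(l-1)c\tau)^{-1/(l-1)}$ , which blows up at $\tau_{*}=\widehat{a}(0)^{-(l-1)}/((l-1)c)$ . The sketch below evaluates $\tau_{*}$ for an initialization of size exactly $\varepsilon^{1/(2l(l+1))}$ ; the function names and the values of $l$ , $c$ , $\varepsilon$ are illustrative assumptions.

```python
def comparison_solution(tau, a0, l, c):
    """Solution of a' = c * a**l (l >= 2), a(0) = a0 > 0, for tau < blowup_time."""
    return (a0 ** (-(l - 1)) - (l - 1) * c * tau) ** (-1.0 / (l - 1))


def blowup_time(a0, l, c):
    """Time at which the solution of a' = c * a**l started at a0 > 0 diverges."""
    return a0 ** (-(l - 1)) / ((l - 1) * c)


l, c = 3, 1.0
for eps in (1e-4, 1e-8):
    a0 = eps ** (1.0 / (2 * l * (l + 1)))
    predicted_scale = eps ** (-(l - 1) / (2 * l * (l + 1)))
    print(eps, blowup_time(a0, l, c), predicted_scale)
```

The ratio of the blow-up time to $\varepsilon^{-(l-1)/2l(l+1)}$ is the constant $1/((l-1)c)$ , independent of $\varepsilon$ , in line with the time scale stated above.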

Next, we show that for any fixed $\Delta>0$ , $\tau(\Delta)=O(\varepsilon^{-(l-1)/2l(l+1)})$ , which means that the $l$ -th component can be learnt in $O(\varepsilon^{-(l-1)/2l(l+1)})$ time. To prove our claim by contradiction, assume that there exist $\Delta>0$ and a sequence $\varepsilon_{k}\downarrow 0$ , such that

[TABLE]

By definition of $\tau(\Delta)$ , we know that $\forall\tau\leq\tau(\Delta)$ ,

[TABLE]

Now, assume the condition of setting (a) holds and denote

[TABLE]

Then by definition and our assumption that $\widetilde{a}(\omega,0)$ is of the same order as $\widetilde{s}(\omega,0)$ , we know that $A=\cup_{\varepsilon_{0}>0,\eta>0}A_{\varepsilon_{0},\eta}$ . Since $\rho(A)>0$ , there exist $\varepsilon_{0},\eta>0$ such that $\rho(A_{\varepsilon_{0},\eta})>0$ . Note that here we can choose $\varepsilon_{0}$ and $\eta$ to be arbitrarily small since the set $A_{\varepsilon_{0},\eta}$ is non-increasing in $\varepsilon_{0}$ and $\eta$ . For $\omega\in A_{\varepsilon_{0},\eta}$ and $\tau\leq\tau(\Delta)$ , we have

[TABLE]

Moreover, we know that at initialization, $|\widetilde{a}(\omega,0)|,|\widetilde{s}(\omega,0)|>\eta\varepsilon^{1/2l(l+1)}$ . Using the ODE comparison theorem and an argument similar to the one proving $\tau(\Delta)=\Omega(\varepsilon^{-(l-1)/2l(l+1)})$ , we deduce that for sufficiently large $k$ such that $\varepsilon=\varepsilon_{k}<\varepsilon_{0}$ , there exist constants $C,C^{\prime}>0$ that do not depend on $\varepsilon$ satisfying the following: For all $\omega\in A_{\varepsilon_{0},\eta}$ and $\tau\geq C\varepsilon^{-(l-1)/2l(l+1)}$ ,

[TABLE]

This further implies that at time $\tau$ ,

[TABLE]

According to Eq. (92), we know that $\mathscrsfs{R}_{l}$ will decrease to $0$ exponentially fast in an $O(1)$ time window after $\tau=C\varepsilon^{-(l-1)/2l(l+1)}$ , which contradicts our assumption (136). This proves that $\tau(\Delta)=O(\varepsilon^{-(l-1)/2l(l+1)})$ under setting (a). Next, we show that setting (b) can be reduced to setting (a). Under setting (b), let us denote

[TABLE]

Then, as in the previous argument, there exist $\varepsilon_{0},\eta>0$ such that $\rho(B_{\varepsilon_{0},\eta})>0$ , and further we can choose $\varepsilon_{0}$ and $\eta$ to be arbitrarily small. For $\omega\in B_{\varepsilon_{0},\eta}$ , we have

[TABLE]

Hence, both $\widetilde{a}(\omega)^{2}$ and $\widetilde{s}(\omega)^{2}$ will decrease at initialization. Moreover, Eq. (91) implies that

[TABLE]

Integrating both sides of the above equation, we obtain that

[TABLE]

which is close to $\left(\widetilde{s}(\omega,0)^{2}-\widetilde{s}(\omega,\tau)^{2}\right)/l$ as long as $\widetilde{s}(\omega,\tau)=O(1)$ . To be precise, let us define

[TABLE]

then we know that $\widetilde{s}(\omega,\tau_{a,\omega})=\Omega(\varepsilon^{1/2l(l+1)})$ and $\tau_{a,\omega}=O(\varepsilon^{-(l-1)/2l(l+1)})$ under the assumption (136), where the latter claim can be proved through making the change of variable $\widetilde{a}^{\prime}(\omega)=\varepsilon^{-1/2l(l+1)}\widetilde{a}(\omega)$ and $\widetilde{s}^{\prime}(\omega)=\varepsilon^{-1/2l(l+1)}\widetilde{s}(\omega)$ . Note that after the time point $\tau_{a,\omega}$ , the sign of $\widetilde{a}(\omega)$ changes. Hence, $\varphi_{l}\sigma_{l}\widetilde{a}(\omega)\widetilde{s}(\omega)^{l}>0$ , and $\widetilde{a}(\omega,\tau)^{2}$ and $\widetilde{s}(\omega,\tau)^{2}$ will begin to increase for $\tau\geq\tau_{a,\omega}$ . Similarly, we can show that in $O(\varepsilon^{-(l-1)/2l(l+1)})$ time after $\tau_{a,\omega}$ , both $\widetilde{a}(\omega)$ and $\widetilde{s}(\omega)$ become of order $\varepsilon^{1/2l(l+1)}$ , and we still have $\varphi_{l}\sigma_{l}\widetilde{a}(\omega)\widetilde{s}(\omega)^{l}>0$ . This reduces our case $(b)$ to case $(a)$ .

We have proven that under settings (a) and (b), $\tau(\Delta)=\Theta(\varepsilon^{-(l-1)/2l(l+1)})$ for any fixed $\Delta\in(0,\varphi_{l}^{2}/2)$ . This means that some of the neurons $(\widetilde{a}(\omega),\widetilde{s}(\omega))$ become of order $\Omega(1)$ and the $l$ -th component of the target function is learnt at a timescale of order $\varepsilon^{-(l-1)/2l(l+1)}$ . Next, we show that if the probability measure $\rho$ is discrete, then the evolution of $\mathscrsfs{R}_{l}$ actually happens in an $O(1)$ time window. It suffices to prove that, for any $\Delta>0$ a small constant ( $\Delta<\varphi_{l}^{2}/4$ ),

[TABLE]

as $\varepsilon\to 0$ . Note that by continuity and monotonicity of $\mathscrsfs{R}_{l}$ , we have

[TABLE]

By definition of $\mathscrsfs{R}_{l}$ , we know that $\forall\tau\geq\tau(\varphi_{l}^{2}/2-\Delta)$ ,

[TABLE]

Denote by $\{(\widetilde{a}_{i},\widetilde{s}_{i})\}_{i\in[m]}$ the realizations of $\{(\widetilde{a}(\omega),\widetilde{s}(\omega))\}_{\omega\in\Omega}$ under the discrete measure $\rho$ , and by $\{p_{i}\}_{i\in[m]}$ the point masses of $\rho$ . Then, we know that

[TABLE]

which implies that $\exists j\in[m]$ , s.t. $\left|\widetilde{a}_{j}(\tau)\widetilde{s}_{j}(\tau)^{l}\right|\geq r_{l}(\Delta)$ . Applying Lemma 3 yields

[TABLE]

It then follows from Eq. (92) that $\mathscrsfs{R}_{l}$ will decrease to $0$ exponentially fast, and Eq. (140) holds consequently. This completes the proof for settings (a) and (b).

We now turn to case (c). By our assumption, for almost every $\omega$ there exists $\eta>0$ (which may depend on $\omega$ ) such that

[TABLE]

for sufficiently small $\varepsilon$ . Therefore, $\widetilde{s}(\omega,\tau)^{2}$ and $\widetilde{a}(\omega,\tau)^{2}$ will keep decreasing until one of them reaches $0$ , which means that

[TABLE]

According to Eq. (139) and the inequality $\widetilde{s}(\omega,0)^{2}<(l-\eta)\widetilde{a}(\omega,0)^{2}$ , $\widetilde{a}(\omega,\tau)^{2}$ will not reach $0$ until $\widetilde{s}(\omega,\tau)^{2}$ reaches $0$ . Furthermore, for any $\tau\geq 0$ ,

[TABLE]

thus leading to

[TABLE]

Using again the comparison theorem for ODE, we get that

[TABLE]

Since $\widetilde{s}(\omega,0)\asymp\varepsilon^{1/2l(l+1)}$ , it follows immediately that for any $\Delta>0$ , there exists a constant $C_{*}(\omega,\Delta)>0$ such that

[TABLE]

This completes the discussion for case (c), thus concluding the proof of Theorem 1.

Lemma 3.

Let $r>0$ be a constant that does not depend on $\varepsilon$ . Then there exists a constant $c=c(l,r)>0$ that only depends on $l$ and $r$ such that the following holds: For any $a>0$ , $s>0$ satisfying $as^{l}\geq r$ and $\varepsilon^{2\beta_{l}}s^{2}\leq 1$ , we have

[TABLE]

Proof.

If $s\geq 1$ , then we immediately get

[TABLE]

Otherwise, $1-\varepsilon^{2\beta_{l}}s^{2}\geq 1/2$ , and consequently

[TABLE]

where the last line follows from the AM-GM inequality. This completes the proof. ∎

Appendix C Proofs of Theorems 2 and 3: learning with projected SGD

We prove Theorem 2, which bounds the distance between GF and projected SGD, in Subsections C.1 through C.3; Subsection C.4 is devoted to the proof of Theorem 3. Throughout this section, we use $M$ to refer to any constant that only depends on the $M_{i}$ ’s from Assumptions A1-A3, and whose value can change from line to line. We start with an elementary lemma that establishes the Lipschitz continuity of the gradient flow trajectory:

Lemma 4 (A priori estimate).

There exists a constant $M$ that only depends on the $M_{i}$ ’s, such that for all $t\geq 0$ , $\rho_{t}$ is supported on $[-M(1+t/\varepsilon),M(1+t/\varepsilon)]\times\mathbb{S}^{d-1}$ , namely $|a_{i}(t)|\leq M(1+t/\varepsilon)$ for all $i\in[m]$ . Moreover, for any $0\leq s\leq t$ , we have

[TABLE]

Proof.

First, notice that along the trajectory of gradient flow, the risk must be non-increasing. In fact, we have

[TABLE]

Therefore, we obtain that

[TABLE]

where the last line follows from our assumption. Since $|a_{i}(0)|\leq M$ , we know that $|a_{i}(t)|\leq M(1+t/\varepsilon)$ , and $|a_{i}(t)-a_{i}(s)|\leq\varepsilon^{-1}M(t-s)$ . Moreover, according to Eq. (5), we have

[TABLE]

thus leading to

[TABLE]

This completes the proof. ∎

In what follows we define two discretized versions of Eq.s (4) and (5), namely the gradient descent (GD) and stochastic gradient descent (SGD) dynamics. They will serve as important intermediate objects for our proof.

•

Gradient descent: Let $\eta>0$ be the step size, and let the initialization be the same as gradient flow: $(\tilde{a}_{i}(0),\tilde{u}_{i}(0))=(a_{i}(0),u_{i}(0))$ for all $i\in[m]$ . We have for $k\in\mathbb{N}$ ,

[TABLE]

where we recall from Eq.s (102) and (103):

[TABLE]

By convention, we have $V(s)=V(s;1,1)$ and $U(s)=U(s;1,1)$ for $s\in[-1,1]$ .

•

One-pass stochastic gradient descent: Under the same choice of step size and initialization, let $\{(x_{k},y_{k})\}_{k\in\mathbb{N}^{*}}$ be i.i.d. samples from $\mathrm{P}\in\mathscr{P}(\mathbb{R}^{d}\times\mathbb{R})$ , where

[TABLE]

The iteration equations for one-pass SGD read:

[TABLE]

Note that Eq. (151) can also be written as:

[TABLE]

C.1 Difference between GF and GD

For notational simplicity, we denote $\theta_{i}(t)=(a_{i}(t),u_{i}(t))$ for $i\in[m]$ and $t\geq 0$ , and

[TABLE]

Similarly, $\tilde{\theta}_{i}(k)=(\tilde{a}_{i}(k),\tilde{u}_{i}(k))$ , and

[TABLE]

Moreover, for $\theta=(a,u)$ and $\rho\in\mathscr{P}(\mathbb{R}\times\mathbb{R}^{d})$ , we define the following two functionals:

[TABLE]

and $H_{\varepsilon}(\theta,\rho)=(\varepsilon^{-1}F(\theta,\rho),G(\theta,\rho))$ . Then, Eq.s (4) and (5) and Eq. (150) can be rewritten as

[TABLE]

respectively. The lemma below will be used several times in the proof.

Lemma 5.

Let $\rho^{(m)}=(1/m)\sum_{i=1}^{m}\delta_{\theta_{i}}$ and $\rho^{\prime(m)}=(1/m)\sum_{i=1}^{m}\delta_{\theta^{\prime}_{i}}$ . If $\left\|{u_{i}}\right\|_{2}\leq C$ and $\left\|{u^{\prime}_{i}}\right\|_{2}\leq C$ for all $i\in[m]$ (where $C$ is any fixed absolute constant; for example, here we can take $C=2$ ), then we have

[TABLE]

where the constant $M$ only depends on the $M_{i}$ ’s. As a consequence, we obtain that

[TABLE]

Proof.

First, by triangle inequality, we have

[TABLE]

Second, using again triangle inequality, we deduce that

[TABLE]

where $(i)$ follows from the inequality $\left\|{u_{i}u_{i}^{\top}-(u^{\prime}_{i})(u^{\prime}_{i})^{\top}}\right\|_{\mathrm{op}}\leq 2C\left\|{u_{i}-u^{\prime}_{i}}\right\|_{2}$ , which is a result of the following direct calculation:

[TABLE]

This completes the proof of Lemma 5, since the “as a consequence” part follows naturally from the upper bounds obtained earlier. ∎
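The outer-product inequality used in step $(i)$ follows from the decomposition $uu^{\top}-u^{\prime}u^{\prime\top}=u(u-u^{\prime})^{\top}+(u-u^{\prime})u^{\prime\top}$ and can be sanity-checked numerically. The sketch below tests the slightly stronger statement with the Frobenius norm, which dominates the operator norm, on random vectors of norm at most $C$ ; the helper names, dimension, and number of trials are illustrative assumptions.

```python
import math
import random


def frobenius_gap(u, v):
    """Frobenius norm of u u^T - v v^T (an upper bound on its operator norm)."""
    return math.sqrt(sum((ui * uj - vi * vj) ** 2
                         for ui, vi in zip(u, v)
                         for uj, vj in zip(u, v)))


def random_in_ball(d, radius, rng):
    """Random vector with Gaussian direction and norm at most `radius`."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g))
    scale = radius * rng.random() / norm
    return [scale * c for c in g]


def bound_holds(n_trials, d, C, seed=0):
    """Check ||u u^T - v v^T|| <= 2 C ||u - v|| on random pairs in the C-ball."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        u, v = random_in_ball(d, C, rng), random_in_ball(d, C, rng)
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
        if frobenius_gap(u, v) > 2 * C * dist + 1e-12:
            return False
    return True


print(bound_holds(1000, 6, 2.0))
```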

Lemma 6.

Following the notation and assumption of Lemma 5, we have

[TABLE]

Proof.

By definition of the risk function and triangle inequality, we deduce that

[TABLE]

This concludes the proof. ∎

First, let us define the error function

[TABLE]

and the stopping time $T_{\Delta}=\inf\{t\geq 0:\Delta(t)\geq 1\}$ . For $k\in\mathbb{N}$ and $t=k\eta\leq T_{\Delta}$ , we have the following estimate:

[TABLE]

For any $s\in[0,t]$ , by Lemmas 4 and 5 we have (denote $[s]=\eta\lfloor s/\eta\rfloor$ , and notice that we can take $C=2$ since $t\leq T_{\Delta}$ )

[TABLE]

Using again Lemmas 4 and 5, we obtain that

[TABLE]

thus leading to

[TABLE]

For $s\leq t\leq T_{\Delta}$ , we have $\Delta(s)^{2}\leq\Delta(s)$ . Hence,

[TABLE]

Applying Grönwall’s inequality yields

[TABLE]

Therefore, for all $T\geq 0$ and $\eta\leq 1/(M\exp((\varepsilon^{-1}+1)MT(1+\varepsilon^{-1}T)^{2}))$ , we have

[TABLE]

This proves $T\leq T_{\Delta}$ , and consequently

[TABLE]

which immediately implies that

[TABLE]

Finally, with the aid of Lemma 6, we get the following upper bound on the difference between the risk of gradient flow and gradient descent:

[TABLE]

To summarize, we have the following:

Theorem 4 (Difference between GF and GD).

There exists a constant $M$ that only depends on the $M_{i}$ ’s, such that for any $T\geq 0$ and

[TABLE]

the following holds for all $t\in[0,T]$ :

[TABLE]
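As a toy illustration of the kind of discretization error controlled here (and not of the specific bound of Theorem 4), consider the one-dimensional flow $x^{\prime}=-x$ with $x(0)=1$ and its explicit Euler (gradient descent) discretization with step size $\eta$ . The sketch below, whose function name and parameters are hypothetical, measures the gap at time $T$ between the two trajectories.

```python
import math


def euler_gap(eta, T=1.0):
    """|x_GD(T) - x_GF(T)| for the toy flow x' = -x, x(0) = 1, with Euler step eta."""
    n_steps = round(T / eta)
    x = 1.0
    for _ in range(n_steps):
        x -= eta * x
    return abs(x - math.exp(-T))


for eta in (0.02, 0.01, 0.005):
    print(eta, euler_gap(eta))
```

In this example, halving $\eta$ roughly halves the gap, the first-order behavior one expects from a Grönwall-type argument when the vector field is Lipschitz.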

C.2 Difference between GD and SGD

The proof for this section is almost identical to Appendix C.5 in [Mei et al., 2019]. The only difference is that here we need to verify that $(I_{d}-uu^{\top})\sigma^{\prime}(\langle u,x\rangle)x$ is an $M_{3}$ -sub-Gaussian random vector. This follows from the identity $(I_{d}-uu^{\top})x=x-\langle u,x\rangle u$ and Assumption A3. We thus obtain the following interpolation bound between GD and SGD:

Theorem 5 (Difference between GD and SGD).

There exists a constant $M$ that only depends on the $M_{i}$ ’s, such that for any $T,z\geq 0$ and

[TABLE]

the following happens with probability at least $1-\exp(-z^{2})$ : For all $t\in[0,T]$ , we have

[TABLE]

C.3 Difference between SGD and projected SGD

The aim of this section is to prove a coupling bound between the trajectory of SGD and that of projected SGD, thus finally leading to an upper bound on the difference between the risk of projected gradient flow and projected SGD. To begin with, let us fix $T,z\geq 0$ and choose

[TABLE]

as in Theorem 2, where $M$ is a large enough constant (to be determined later). Define

[TABLE]

then for $k\leq\min(T,T_{\theta})/\eta$ and $i\in[m]$ , we have (note that here $t=k\eta$ )

[TABLE]

Denoting $\mathcal{F}_{k}=\sigma(\bar{\theta}(0),z_{1},\cdots,z_{k})$ , we know from Assumption A3 that, conditioning on $\mathcal{F}_{k}$ , $\sigma^{\prime}(\langle\bar{u}_{i}(k),x_{k+1}\rangle)x_{k+1}$ is an $M_{3}$ -sub-Gaussian random vector. By well-known results on Euclidean norm of sub-Gaussian random vectors (see, e.g., Jin et al. [2019]), we know that there exists a constant $M$ satisfying

[TABLE]

Choosing $\delta=\eta\exp(-z^{2})/(mT)$ and applying a union bound gives

[TABLE]

Therefore, with probability at least $1-\exp(-z^{2})$ , for all $k\leq\min(T,T_{\theta})/\eta$ and $i\in[m]$ , we have

[TABLE]

The above bound also holds for the trajectory of SGD, namely after replacing $\overline{\rho}^{(m)}(k)$ with $\underline{\rho}^{(m)}(k)$ . Now, let us define the approximation error $\Delta_{i}(k)=\underline{u}_{i}(k)-\overline{u}_{i}(k)$ for $i\in[m]$ and $k\in\mathbb{N}$ , then we get the following decomposition:

[TABLE]

where $Z_{i}(k+1)=\Delta_{i}(k+1)-\Delta_{i}(k)-\mathbb{E}\left[\Delta_{i}(k+1)-\Delta_{i}(k)|\mathcal{F}_{k}\right]$ has zero mean. With our choice of $\eta$ , one can verify that as long as $\max(d,m,z)\to\infty$ , Lemma 7 is applicable to

[TABLE]

Hence, we deduce from the definition of $\Delta_{i}(k)$ that

[TABLE]

thus leading to the following estimate:

[TABLE]

where $(i)$ is due to the fact that $u_{1},u_{2}\in\sigma(\mathcal{F}_{k})$ , and $\left\|{u_{1}u_{1}^{\top}-u_{2}u_{2}^{\top}}\right\|_{\mathrm{op}}\leq C\left\|{u_{1}-u_{2}}\right\|_{2}$ . According to the definition of $\widehat{G}_{i}$ , we obtain that
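One way to see the operator-norm inequality used in $(i)$ is the following short derivation. It is a sketch under the assumption that $\|u_{1}\|_{2}$ and $\|u_{2}\|_{2}$ are bounded by a constant, so that $C$ can be taken as any bound on their sum; the source does not spell out this step.

```latex
\begin{aligned}
u_{1}u_{1}^{\top}-u_{2}u_{2}^{\top}
  &= u_{1}(u_{1}-u_{2})^{\top}+(u_{1}-u_{2})u_{2}^{\top},\\
\left\|u_{1}u_{1}^{\top}-u_{2}u_{2}^{\top}\right\|_{\mathrm{op}}
  &\leq \left(\|u_{1}\|_{2}+\|u_{2}\|_{2}\right)\|u_{1}-u_{2}\|_{2}.
\end{aligned}
```

The second line uses the triangle inequality together with $\|xy^{\top}\|_{\mathrm{op}}=\|x\|_{2}\|y\|_{2}$ for rank-one matrices.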

[TABLE]

thus leading to (using the same argument as in the proof of Lemma 5)

[TABLE]

and

[TABLE]

Moreover, by (conditional) sub-Gaussianity of the $\widehat{G}_{i}$ ’s, we know that

[TABLE]

Combining the above estimates, it then follows that

[TABLE]

Using the same proof technique as in Appendix C.5 of Mei et al. [2019], we conclude that

[TABLE]

As in the proof of Theorem 4, we define

[TABLE]

Then, for $l\leq\min(T,T_{\theta},T_{\Delta})/\eta$ , we have

[TABLE]

Proceeding with the same argument, it follows that

[TABLE]

Therefore, we finally conclude that

[TABLE]

Applying Grönwall’s inequality (discrete version) yields that

[TABLE]

as long as $\max(d,m,z)\to\infty$ with $T=O(1)$ . Note that the above inequality holds for all $l\in[0,\min(T,T_{\theta},T_{\Delta})/\eta]\cap\mathbb{N}$ with probability at least $1-\exp(-z^{2})$ , which further implies that $T_{\theta},T_{\Delta}\geq T$ , and consequently

[TABLE]
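The discrete Grönwall step can be illustrated on a generic recursion. The sketch below does not reproduce the recursion of this proof (it is not recoverable here); it assumes a hypothetical scalar sequence $a_{l+1}=(1+\eta K)a_{l}+\eta b$ with $a_{0}=0$, and checks the standard conclusion $a_{l}\leq (b/K)(e^{K\eta l}-1)$. The constants `K` and `b` are placeholders.

```python
import math

def gronwall_check(K: float, b: float, eta: float, steps: int) -> bool:
    """Iterate a_{l+1} = (1 + eta*K) * a_l + eta*b from a_0 = 0 and compare
    each iterate with the discrete Gronwall bound (b/K) * (exp(K*eta*l) - 1)."""
    a = 0.0
    for l in range(steps + 1):
        bound = (b / K) * math.expm1(K * eta * l)
        # Small tolerances absorb floating-point rounding only.
        if a > bound * (1 + 1e-12) + 1e-15:
            return False
        a = (1 + eta * K) * a + eta * b
    return True

# Hypothetical constants; eta * steps plays the role of the time horizon T.
ok = gronwall_check(K=3.0, b=0.5, eta=1e-3, steps=2000)
```

The bound depends on $\eta$ and $l$ only through the elapsed time $\eta l$, which is why the estimate stays controlled when $T=O(1)$.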

Applying again Lemma 6, we deduce that

[TABLE]

Combining the above estimates gives the following:

Theorem 6 (Difference between SGD and projected SGD).

There exists a constant $M$ that depends only on the $M_{i}$’s, such that for any $T,z\geq 0$ and

[TABLE]

the following holds with probability at least $1-\exp(-z^{2})$: for all $t\in[0,T]$, we have

[TABLE]

Theorem 2 then follows as a result of combining Theorem 4, Theorem 5, and Theorem 6.

Lemma 7.

Let $v_{1}=u_{1}+\eta(I_{d}-u_{1}u_{1}^{\top})g_{1}$ , $v_{2}=\operatorname{Proj}_{\mathbb{S}^{d-1}}(u_{2}+\eta g_{2})$ , where $\left\|{u_{2}}\right\|_{2}=1$ and $\eta\left\|{g_{2}}\right\|_{2}\leq 1/2$ . Then we have

[TABLE]

Proof.

Using Taylor expansion, we know that

[TABLE]

which implies

[TABLE]

The proof is completed by noting that

[TABLE]

∎
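As a numerical illustration of the mechanism behind Lemma 7, the sketch below compares the tangent step $u+\eta(I_{d}-uu^{\top})g$ with the projected step $\operatorname{Proj}_{\mathbb{S}^{d-1}}(u+\eta g)$ in the special case $u_{1}=u_{2}=u$ and $g_{1}=g_{2}=g$. It does not reproduce the lemma's bound, which is not recoverable here. It only checks that the two updates agree up to second order in $\eta\|g\|_{2}$ when $\eta\|g\|_{2}\leq 1/2$; the constant $3$ used in the check comes from a separate elementary calculation and is an assumption of this sketch, not a constant from the paper.

```python
import math
import random

def tangent_step(u, g, eta):
    """u + eta * (I - u u^T) g: step along the tangent space, no renormalization."""
    ug = sum(ui * gi for ui, gi in zip(u, g))
    return [ui + eta * (gi - ug * ui) for ui, gi in zip(u, g)]

def projected_step(u, g, eta):
    """Proj_{S^{d-1}}(u + eta * g): Euclidean step followed by renormalization."""
    w = [ui + eta * gi for ui, gi in zip(u, g)]
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]

def worst_ratio(d=20, trials=200, seed=0):
    """Largest observed ||v1 - v2|| / (eta * ||g||)^2 when u1 = u2 and g1 = g2."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        u = [rng.gauss(0, 1) for _ in range(d)]
        nu = math.sqrt(sum(x * x for x in u))
        u = [x / nu for x in u]                 # unit vector on the sphere
        g = [rng.gauss(0, 1) for _ in range(d)]
        ng = math.sqrt(sum(x * x for x in g))
        eta = rng.uniform(0.01, 0.5) / ng       # enforces eta * ||g|| <= 1/2
        v1, v2 = tangent_step(u, g, eta), projected_step(u, g, eta)
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
        worst = max(worst, diff / (eta * ng) ** 2)
    return worst

ratio = worst_ratio()
```

The ratio stays bounded as $\eta\|g\|_{2}$ varies, consistent with a discrepancy of order $\eta^{2}\|g\|_{2}^{2}$ between the two update rules per step.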

C.4 Proof of Theorem 3

By our assumption, we know that the standard learning scenario holds up to level $L$ , and that

[TABLE]

Then, according to Definition 1, there exist $\varepsilon_{*}=\varepsilon_{*}(\delta)$ and $T_{0}=T_{0}(\delta)$ such that for all $\varepsilon\leq\varepsilon_{*}$ and $T=T(\varepsilon,\delta)=T_{0}(\delta)\varepsilon^{1/(2L)}$, one has

[TABLE]

Moreover, from Section 4 we know that with probability at least $1-e^{-C^{\prime}m}$ over the i.i.d. initialization,

[TABLE]

where $M^{\prime}$ depends only on $(\sigma,\varphi,{\rm P}_{A})$. Now we choose $\varepsilon\leq\varepsilon_{*}$ and $T=T(\varepsilon,\delta)=T_{0}(\delta)\varepsilon^{1/(2L)}$. It then follows that

[TABLE]

According to Theorem 2, we know that with probability at least $1-\exp(-z)$ ,

[TABLE]

with $n=T/\eta=T(\varepsilon,\delta)/\eta$ . We now take

[TABLE]

Then, by our choice of $m$ and $d$, we know that $\mathscr{R}(a(T),u(T))\leq 2\delta/3$. Further, taking

[TABLE]

we obtain that

[TABLE]

The above holds with probability at least $1-\exp(-C^{\prime}m)-\exp(-z)$. Hence, the conclusion follows from the assumption $m\geq z$.

Appendix D Counterexamples to the standard learning scenario

D.1 Case 1: $\sigma_{k}=0$ for some $k\in\mathbb{N}$

For any fixed $(a,u)=(a_{i},u_{i})_{1\leq i\leq m}$ , we have

[TABLE]

Moreover, the risk is always lower bounded by

[TABLE]

where $(i)$ follows from orthogonality between $\mathrm{He}_{k}(\langle u_{*},x\rangle)$ and $f(x;a,u)$ .
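The orthogonality invoked in $(i)$ rests on the orthogonality of Hermite polynomials under the Gaussian measure. The sketch below verifies the one-dimensional statement $\mathbb{E}[\mathrm{He}_{j}(G)\mathrm{He}_{k}(G)]=k!\,\mathbf{1}_{j=k}$ exactly, in integer arithmetic. It assumes the unnormalized probabilists' convention for $\mathrm{He}_{k}$, which may differ from the paper's normalization by a factor $\sqrt{k!}$. All function names are hypothetical.

```python
from math import factorial

def hermite_coeffs(n):
    """Coefficients (lowest degree first) of the probabilists' Hermite polynomial
    He_n, via the recurrence He_{k+1}(x) = x * He_k(x) - k * He_{k-1}(x)."""
    prev, cur = [1], [0, 1]          # He_0 = 1, He_1 = x
    if n == 0:
        return prev
    for k in range(1, n):
        shifted = [0] + cur                                  # x * He_k
        padded = prev + [0] * (len(shifted) - len(prev))     # He_{k-1}, padded
        prev, cur = cur, [s - k * p for s, p in zip(shifted, padded)]
    return cur

def gaussian_moment(p):
    """E[G^p] for G ~ N(0, 1): zero for odd p, (p - 1)!! for even p."""
    if p % 2 == 1:
        return 0
    result = 1
    for j in range(1, p, 2):
        result *= j
    return result

def hermite_inner(j, k):
    """Exact value of E[He_j(G) He_k(G)] using integer arithmetic."""
    cj, ck = hermite_coeffs(j), hermite_coeffs(k)
    return sum(a * b * gaussian_moment(p + q)
               for p, a in enumerate(cj) for q, b in enumerate(ck))

# Gram matrix of He_0, ..., He_5 under the standard Gaussian measure.
gram = [[hermite_inner(j, k) for k in range(6)] for j in range(6)]
```

For unit vectors $u,u_{*}$ the same property extends to $\mathbb{E}[\mathrm{He}_{j}(\langle u,x\rangle)\mathrm{He}_{k}(\langle u_{*},x\rangle)]=k!\,\langle u,u_{*}\rangle^{k}\mathbf{1}_{j=k}$ in this convention. When $\sigma_{k}=0$, every neuron $\sigma(\langle u_{i},x\rangle)$ has no degree-$k$ Hermite component and is therefore orthogonal to $\mathrm{He}_{k}(\langle u_{*},x\rangle)$, whatever the direction $u_{i}$.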

D.2 Case 2: $\varphi_{0}=\cdots=\varphi_{k}=0$ for some $k\geq 1$

We consider the reduced mean-field equations (23):

[TABLE]

Note that if $\varphi_{0}=\varphi_{1}=0$ , then $V^{\prime}(s)=s\cdot v(s)$ for some continuous function $v$ . Denoting $a=(a_{1},\cdots,a_{m})$ and $s=(s_{1},\cdots,s_{m})^{\top}$ , the above equation regarding the evolution of the $s_{i}$ ’s can be written as

[TABLE]

where $A(a,s)$ is a matrix-valued function satisfying

[TABLE]

Using an a priori estimate similar to the one in the proof of Lemma 1, we can show that

[TABLE]

for any finite time $T$, which immediately implies that $s(t)\equiv 0$ for $t\in[0,T]$. Therefore, no component of $\varphi$ with degree $\geq 1$ can be learned.

D.3 Case 3: $\varphi_{k}=0$ for some $k\geq 1$

We may assume $\sigma_{k}\neq 0$ , and analyze the simplified ODE system (91), which reduces to

[TABLE]

We thus obtain the following equations:

[TABLE]

which means that for any $\tau\geq 0$ ,

[TABLE]

Therefore, most of the neurons cannot evolve to magnitude $\Omega(1)$ in the process of learning the $k$-th component, and hence fail to provide an effective initialization for learning the next component $\varphi_{k+1}$.

Bibliography

Abbe et al. [2022] Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz. The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks. In Conference on Learning Theory, pages 4782–4887. PMLR, 2022.
Ambrosio et al. [2005] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005.
Arnaboldi et al. [2023] Luca Arnaboldi, Ludovic Stephan, Florent Krzakala, and Bruno Loureiro. From high-dimensional & mean-field dynamics to dimensionless ODEs: A unifying approach to SGD in two-layers networks. arXiv preprint arXiv:2302.05882, 2023.
Arpit et al. [2017] Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In International Conference on Machine Learning, pages 233–242. PMLR, 2017.
Ba et al. [2022] Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. In Advances in Neural Information Processing Systems, 2022.
Baldi and Hornik [1989] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
Barak et al. [2022] Boaz Barak, Benjamin L Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang. Hidden progress in deep learning: SGD learns parities near the computational limit. arXiv:2207.08799, 2022.
Bartlett et al. [2021] Peter L Bartlett, Andrea Montanari, and Alexander Rakhlin. Deep learning: a statistical viewpoint. Acta Numerica, 30:87–201, 2021.
