Stochastic Gradient Descent in the Saddle-to-Saddle Regime of Deep Linear Networks
Guillaume Corlouer, Avi Semler, Alexander Strang, Alexander Gietelink Oldenziel

TL;DR
This paper analyzes how stochastic gradient descent influences the training dynamics of deep linear networks, revealing that noise encodes feature learning progression without changing the saddle-to-saddle regime.
Contribution
It provides an exact stochastic differential equation decomposition for SGD in DLNs and characterizes the stationary distribution, highlighting the role of noise in feature learning.
Findings
Maximal diffusion along a mode occurs before the feature is fully learned.
In the absence of label noise, the stationary distribution of SGD matches that of gradient flow.
With label noise, the distribution approximates a Boltzmann distribution.
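The first finding can be made concrete in a toy setting. The sketch below is not the paper's derivation: it assumes a two-layer scalar linear network f(x) = w2 * w1 * x with balanced weights w1 = w2, noiseless labels y = s * x, and a per-sample squared loss. All names (`per_sample_grad_variance`, `a_grid`, `s`) are hypothetical. In this toy model the variance of the per-sample gradient is proportional to a * (s - a)^2, where a = w1 * w2 is the mode strength, so it vanishes at the saddle (a = 0) and at the solution (a = s) and peaks analytically at a = s/3, before the feature is fully learned.

```python
import numpy as np

def per_sample_grad_variance(a, s, x):
    """Variance over samples of d(loss_i)/d(w1) at balanced weights.

    Toy assumption: loss_i = 0.5 * (w2 * w1 * x_i - s * x_i)**2 with
    w1 = w2 = sqrt(a), so the mode strength is a = w1 * w2.
    """
    w = np.sqrt(a)
    grads = (a - s) * x**2 * w  # per-sample gradient w.r.t. w1
    return grads.var()

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)      # hypothetical Gaussian inputs
s = 2.0                          # hypothetical target singular value

a_grid = np.linspace(0.0, s, 601)
variances = np.array([per_sample_grad_variance(a, s, x) for a in a_grid])
a_peak = a_grid[np.argmax(variances)]

print(f"gradient-noise variance peaks at a = {a_peak:.3f} (s = {s})")
```

The vanishing variance at a = s also illustrates, for this toy model only, why SGD without label noise has nothing left to diffuse once the mode is learned.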
Abstract
Deep linear networks (DLNs) are used as an analytically tractable model of the training dynamics of deep neural networks. While gradient descent in DLNs is known to exhibit saddle-to-saddle dynamics, the impact of stochastic gradient descent (SGD) noise on this regime remains poorly understood. We investigate the dynamics of SGD during training of DLNs in the saddle-to-saddle regime. We model the training dynamics as stochastic Langevin dynamics with anisotropic, state-dependent noise. Under the assumption of aligned and balanced weights, we derive an exact decomposition of the dynamics into a system of one-dimensional per-mode stochastic differential equations. This establishes that the maximal diffusion along a mode precedes the corresponding feature being completely learned. We also derive the stationary distribution of SGD for each mode: in the absence of label noise, its marginal…
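As a point of reference for the saddle-to-saddle regime mentioned above, here is a minimal sketch of the noiseless case only. It assumes a two-layer network with aligned, balanced weights, for which gradient flow on each mode strength a_i reduces to the standard ODE da_i/dt = 2 * a_i * (s_i - a_i). The function name `simulate_modes`, the step size, and the singular values are illustrative choices, not taken from the paper, and the SGD noise terms the paper studies are not modeled.

```python
import numpy as np

DT = 1e-3  # Euler step size (illustrative)

def simulate_modes(s, a0=1e-6, dt=DT, steps=20_000):
    """Euler-integrate da/dt = 2 a (s - a) independently for each mode."""
    s = np.asarray(s, dtype=float)
    a = np.full_like(s, a0)          # small initialization near the saddle at 0
    traj = np.empty((steps + 1, s.size))
    traj[0] = a
    for k in range(steps):
        a = a + dt * 2.0 * a * (s - a)
        traj[k + 1] = a
    return traj

s = np.array([3.0, 1.0])             # hypothetical singular values of the target
traj = simulate_modes(s)

# First step at which each mode reaches half of its target value.
half_steps = [int(np.argmax(traj[:, i] >= 0.5 * s[i])) for i in range(s.size)]
half_times = [k * DT for k in half_steps]

print("half-learning times per mode:", half_times)
```

Each mode sits near zero for a long plateau and then rises sharply, and the mode with the larger singular value is learned first, which is the stepwise behavior the term saddle-to-saddle refers to.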
