Stochastic Gradient Descent in the Saddle-to-Saddle Regime of Deep Linear Networks

Guillaume Corlouer; Avi Semler; Alexander Strang; Alexander Gietelink Oldenziel

arXiv:2604.06366·cs.LG·April 9, 2026

TL;DR

This paper analyzes how stochastic gradient descent (SGD) noise shapes the training dynamics of deep linear networks. It finds that the noise encodes how far feature learning has progressed, without changing the saddle-to-saddle regime itself.

Contribution

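Under the assumption of aligned and balanced weights, the paper derives an exact decomposition of SGD dynamics in deep linear networks (DLNs) into one-dimensional per-mode stochastic differential equations. It also characterizes the stationary distribution of each mode, highlighting the role of noise in feature learning.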

Findings

01. Maximal diffusion along a mode occurs before the corresponding feature is fully learned.

02. In the absence of label noise, the stationary distribution of SGD matches that of gradient flow.

03. With label noise, the stationary distribution of SGD approximates a Boltzmann distribution.

Abstract

Deep linear networks (DLNs) are used as an analytically tractable model of the training dynamics of deep neural networks. While gradient descent in DLNs is known to exhibit saddle-to-saddle dynamics, the impact of stochastic gradient descent (SGD) noise on this regime remains poorly understood. We investigate the dynamics of SGD during training of DLNs in the saddle-to-saddle regime. We model the training dynamics as stochastic Langevin dynamics with anisotropic, state-dependent noise. Under the assumption of aligned and balanced weights, we derive an exact decomposition of the dynamics into a system of one-dimensional per-mode stochastic differential equations. This establishes that the maximal diffusion along a mode precedes the corresponding feature being completely learned. We also derive the stationary distribution of SGD for each mode: in the absence of label noise, its marginal…
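To make the idea of a one-dimensional per-mode equation concrete, here is a minimal Euler-Maruyama sketch in Python. It is an illustration under stated assumptions, not the paper's derivation. The drift `2 * a * (s - a)` is the classical mode-strength equation for a two-layer linear network with aligned, balanced weights. The diffusion term `noise_scale * a * (s - a)` is a placeholder chosen only because it vanishes at `a = 0` and `a = s`. The names `simulate_mode`, `half_time`, `s`, `a0`, and `noise_scale` are hypothetical.

```python
import numpy as np


def simulate_mode(s, a0=1e-3, noise_scale=0.0, dt=1e-3, steps=20000, seed=0):
    """Euler-Maruyama integration of a toy per-mode SDE.

    da = 2 a (s - a) dt + noise_scale * a (s - a) dW

    s           : target singular value of the mode (assumed given)
    a0          : small initial mode strength (start near the saddle at 0)
    noise_scale : strength of the placeholder state-dependent noise
    """
    rng = np.random.default_rng(seed)
    a = np.empty(steps + 1)
    a[0] = a0
    for t in range(steps):
        drift = 2.0 * a[t] * (s - a[t])
        # Placeholder diffusion: NOT the paper's expression, illustration only.
        diffusion = noise_scale * a[t] * (s - a[t])
        a[t + 1] = a[t] + drift * dt + diffusion * np.sqrt(dt) * rng.standard_normal()
    return a


def half_time(a, s, dt):
    """Time at which the trajectory first reaches half of its target s."""
    return dt * int(np.argmax(a >= 0.5 * s))


# Two modes with different singular values, integrated without noise:
# the stronger mode leaves the saddle first, giving stepwise learning.
strong = simulate_mode(s=3.0)
weak = simulate_mode(s=1.0)
print("half-time, s=3:", half_time(strong, 3.0, 1e-3))
print("half-time, s=1:", half_time(weak, 1.0, 1e-3))

# The same mode with the placeholder noise switched on.
noisy = simulate_mode(s=3.0, noise_scale=0.5, seed=1)
print("final value with noise:", noisy[-1])
```

With `noise_scale=0.0` the sketch reduces to plain gradient flow for each mode, so it can serve as a baseline. The paper's actual diffusion coefficient would be substituted for the placeholder term.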
