SGD at the Edge of Stability: The Stochastic Sharpness Gap
Fangshuo Liao, Afroditi Kolomvaki, Anastasios Kyrillidis

TL;DR
This paper extends the theory of the Edge of Stability phenomenon in neural network training to mini-batch SGD, analyzing how stochastic gradient noise holds the sharpness a predictable gap below the full-batch threshold of 2/η.
Contribution
It introduces stochastic self-stabilization, providing a theoretical framework and a closed-form formula for the sharpness gap in SGD, explaining the effects of batch size on solution sharpness.
Findings
Under SGD, gradient noise makes the sharpness stabilize below 2/η.
A closed-form formula predicts the sharpness gap from the gradient noise and the training parameters.
Smaller batch sizes lead to flatter solutions, matching empirical observations.
Abstract
When training neural networks with full-batch gradient descent (GD) and step size η, the largest eigenvalue of the Hessian -- the sharpness -- rises to 2/η and hovers there, a phenomenon termed the Edge of Stability (EoS). Damian et al. (2023) showed that this behavior is explained by a self-stabilization mechanism driven by third-order structure of the loss, and that GD implicitly follows projected gradient descent (PGD) on the constraint that the sharpness is at most 2/η. For mini-batch stochastic gradient descent (SGD), the sharpness stabilizes below 2/η, with the gap widening as the batch size decreases; yet no theoretical explanation exists for this suppression. We introduce stochastic self-stabilization, extending the self-stabilization framework to SGD. Our key insight is that gradient noise injects variance into the oscillatory…
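The 2/η threshold in the abstract can be seen on the simplest possible loss. The sketch below is not taken from the paper: it assumes a one-dimensional quadratic f(x) = 0.5 * sharpness * x^2, whose Hessian is the constant `sharpness`, and the function name `gd_on_quadratic` and all numeric values are illustrative choices. On this loss each GD step multiplies x by (1 - step_size * sharpness), so the iterates shrink when the sharpness is below 2/η and grow when it is above. It covers only the full-batch, constant-curvature case; it does not model self-stabilization or gradient noise.

```python
def gd_on_quadratic(sharpness, step_size, x0=1.0, num_steps=50):
    """Run full-batch GD on f(x) = 0.5 * sharpness * x**2 and return |x| at the end.

    Illustrative helper (not from the paper). The Hessian of f is the
    constant `sharpness`, so each step scales x by (1 - step_size * sharpness).
    """
    x = x0
    for _ in range(num_steps):
        grad = sharpness * x          # f'(x) for the quadratic
        x = x - step_size * grad      # plain gradient descent update
    return abs(x)


step_size = 0.1
threshold = 2.0 / step_size           # the 2/eta stability threshold

# Sharpness 10% below the threshold: |1 - eta * sharpness| < 1, iterates decay.
below = gd_on_quadratic(0.9 * threshold, step_size)

# Sharpness 10% above the threshold: |1 - eta * sharpness| > 1, iterates blow up.
above = gd_on_quadratic(1.1 * threshold, step_size)

print(f"threshold 2/eta = {threshold}")
print(f"|x| after 50 steps, sharpness below threshold: {below:.3e}")
print(f"|x| after 50 steps, sharpness above threshold: {above:.3e}")
```

In a neural network the curvature changes along the trajectory, which is why the sharpness can rise to the threshold and hover there rather than simply diverge; the quadratic only shows where the threshold comes from.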
