Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD
Arseniy Andreyev, Pierfrancesco Beneventano

TL;DR
This paper explores the Edge of Stochastic Stability (EoSS) in mini-batch SGD, revealing how batch sharpness stabilizes around 2/η and influences the curvature and minima of neural network training.
Contribution
It introduces the EoSS regime for mini-batch SGD, linking batch sharpness stabilization to training dynamics and generalization, extending prior full-batch stability results.
Findings
Batch sharpness stabilizes around 2/η in the EoSS regime.
λ_max is generally smaller than batch sharpness, so it is suppressed below 2/η, explaining flatter minima.
Smaller batches and larger step sizes favor flatter minima.
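The 2/η threshold in these findings traces back to the classical stability condition for gradient descent on a quadratic, a standard fact rather than a result specific to this paper. The sketch below is only an illustration under stated assumptions: the function name `gd_on_quadratic`, its parameters, and the choice η = 0.1 are hypothetical. On f(x) = 0.5 * c * x^2 each update multiplies x by (1 - η c), so the iterates shrink when the curvature c is below 2/η and grow when it is above.

```python
def gd_on_quadratic(curvature, step_size, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2.

    Returns |x| after `steps` updates. All names here are illustrative.
    """
    x = x0
    for _ in range(steps):
        grad = curvature * x          # f'(x) = curvature * x
        x = x - step_size * grad      # x <- (1 - step_size * curvature) * x
    return abs(x)


eta = 0.1                  # assumed step size, chosen only for the demo
threshold = 2.0 / eta      # stability threshold on the curvature

below = gd_on_quadratic(0.9 * threshold, eta)   # curvature just under 2/eta
above = gd_on_quadratic(1.1 * threshold, eta)   # curvature just over 2/eta

print("curvature below 2/eta, final |x|:", below)
print("curvature above 2/eta, final |x|:", above)
```

The first run contracts toward zero and the second blows up, which is why a curvature measure sitting near 2/η marks the boundary between stable and unstable training.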
Abstract
Recent findings by Cohen et al., 2021, demonstrate that when training neural networks using full-batch gradient descent with a step size of η, the largest eigenvalue λ_max of the full-batch Hessian consistently stabilizes around 2/η. These results have significant implications for convergence and generalization. This, however, is not the case for mini-batch optimization algorithms, limiting the broader applicability of the consequences of these findings. We show mini-batch Stochastic Gradient Descent (SGD) trains in a different regime we term Edge of Stochastic Stability (EoSS). In this regime, what stabilizes at 2/η is Batch Sharpness: the expected directional curvature of mini-batch Hessians along their corresponding stochastic gradients. As a consequence, λ_max -- which is generally smaller than Batch Sharpness -- is suppressed, aligning with the…
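To make the definition of Batch Sharpness concrete, here is a minimal NumPy sketch on a toy least-squares problem. Several things are assumptions and not taken from the paper: the toy loss L_B(w) = ||X_B w - y_B||^2 / (2b), the function names `batch_sharpness` and `full_batch_lambda_max`, the Monte Carlo sampling of batches, and the normalization, which here divides the expected curvature term g_B^T H_B g_B by the expected squared gradient norm. The abstract only says "expected directional curvature", so the paper's exact normalization may differ.

```python
import numpy as np


def batch_sharpness(X, y, w, batch_size, num_batches=200, seed=0):
    """Monte Carlo estimate of E[g_B^T H_B g_B] / E[||g_B||^2] for least squares.

    Assumed toy loss per batch: L_B(w) = ||X_B w - y_B||^2 / (2 * batch_size).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    curvature_sum = 0.0
    grad_norm_sum = 0.0
    for _ in range(num_batches):
        idx = rng.choice(n, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        g = Xb.T @ (Xb @ w - yb) / batch_size    # mini-batch gradient g_B
        Hg = Xb.T @ (Xb @ g) / batch_size        # Hessian-vector product H_B g_B
        curvature_sum += g @ Hg
        grad_norm_sum += g @ g
    return curvature_sum / grad_norm_sum


def full_batch_lambda_max(X):
    """Largest eigenvalue of the full-batch Hessian X^T X / n."""
    H = X.T @ X / X.shape[0]
    return np.linalg.eigvalsh(H)[-1]   # eigvalsh returns ascending eigenvalues


# Synthetic data, chosen only for illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(256, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=256)
w = np.zeros(10)

print("lambda_max (full batch):", full_batch_lambda_max(X))
for b in (4, 32, 256):
    print("batch size", b, "batch sharpness:", batch_sharpness(X, y, w, b))
```

When the batch is the whole dataset, the quantity reduces to a Rayleigh quotient of the full-batch Hessian and so cannot exceed λ_max. For small batches on this toy problem it would typically come out larger, which is in line with the abstract's remark that λ_max is generally the smaller of the two.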
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper addresses a fundamental and underexplored aspect of deep learning optimization, the instability regime of SGD, and makes an original attempt to distinguish between two forms of oscillation. The proposed concepts of curvature-driven oscillation and batch sharpness are interesting and potentially valuable for understanding SGD dynamics.
- Extensive experiments on CIFAR-10 provide empirical evidence that batch sharpness remains close to 2/η, supporting the proposed hypothesis.
- The Introduction and Background sections are disproportionately long (occupying over one-third of the main text), while the intuitive motivation or justification for the definition of batch sharpness, as well as its theoretical connection to the catapult effect, are insufficiently developed. Reducing the background in favor of more focused explanations on these points would improve the paper's overall clarity and intuition.
- The current definition of the catapult effect is somewhat ambiguous.
- The paper is very well motivated. Its contributions are solid and important. The breakdown of the oscillation in the SGD case is simple, elegant, and very useful. The comparisons with previous work seem sufficient.
- In addition to their main findings, the authors' conclusions regarding SGD vs. noisy gradient descent are very useful and speak to the potential of their findings to progress the field.
- Although the paper is about extending EoS to SGD, and involves a lot of comparisons between the two, the authors do not have an authoritative explanation why SGD does not follow EoS. That is, their results show why SGD follows EoSS, but it does not show why/when following EoSS corresponds to not following EoS - in the form of a more specific relationship between batch sharpness and $\lambda_{\max}$. In this sense the paper parallels Cohen et al. 2021 but does not complement it. Experiments at
The paper contains many fine details and extensive discussions on many different aspects, as well as the literature background. The research problem that is proposed in the paper is an interesting and important topic worth investigating.
(1) The presentation of the paper needs to be improved. I find it not that easy to follow. The Appendix is super long, and contains many random topics that do not seem to capture the essence of the major contributions of the paper. (2) When I read the proofs in the Appendix, there are too many places you used $\approx$ which should be made more rigorous by using Big O notation or other notations that can be made rigorous or at least you should make the meaning of $\approx$ more transparent and
Taxonomy
Topics: Stochastic processes and financial applications · Reservoir Engineering and Simulation Methods
Methods: Stochastic Gradient Descent
