Momentum Further Constrains Sharpness at the Edge of Stochastic Stability

Arseniy Andreyev; Advikar Ananthkumar; Marc Walden; Tomaso Poggio; Pierfrancesco Beneventano

arXiv:2604.14108 · cs.LG · April 16, 2026

TL;DR

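In SGD with momentum, Batch Sharpness settles on one of two plateaus depending on batch size: a lower one, $2(1 - \beta)/\eta$, at small batch sizes and a higher one, $2(1 + \beta)/\eta$, at large batch sizes. No single momentum-adjusted stability threshold explains both, and the small-batch regime favors flatter regions than vanilla SGD.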

Contribution

The paper identifies batch-size-dependent regimes of stochastic stability under momentum, extending the Edge of Stochastic Stability (EoSS) concept to SGD with momentum and mini-batch gradients, the setting used in practical deep learning.

Findings

1. Batch Sharpness stabilizes in one of two regimes, depending on batch size.

2. At small batch sizes, momentum amplifies stochastic fluctuations.

3. At large batch sizes, momentum recovers classical stabilization.

Abstract

Recent work suggests that (stochastic) gradient descent self-organizes near an instability boundary, shaping both optimization and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime with batch-size-dependent behavior that cannot be explained by a single momentum-adjusted stability threshold. Batch Sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes: at small batch sizes it converges to a lower plateau $2(1 - \beta)/\eta$, reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD; at large batch sizes it converges to a higher plateau $2(1 + \beta)/\eta$, where…
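
To make the two plateaus concrete, the sketch below evaluates both expressions for one choice of step size and momentum, then checks numerically that the higher value, $2(1 + \beta)/\eta$, matches the classical stability threshold of full-batch heavy-ball gradient descent on a one-dimensional quadratic. This is an illustration under stated assumptions, not the paper's experimental setup: the toy objective $f(x) = \tfrac{1}{2}\lambda x^2$, the helper names `plateaus` and `heavy_ball_diverges`, the values `eta = 0.01` and `beta = 0.9`, and the update convention $v \leftarrow \beta v + g$, $x \leftarrow x - \eta v$ are all chosen here for illustration.

```python
def plateaus(eta, beta):
    """Return the (small-batch, large-batch) plateau values quoted in the abstract."""
    return 2 * (1 - beta) / eta, 2 * (1 + beta) / eta


def heavy_ball_diverges(curvature, eta, beta, steps=2000, blowup=1e6):
    """Run full-batch heavy-ball GD on f(x) = 0.5 * curvature * x**2.

    Assumed update convention (hypothetical, not taken from the paper):
        v <- beta * v + grad,   x <- x - eta * v
    Returns True if |x| exceeds `blowup` within `steps` iterations.
    """
    x, v = 1.0, 0.0
    for _ in range(steps):
        grad = curvature * x      # exact gradient of the quadratic
        v = beta * v + grad       # momentum buffer
        x = x - eta * v           # parameter step
        if abs(x) > blowup:
            return True
    return False


eta, beta = 0.01, 0.9             # illustrative values only
low, high = plateaus(eta, beta)
print(f"small-batch plateau 2(1-beta)/eta = {low:.1f}")
print(f"large-batch plateau 2(1+beta)/eta = {high:.1f}")

# Deterministic heavy-ball is stable just below the higher plateau
# and unstable just above it.
print("diverges at 0.95 * high:", heavy_ball_diverges(0.95 * high, eta, beta))
print("diverges at 1.05 * high:", heavy_ball_diverges(1.05 * high, eta, beta))
```

In this noise-free toy problem, curvature slightly above the lower value $2(1 - \beta)/\eta$ is still stable, which is consistent with the abstract attributing the lower plateau to momentum amplifying stochastic fluctuations rather than to a deterministic instability.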
