On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization

Sharan Sahu; Cameron J. Hogan; Martin T. Wells

arXiv:2601.12238 · stat.ML · May 20, 2026

TL;DR

This paper gives a theoretical analysis showing that momentum variants of SGD (Polyak Heavy-Ball and Nesterov) can be suboptimal when the optimum drifts over time: momentum amplifies the drift-induced tracking error, and vanilla SGD provably outperforms it in drift-dominated regimes.

Contribution

The paper derives finite-time upper bounds and minimax lower bounds that demonstrate fundamental limitations of momentum in tracking time-varying optima under distribution shift.

Findings

01

Momentum incurs a drift-amplification penalty that grows as the momentum parameter $β$ approaches 1.

02

In drift-dominated regimes, momentum causes systematic tracking lag.

03

Vanilla SGD provably outperforms momentum in certain dynamic settings.
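
For reference, the textbook forms of the three update rules the findings compare are sketched below. The paper's exact parameterisation is not shown on this page, so the notation is an assumption: $x_t$ is the iterate, $g_t$ a stochastic gradient of the time-$t$ objective $f_t$, $\eta$ the step size, $β$ the momentum parameter, and $x_t^\star$ the time-$t$ minimiser.

```latex
\begin{align*}
\text{SGD:} \quad & x_{t+1} = x_t - \eta\, g_t(x_t) \\
\text{Heavy-Ball:} \quad & x_{t+1} = x_t - \eta\, g_t(x_t) + \beta\,(x_t - x_{t-1}) \\
\text{Nesterov:} \quad & y_t = x_t + \beta\,(x_t - x_{t-1}), \qquad x_{t+1} = y_t - \eta\, g_t(y_t) \\
\text{Tracking error:} \quad & \lVert x_t - x_t^\star \rVert, \qquad x_t^\star = \arg\min_x f_t(x)
\end{align*}
```

Setting $β = 0$ in either momentum rule recovers plain SGD.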

Abstract

In this paper, we provide a comprehensive theoretical analysis of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak Heavy-Ball and Nesterov) for tracking time-varying optima under strong convexity and smoothness. Our finite-time bounds reveal a sharp decomposition of tracking error into transient, noise-induced, and drift-induced components. This decomposition exposes a fundamental trade-off: while momentum is often used as a gradient-smoothing heuristic, under distribution shift it incurs an explicit drift-amplification penalty that diverges as the momentum parameter $β$ approaches 1, yielding systematic tracking lag. We complement these upper bounds with minimax lower bounds under gradient-variation constraints, proving this momentum-induced tracking penalty is not an analytical artifact but an information-theoretic barrier: in drift-dominated regimes, momentum…
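
The three-way split of tracking error into transient, noise-induced, and drift-induced parts can be seen in a scalar toy problem for plain SGD. The sketch below is a textbook calculation, not the paper's bound or experimental setup: the quadratic objective `0.5 * (x - theta_t)**2`, the linear drift `theta_t = drift * t`, the Gaussian gradient noise, and every function and parameter name are assumptions made for illustration. Momentum variants are left out because their parameterisation in the paper is not shown here.

```python
import numpy as np


def sgd_tracking_mse(eta=0.1, drift=0.01, sigma=1.0, x0_error=2.0,
                     steps=500, runs=20000, seed=0):
    """Simulate SGD on f_t(x) = 0.5 * (x - theta_t)**2 with theta_t = drift * t.

    Returns the mean squared tracking error (x_t - theta_t)**2 at each step t,
    averaged over `runs` independent runs. (Hypothetical toy model.)
    """
    rng = np.random.default_rng(seed)
    x = np.full(runs, x0_error)  # theta_0 = 0, so x_0 equals the initial error
    mse = np.empty(steps)
    for t in range(steps):
        theta = drift * t
        mse[t] = np.mean((x - theta) ** 2)
        # Noisy gradient of the time-t objective at the current iterate.
        grad = (x - theta) + sigma * rng.standard_normal(runs)
        x = x - eta * grad
    return mse


def predicted_mse(t, eta=0.1, drift=0.01, sigma=1.0, x0_error=2.0):
    """Closed-form mean squared error for the same scalar model."""
    rho = 1.0 - eta
    lag = drift / eta  # steady-state lag caused by the drift
    # Mean error: decaying transient plus the drift lag that builds up.
    mean_error = rho ** t * x0_error - lag * (1.0 - rho ** t)
    # Variance accumulated from gradient noise.
    noise_var = eta * sigma ** 2 / (2.0 - eta) * (1.0 - rho ** (2 * t))
    return mean_error ** 2 + noise_var
```

In this toy model the long-run error tends to `(drift / eta)**2 + eta * sigma**2 / (2 - eta)`, so shrinking `eta` lowers the noise term but raises the drift term. Comparing `sgd_tracking_mse()[-1]` with `predicted_mse(499)` is one way to check the simulation against the formula.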

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research · Gaussian Processes and Bayesian Inference