On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization
Sharan Sahu, Cameron J. Hogan, Martin T. Wells

TL;DR
This paper provides a theoretical analysis showing that momentum-based SGD variants can be suboptimal in nonstationary environments due to drift amplification, with vanilla SGD often outperforming them.
Contribution
The paper derives finite-time upper bounds and minimax lower bounds that demonstrate the fundamental limitations of momentum in tracking time-varying optima under distribution shift.
Findings
Momentum incurs a drift-amplification penalty that diverges as the momentum parameter approaches 1.
In drift-dominated regimes, momentum causes systematic tracking lag.
Vanilla SGD provably outperforms momentum in certain dynamic settings.
Abstract
In this paper, we provide a comprehensive theoretical analysis of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak Heavy-Ball and Nesterov) for tracking time-varying optima under strong convexity and smoothness. Our finite-time bounds reveal a sharp decomposition of tracking error into transient, noise-induced, and drift-induced components. This decomposition exposes a fundamental trade-off: while momentum is often used as a gradient-smoothing heuristic, under distribution shift it incurs an explicit drift-amplification penalty that diverges as the momentum parameter approaches 1, yielding systematic tracking lag. We complement these upper bounds with minimax lower bounds under gradient-variation constraints, proving this momentum-induced tracking penalty is not an analytical artifact but an information-theoretic barrier: in drift-dominated regimes, momentum…
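The tracking setting in the abstract can be explored with a small simulation. The sketch below is not taken from the paper: it assumes a one-dimensional quadratic loss f_t(x) = 0.5 (x - theta_t)^2 whose minimizer theta_t follows a Gaussian random walk, and it applies the standard Polyak heavy-ball update with additive gradient noise. The function name `tracking_mse` and all parameter values are hypothetical, chosen only for illustration. It reports the empirical mean squared tracking error for several momentum values and makes no claim to reproduce the paper's bounds or constants.

```python
import random


def tracking_mse(beta, eta=0.1, drift_std=0.01, noise_std=0.1, steps=2000, seed=0):
    """Mean squared tracking error of heavy-ball SGD on a drifting 1-D quadratic.

    Illustrative assumptions (not from the paper):
      loss      f_t(x) = 0.5 * (x - theta_t)**2
      drift     theta_t follows a Gaussian random walk with std `drift_std`
      gradient  exact gradient plus Gaussian noise with std `noise_std`
    Setting beta = 0 recovers vanilla SGD.
    """
    rng = random.Random(seed)
    x = x_prev = theta = 0.0
    errors = []
    for _ in range(steps):
        # Noisy gradient of f_t evaluated at the current iterate.
        grad = (x - theta) + noise_std * rng.gauss(0.0, 1.0)
        # Polyak heavy-ball step: gradient step plus momentum term.
        x_next = x - eta * grad + beta * (x - x_prev)
        x_prev, x = x, x_next
        # The optimum moves after each update.
        theta += drift_std * rng.gauss(0.0, 1.0)
        errors.append((x - theta) ** 2)
    # Average over the second half of the run to skip the transient phase.
    tail = errors[steps // 2:]
    return sum(tail) / len(tail)


if __name__ == "__main__":
    for beta in (0.0, 0.5, 0.9, 0.99):
        print(f"beta={beta:<4}  tracking MSE={tracking_mse(beta):.5f}")
```

Raising `drift_std` relative to `noise_std` moves the simulation toward a drift-dominated regime, and lowering it moves toward a noise-dominated one. Which momentum value tracks best in this toy setup depends on the step size and on both noise levels, so the output should be read as an experiment to vary rather than a confirmation of the theoretical results.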
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research · Gaussian Processes and Bayesian Inference
