Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization
Sharan Sahu, Abir Sarkar, Cameron J. Hogan, Martin T. Wells

TL;DR
This paper provides a theoretical analysis of Adam in non-stationary stochastic optimization, revealing a noise-drift tradeoff that explains when Adam outperforms SGD and when it underperforms.
Contribution
It offers the first rigorous bounds characterizing Adam's behavior under non-stationarity, highlighting the impact of hyperparameters and distribution shifts.
Findings
Derived finite-time bounds for Adam's tracking error and stationarity gap.
Identified a noise-drift tradeoff influencing Adam's stability and performance.
Characterized when adaptive step-sizing benefits or harms Adam in non-stationary settings.
Abstract
We provide a theoretical analysis of Adam under non-stationary stochastic objectives, separating two regimes: Euclidean tracking under adaptive strong monotonicity of the Adam-preconditioned mean-gradient operator, and high-probability projected stationarity guarantees under general $L$-smooth objectives. In the tracking regime, we derive finite-time expected and high-probability bounds that decompose sharply into four components: initialization, objective drift, a first-moment tracking error governed by $\beta_1$, and a preconditioner perturbation governed by $\beta_2$. We characterize the burn-in time to reach Adam's irreducible tracking floor under constant and step-decay schedules. We also prove a high-probability bound on the average projected stationarity gap for Adam under distribution shift. Across both analyses, our bounds reveal a noise-drift tradeoff: in noise-dominated…
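To make the tracking setting concrete, here is a minimal simulation sketch. It is not the paper's experiment or method. Everything in it is an assumption chosen for illustration: the quadratic objective f_t(x) = 0.5 * ||x - theta_t||^2, the random-walk drift of the minimizer theta_t, the Gaussian gradient noise, the hypothetical function name `track_drifting_quadratic`, and all default parameter values. The Adam branch is the standard bias-corrected update, with the first-moment decay rate written as `beta1` and the second-moment (preconditioner) decay rate as `beta2`.

```python
import numpy as np


def track_drifting_quadratic(optimizer, steps=2000, lr=0.05, drift=0.01,
                             noise_std=1.0, beta1=0.9, beta2=0.999,
                             eps=1e-8, dim=5, seed=0):
    """Average squared tracking error ||x_t - theta_t||^2 over the last half
    of the run, on the toy objective f_t(x) = 0.5 * ||x - theta_t||^2.

    All modeling choices here (objective, drift model, noise model, defaults)
    are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)  # drifting minimizer of f_t
    x = np.zeros(dim)      # optimizer iterate
    m = np.zeros(dim)      # Adam first-moment estimate
    v = np.zeros(dim)      # Adam second-moment estimate
    errors = []

    for t in range(1, steps + 1):
        # Assumed drift model: the minimizer performs a Gaussian random walk.
        theta = theta + drift * rng.standard_normal(dim)
        # Noisy gradient of f_t at the current iterate.
        grad = (x - theta) + noise_std * rng.standard_normal(dim)

        if optimizer == "sgd":
            x = x - lr * grad
        elif optimizer == "adam":
            m = beta1 * m + (1.0 - beta1) * grad
            v = beta2 * v + (1.0 - beta2) * grad ** 2
            m_hat = m / (1.0 - beta1 ** t)  # bias correction
            v_hat = v / (1.0 - beta2 ** t)
            x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
        else:
            raise ValueError(f"unknown optimizer: {optimizer!r}")

        errors.append(float(np.sum((x - theta) ** 2)))

    # Discard the first half as burn-in and average the rest.
    return float(np.mean(errors[steps // 2:]))


if __name__ == "__main__":
    # Sweep drift size and gradient noise to compare the two optimizers.
    for drift, noise_std in [(0.001, 1.0), (0.1, 0.1)]:
        for name in ("sgd", "adam"):
            err = track_drifting_quadratic(name, drift=drift, noise_std=noise_std)
            print(f"drift={drift} noise_std={noise_std} {name}: {err:.4f}")
```

Varying `drift` against `noise_std`, or changing `beta1` and `beta2`, gives a rough feel for how drift and gradient noise pull the tracking error in different directions. The numbers come from this toy model only and should not be read as a reproduction of the paper's bounds.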
