Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization

Sharan Sahu; Abir Sarkar; Cameron J. Hogan; Martin T. Wells

arXiv:2605.04269·stat.ML·May 7, 2026

TL;DR

This paper provides a theoretical analysis of Adam in non-stationary stochastic optimization, revealing a noise-drift tradeoff that explains when Adam outperforms or underperforms SGD.

Contribution

It offers the first rigorous bounds characterizing Adam's behavior under non-stationarity, highlighting the impact of hyperparameters and distribution shifts.

Findings

01. Derived finite-time bounds for Adam's tracking error and stationarity gap.

02. Identified a noise-drift tradeoff influencing Adam's stability and performance.

03. Characterized when adaptive step-sizing benefits or harms Adam in non-stationary settings.

Abstract

We provide a theoretical analysis of Adam under non-stationary stochastic objectives, separating two regimes: Euclidean tracking under adaptive strong monotonicity of the Adam-preconditioned mean-gradient operator, and high-probability projected stationarity guarantees under general $L$-smooth objectives. In the tracking regime, we derive finite-time expected and high-probability bounds that decompose sharply into four components: initialization, objective drift, a first-moment tracking error governed by $\beta_{1}$, and a preconditioner perturbation governed by $\beta_{2}$. We characterize the burn-in time to reach Adam's irreducible tracking floor under constant and step-decay schedules. We also prove a high-probability bound on the average projected stationarity gap for Adam under distribution shift. Across both analyses, our bounds reveal a noise-drift tradeoff: in noise-dominated…
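To make the tracking setting concrete, the following is a minimal sketch of a toy experiment, not the paper's setup or analysis. It assumes a one-dimensional quadratic objective $f_t(x) = \tfrac{1}{2}(x - \theta_t)^2$ whose minimizer $\theta_t$ moves by a fixed amount each step, with additive Gaussian gradient noise. The function name `track_drifting_quadratic`, the linear drift model, and all default hyperparameter values are hypothetical choices made for illustration. Adam is implemented with the standard bias-corrected update, in which `beta1` controls the first-moment average and `beta2` controls the second-moment preconditioner.

```python
import math
import random


def track_drifting_quadratic(steps=2000, drift=0.01, noise_std=1.0,
                             lr=0.05, beta1=0.9, beta2=0.999,
                             eps=1e-8, seed=0):
    """Run SGD and Adam on f_t(x) = 0.5 * (x - theta_t)^2 with a moving theta_t.

    Returns (sgd_error, adam_error): the squared tracking error
    (x_t - theta_t)^2 averaged over all steps, one value per method.
    The linear drift and Gaussian noise are illustrative assumptions.
    """
    rng = random.Random(seed)
    theta = 0.0            # drifting minimizer (assumed linear drift)
    x_sgd = x_adam = 0.0   # both methods start at the initial minimizer
    m = v = 0.0            # Adam first- and second-moment estimates
    err_sgd = err_adam = 0.0

    for t in range(1, steps + 1):
        theta += drift                  # objective drift
        xi = rng.gauss(0.0, noise_std)  # gradient noise, shared by both methods

        # SGD step on the noisy gradient of f_t.
        g_sgd = (x_sgd - theta) + xi
        x_sgd -= lr * g_sgd

        # Standard Adam step with bias correction.
        g = (x_adam - theta) + xi
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        x_adam -= lr * m_hat / (math.sqrt(v_hat) + eps)

        # Accumulate squared distance to the current minimizer.
        err_sgd += (x_sgd - theta) ** 2
        err_adam += (x_adam - theta) ** 2

    return err_sgd / steps, err_adam / steps
```

Calling `track_drifting_quadratic(noise_std=..., drift=...)` over a grid of noise and drift levels gives a rough empirical view of how the two methods trade off in this toy model. Which method tracks better depends on the chosen values, and nothing here reproduces the paper's bounds.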
