SGD with Adaptive Preconditioning: Unified Analysis and Momentum Acceleration

Dmitry Kovalev

arXiv:2506.23803·cs.LG·July 1, 2025

Open Access · 3 Reviews

TL;DR

This paper develops a unified theoretical framework for adaptive gradient methods such as AdaGrad, shows that they can be accelerated with momentum, and helps explain their practical success, including that of the Adam optimizer.

Contribution

It offers a unified convergence analysis of SGD with adaptive preconditioning, proves that methods such as AdaGrad and DASGO can be accelerated with Nesterov momentum, and establishes a connection between two recently proposed algorithms, Scion and DASGO.

Findings

01

Unified convergence analysis for adaptive gradient methods.

02

First theoretical guarantees for DASGO.

03

Acceleration of adaptive methods with momentum.

Abstract

In this paper, we revisit stochastic gradient descent (SGD) with AdaGrad-type preconditioning. Our contributions are twofold. First, we develop a unified convergence analysis of SGD with adaptive preconditioning under anisotropic or matrix smoothness and noise assumptions. This allows us to recover state-of-the-art convergence results for several popular adaptive gradient methods, including AdaGrad-Norm, AdaGrad, and ASGO/One-sided Shampoo. In addition, we establish the fundamental connection between two recently proposed algorithms, Scion and DASGO, and provide the first theoretical guarantees for the latter. Second, we show that the convergence of methods like AdaGrad and DASGO can be provably accelerated beyond the best-known rates using Nesterov momentum. Consequently, we obtain the first theoretical justification that AdaGrad-type algorithms can simultaneously benefit from both…
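As a rough illustration of what "SGD with AdaGrad-type preconditioning" means, the sketch below implements the two simplest textbook variants named in the abstract: AdaGrad-Norm (one scalar accumulator) and diagonal AdaGrad (one accumulator per coordinate). It is not the paper's unified algorithm or its accelerated variant. The function name `adagrad_sgd`, its parameters, and the toy quadratic are assumptions made for this example only.

```python
import numpy as np

def adagrad_sgd(grad_fn, x0, eta=1.0, steps=500, mode="diagonal", eps=1e-12):
    """Textbook AdaGrad-type SGD (hypothetical helper, not from the paper).

    mode="norm":     scalar accumulator of squared gradient norms (AdaGrad-Norm)
    mode="diagonal": per-coordinate accumulator of squared gradients (AdaGrad)
    """
    x = np.asarray(x0, dtype=float).copy()
    acc = 0.0 if mode == "norm" else np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x)                 # may be a stochastic gradient
        if mode == "norm":
            acc += float(g @ g)        # accumulate ||g||^2
        else:
            acc += g * g               # accumulate g_i^2 per coordinate
        # Preconditioned step: divide by the root of the accumulator.
        x -= eta * g / (np.sqrt(acc) + eps)
    return x

# Toy badly scaled quadratic f(x) = 0.5 * sum(a_i * x_i^2), an assumption
# chosen only to exercise both variants with exact (noise-free) gradients.
a = np.array([100.0, 1.0])
f = lambda x: 0.5 * float(np.sum(a * x * x))
grad = lambda x: a * x
x0 = np.array([1.0, 1.0])

x_norm = adagrad_sgd(grad, x0, mode="norm")
x_diag = adagrad_sgd(grad, x0, mode="diagonal")
print(f(x0), f(x_norm), f(x_diag))
```

The only difference between the two modes is the shape of the accumulator, which is the kind of structural choice of preconditioner that the abstract's unified analysis is described as covering.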

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 2

Strengths

1. The paper offers a clear and unified theoretical analysis of adaptive optimization methods. Its findings are novel, with a single theorem (Theorem 1) that recovers known bounds for several adaptive algorithms and establishes the first convergence guarantees for DASGO. The framework’s generality extends to both the smoothness and variance assumptions, and the technical contributions are strong.

2. Algorithm 2 enhances adaptive methods with diagonal preconditioning by incorporating momentum, ac

Weaknesses

1. Several results implicitly require the smoothness and noise operators (e.g., $L,\Sigma$) to live in the same structured space as the preconditioner $\mathcal H$ (e.g., Assumption 2, even before the acceleration results). For diagonal $\mathcal H$, this effectively enforces axis-aligned curvature/noise, leaving other cases out of scope. In problems where principal directions are not coordinate-aligned, the guarantees may not hold.

2. The theory sets $\eta$ proportional to a radius $R \ge R(x^

Reviewer 02 · Rating 2 · Confidence 3

Strengths

1. The paper is written in an easy-to-follow way, making all the results and notation easy to understand.

2. The convergence results are obtained under Hölder smoothness, which is a more general assumption than those used in existing work on the same algorithms.

Weaknesses

1. There seem to be some overclaims in the contributions section. First, there is established work on the acceleration of adaptive gradient methods; I kindly refer the authors to [1] for acceleration results for adaptive gradient methods in the diagonal-preconditioner case. Also, the convergence of DASGO can actually be covered by Theorem 3.11 in [2] for block-wise RMSProp, which makes the contribution of this paper not significant enough.

2. Assumption 4 in the acceleration part is very

Reviewer 03 · Rating 4 · Confidence 4

Strengths

* Provides solid theoretical results through a unified analysis of stochastic gradient descent with adaptive preconditioning.
* Offers a clear and well-structured overview of prior work.
* Demonstrates overall rigor in the presentation of assumptions and theoretical arguments.
* Includes a relevant and insightful analysis of the Nesterov acceleration applied to the proposed algorithm.

Weaknesses

* The level of novelty of the paper is difficult to assess in comparison with existing work.
* The paper lacks numerical experiments that could illustrate the potential benefits of the theoretical results and provide insights into the empirical performance of algorithms encompassed by the proposed unified analysis.

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Gaussian Processes and Bayesian Inference