A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muon

S. Gratton; Ph. L. Toint

arXiv:2604.17423·cs.LG·May 4, 2026

TL;DR

This paper proposes a unified convergence framework for adaptive first-order optimization algorithms in the nonconvex setting, covering AdaNorm, full and diagonal AdaGrad, and adaptive variants of Shampoo and Muon, and gives a fully stochastic global rate-of-convergence analysis for all of them.

Contribution

It provides a single theoretical analysis covering multiple adaptive methods, with and without two types of momentum, that allows heterogeneous geometries across groups of variables and does not assume bounded stochastic gradients or a small enough stepsize.

Findings

01. Established a stochastic global convergence rate for all methods in the framework.

02. Unified analysis applies to methods with and without momentum.

03. Framework accommodates heterogeneous variable geometries.

Abstract

A unified framework for first-order optimization algorithms for nonconvex unconstrained optimization is proposed that uses adaptively preconditioned gradients and includes popular methods such as full and diagonal AdaGrad, AdaNorm, as well as adaptive variants of Shampoo and Muon. This framework also allows combining heterogeneous geometries across different groups of variables while preserving a unified convergence analysis. A fully stochastic global rate-of-convergence analysis is conducted for all methods in the framework, with and without two types of momentum, using reasonable assumptions on the variance of the gradient oracle and without assuming bounded stochastic gradients or small enough stepsize.
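
As a rough illustration of what "adaptively preconditioned gradients" can look like for two of the methods named in the abstract, the sketch below implements the common textbook forms of an AdaNorm (AdaGrad-Norm) step and a diagonal AdaGrad step in plain Python. It is not taken from the paper: the function names, the `lr` and `eps` parameters, and the toy objective are assumptions made for this example, and the paper's framework, momentum variants, and Shampoo/Muon variants are not reproduced here.

```python
import math


def adanorm_step(x, g, acc, lr=1.0, eps=1e-8):
    """One AdaNorm-style step (textbook form, assumed here).

    A single scalar accumulator of squared gradient norms scales
    every coordinate of the step by the same factor.
    """
    acc = acc + sum(gi * gi for gi in g)
    scale = lr / (math.sqrt(acc) + eps)
    return [xi - scale * gi for xi, gi in zip(x, g)], acc


def diag_adagrad_step(x, g, acc, lr=1.0, eps=1e-8):
    """One diagonal AdaGrad step (textbook form, assumed here).

    Each coordinate keeps its own accumulator of squared gradient
    entries, so each coordinate gets its own scaling.
    """
    acc = [ai + gi * gi for ai, gi in zip(acc, g)]
    x = [xi - lr * gi / (math.sqrt(ai) + eps)
         for xi, gi, ai in zip(x, g, acc)]
    return x, acc


def toy_grad(x):
    """Gradient of the hypothetical nonconvex toy objective
    f(x) = sum(x_i**4 / 4 - x_i**2 / 2)."""
    return [xi ** 3 - xi for xi in x]


if __name__ == "__main__":
    # Run both methods for a few iterations with exact gradients
    # (the paper analyses stochastic gradients; this demo does not).
    x_norm, acc_norm = [2.0, -0.5], 0.0
    x_diag, acc_diag = [2.0, -0.5], [0.0, 0.0]
    for _ in range(50):
        x_norm, acc_norm = adanorm_step(x_norm, toy_grad(x_norm), acc_norm, lr=0.5)
        x_diag, acc_diag = diag_adagrad_step(x_diag, toy_grad(x_diag), acc_diag, lr=0.5)
    print("AdaNorm iterate:", x_norm)
    print("Diagonal AdaGrad iterate:", x_diag)
```

The two functions differ only in the accumulator: one scalar shared by all variables for AdaNorm, one entry per variable for diagonal AdaGrad. That difference is a simple instance of the "geometry" choice the abstract refers to.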
