A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muon
S. Gratton, Ph. L. Toint

TL;DR
This paper introduces a unified convergence framework for adaptive first-order optimization algorithms in nonconvex settings. It covers full and diagonal AdaGrad, AdaNorm, and adaptive variants of Shampoo and Muon, and comes with a fully stochastic rate-of-convergence analysis.
Contribution
It provides a single theoretical analysis for multiple adaptive methods that allows heterogeneous geometries across groups of variables and two types of momentum, without assuming bounded stochastic gradients or a sufficiently small stepsize.
Findings
A fully stochastic global rate of convergence is established for all methods in the framework.
Unified analysis applies to methods with and without momentum.
Framework accommodates heterogeneous variable geometries.
Abstract
A unified framework for first-order optimization algorithms for nonconvex unconstrained optimization is proposed that uses adaptively preconditioned gradients and includes popular methods such as full and diagonal AdaGrad, AdaNorm, as well as adaptive variants of Shampoo and Muon. This framework also allows combining heterogeneous geometries across different groups of variables while preserving a unified convergence analysis. A fully stochastic global rate-of-convergence analysis is conducted for all methods in the framework, with and without two types of momentum, using reasonable assumptions on the variance of the gradient oracle and without assuming bounded stochastic gradients or small enough stepsize.
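To make "adaptively preconditioned gradients" and "heterogeneous geometries" concrete, here is a minimal Python sketch. It uses the textbook forms of AdaNorm (one scalar scaling built from accumulated squared gradient norms) and diagonal AdaGrad (one scaling per coordinate), which may differ in detail from the formulation analyzed in the paper. The test function, the starting point, the stepsize `lr`, the safeguard `eps`, and all function names are assumptions chosen for illustration. Gradients are exact here, whereas the paper treats a stochastic oracle, and momentum is omitted.

```python
import math


def grad(x):
    # Gradient of the separable nonconvex test function
    # f(x) = sum_i (x_i^2 + 3*sin(x_i)^2), an assumption made for illustration.
    return [2.0 * xi + 3.0 * math.sin(2.0 * xi) for xi in x]


def grad_norm(x):
    return math.sqrt(sum(gi * gi for gi in grad(x)))


def adanorm_step(x, g, state, lr=1.0, eps=1e-8):
    # AdaNorm: a single scalar scaling shared by all coordinates,
    # built from the running sum of squared gradient norms.
    state["sum_sq"] += sum(gi * gi for gi in g)
    scale = lr / math.sqrt(eps + state["sum_sq"])
    return [xi - scale * gi for xi, gi in zip(x, g)]


def diag_adagrad_step(x, g, state, lr=1.0, eps=1e-8):
    # Diagonal AdaGrad: one running sum of squared gradients per coordinate.
    state["sum_sq"] = [s + gi * gi for s, gi in zip(state["sum_sq"], g)]
    return [
        xi - lr * gi / math.sqrt(eps + s)
        for xi, gi, s in zip(x, g, state["sum_sq"])
    ]


def mixed_step(x, g, state, split=2, lr=1.0):
    # Heterogeneous geometries (illustrative): the first `split` variables
    # form an AdaNorm group, the remaining ones a diagonal AdaGrad group.
    head = adanorm_step(x[:split], g[:split], state["norm"], lr)
    tail = diag_adagrad_step(x[split:], g[split:], state["diag"], lr)
    return head + tail


def run(step, state, x0, iters=500):
    # Apply the chosen update rule for a fixed number of iterations.
    x = list(x0)
    for _ in range(iters):
        x = step(x, grad(x), state)
    return x


if __name__ == "__main__":
    x0 = [2.0, -1.5, 0.7]
    runs = {
        "AdaNorm": (adanorm_step, {"sum_sq": 0.0}),
        "diagonal AdaGrad": (diag_adagrad_step, {"sum_sq": [0.0] * 3}),
        "mixed groups": (
            mixed_step,
            {"norm": {"sum_sq": 0.0}, "diag": {"sum_sq": [0.0]}},
        ),
    }
    for name, (step, state) in runs.items():
        print(name, "final gradient norm:", grad_norm(run(step, state, x0)))
```

None of the three variants needs a stepsize tuned to the curvature of the objective. The accumulated gradient information shrinks the effective step automatically, which is the behavior that a convergence analysis "without assuming small enough stepsize" has to account for.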
