Convergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-sided and Two-Sided Preconditioning

Huan Li; Yiming Dong; Zhouchen Lin

arXiv:2601.07326 · math.OC · May 4, 2026

TL;DR

This paper analyzes the convergence of the AdamW-style Shampoo optimizer. A single analysis covers both one-sided and two-sided preconditioning and yields a nuclear-norm rate of $O(\sqrt{m+n}\,C/K^{1/4})$, which the authors argue is analogous to the optimal $O(C/K^{1/4})$ Frobenius-norm rate of SGD.

Contribution

It provides a unified convergence analysis for AdamW-style Shampoo, connecting one-sided and two-sided preconditioning, and relates its rate to that of SGD.
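To make the one-sided versus two-sided distinction concrete, the following is a minimal sketch of the two preconditioned directions as they appear in classical Shampoo, computed from a single gradient. It is an illustration under stated assumptions, not the paper's AdamW-style algorithm: the exponents (-1/2 for one-sided, -1/4 on each side for two-sided), the ridge term `eps`, and the helper names `inv_root`, `one_sided_direction`, and `two_sided_direction` are all assumptions, and accumulation over steps, moving averages, and weight decay are omitted.

```python
import numpy as np


def inv_root(A, p, eps=1e-8):
    """Return (A + eps*I)^(-1/p) for a symmetric PSD matrix A (hypothetical helper)."""
    w, V = np.linalg.eigh(A + eps * np.eye(A.shape[0]))
    w = np.maximum(w, eps)  # guard against round-off pushing eigenvalues below eps
    return (V * w ** (-1.0 / p)) @ V.T  # V diag(w^(-1/p)) V^T


def one_sided_direction(G):
    """Left-only preconditioning: L^(-1/2) G with L = G G^T (assumed form)."""
    L = G @ G.T
    return inv_root(L, 2) @ G


def two_sided_direction(G):
    """Left and right preconditioning: L^(-1/4) G R^(-1/4) (assumed form)."""
    L = G @ G.T
    R = G.T @ G
    return inv_root(L, 4) @ G @ inv_root(R, 4)


# Toy gradient for an m x n matrix parameter (sizes are arbitrary).
rng = np.random.default_rng(0)
G = rng.standard_normal((3, 5))
D_one = one_sided_direction(G)
D_two = two_sided_direction(G)
```

With statistics built from a single full-rank gradient, both directions reduce (up to the `eps` ridge) to the orthogonal factor of `G`, so `D_one` and `D_two` agree here. Once the statistics are accumulated over many different gradients, the two directions generally differ.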

Findings

01

Convergence rate of $O\left(\frac{\sqrt{m+n}\,C}{K^{1/4}}\right)$, measured in the nuclear norm of the gradient.

02

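The gradient norms satisfy $\|\nabla f(X)\|_{F} \leq \|\nabla f(X)\|_{*} \leq \sqrt{m+n}\,\|\nabla f(X)\|_{F}$, which links the nuclear-norm rate to a Frobenius-norm rate.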

03

The rate is analogous to the optimal $O(C/K^{1/4})$ Frobenius-norm rate of SGD, which supports the optimizer's effectiveness.

Abstract

This paper studies the AdamW-style Shampoo optimizer, an effective implementation of classical Shampoo that notably won the external tuning track of the AlgoPerf neural network training algorithm competition. Our analysis unifies one-sided and two-sided preconditioning and establishes the convergence rate $\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\left[\|\nabla f(X_{k})\|_{*}\right] \leq O\left(\frac{\sqrt{m+n}\,C}{K^{1/4}}\right)$ measured by nuclear norm, where $K$ represents the iteration number, $(m, n)$ denotes the size of matrix parameters, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $\|\nabla f(X)\|_{F} \leq \|\nabla f(X)\|_{*} \leq \sqrt{m+n}\,\|\nabla f(X)\|_{F}$, supporting that our convergence rate can be considered to be analogous to the optimal $\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\left[\|\nabla f(X_{k})\|_{F}\right] \leq O\left(\frac{C}{K^{1/4}}\right)$ convergence rate of SGD in the ideal case of…
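The norm relation $\|\nabla f(X)\|_{F} \leq \|\nabla f(X)\|_{*} \leq \sqrt{m+n}\,\|\nabla f(X)\|_{F}$ quoted in the abstract can be checked numerically. The sketch below does so with NumPy on a random matrix standing in for a gradient; the helper name `norm_sandwich`, the matrix sizes, and the random data are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def norm_sandwich(G):
    """Return (Frobenius norm, nuclear norm, sqrt(m+n) * Frobenius norm) of G."""
    m, n = G.shape
    fro = np.linalg.norm(G, "fro")  # square root of the sum of squared singular values
    nuc = np.linalg.norm(G, "nuc")  # sum of singular values
    return fro, nuc, np.sqrt(m + n) * fro


# A random 64 x 32 matrix stands in for a gradient (sizes are arbitrary).
rng = np.random.default_rng(0)
G = rng.standard_normal((64, 32))
fro, nuc, upper = norm_sandwich(G)
print(f"Frobenius: {fro:.3f}, nuclear: {nuc:.3f}, upper bound: {upper:.3f}")

# A rank-one matrix has a single nonzero singular value, so the lower bound is tight.
u = rng.standard_normal((64, 1))
v = rng.standard_normal((1, 32))
fro1, nuc1, _ = norm_sandwich(u @ v)
```

The lower bound holds because the Frobenius norm is the Euclidean norm of the singular values while the nuclear norm is their sum. The upper bound follows from the Cauchy-Schwarz inequality, since a matrix of size $(m, n)$ has at most $\min(m, n) \leq m + n$ nonzero singular values.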
