On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm

Huan Li; Yiming Dong; Zhouchen Lin

arXiv:2505.11840·cs.LG·October 6, 2025

TL;DR

This paper establishes an $O(\frac{\sqrt{d}}{K^{1/4}})$ convergence rate for the AdamW optimizer, measured by the $\ell_1$ norm of the gradient, and argues through theoretical analysis and experiments that this rate is analogous to the optimal rate of SGD.

Contribution

The paper provides the first theoretical convergence rate of AdamW in terms of the $\ell_1$ norm, extends the analysis to NAdamW, and validates it with experiments.

Findings

01. Convergence rate of AdamW is $O(\frac{\sqrt{d}}{K^{1/4}})$ in $\ell_1$ norm.

02. Empirical results show $\ell_1$ norm scales as $\sqrt{d}$ times $\ell_2$ norm.

03. NAdamW shares the same convergence rate as AdamW.

Abstract

As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not theoretically well-understood. This paper establishes the convergence rate $\frac{1}{K} \sum_{k=1}^{K} E[\|\nabla f(x^{k})\|_{1}] \leq O(\frac{\sqrt{d} C}{K^{1/4}})$ for AdamW measured by $\ell_{1}$ norm, where $K$ represents the iteration number, $d$ denotes the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $\|\nabla f(x)\|_{2} \ll \|\nabla f(x)\|_{1} \leq \sqrt{d} \|\nabla f(x)\|_{2}$ for any high-dimensional vector $x$ and $E[\|\nabla f(x)\|_{1}] \geq \sqrt{\frac{2d}{\pi}} E[\|\nabla f(x)\|_{2}]$ when each element of $\nabla f(x)$ is generated from Gaussian distribution $\mathcal{N}(0, 1)$. Empirically, our experimental results on real-world deep learning tasks…
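The relation between the two norms stated in the abstract can be checked numerically. The following is a minimal sketch, assuming NumPy and synthetic Gaussian vectors standing in for gradients; the helper name `gaussian_norms` and the chosen dimensions are illustrative and do not come from the paper. It compares the observed ratio of average $\ell_1$ to average $\ell_2$ norm against $\sqrt{2d/\pi}$ and against the worst-case factor $\sqrt{d}$.

```python
import numpy as np


def gaussian_norms(d, trials=200, seed=0):
    """Return per-sample l1 and l2 norms of `trials` vectors drawn from N(0, I_d).

    Hypothetical helper for illustration; the vectors are synthetic
    stand-ins for gradients, not gradients of a real model.
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((trials, d))
    l1 = np.abs(g).sum(axis=1)          # ||g||_1 for each sample
    l2 = np.sqrt((g ** 2).sum(axis=1))  # ||g||_2 for each sample
    return l1, l2


for d in (100, 10_000):
    l1, l2 = gaussian_norms(d)
    ratio = l1.mean() / l2.mean()       # estimate of E||g||_1 / E||g||_2
    gaussian_factor = np.sqrt(2 * d / np.pi)
    worst_case = np.sqrt(d)             # Cauchy-Schwarz: ||g||_1 <= sqrt(d) ||g||_2
    print(f"d={d}: ratio={ratio:.2f}, "
          f"sqrt(2d/pi)={gaussian_factor:.2f}, sqrt(d)={worst_case:.2f}")
```

Under these assumptions the estimated ratio should sit close to $\sqrt{2d/\pi}$ and below $\sqrt{d}$, which is the sense in which an $\ell_1$ bound with a $\sqrt{d}$ factor is comparable to an $\ell_2$ bound without it.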

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Stochastic Gradient Optimization Techniques

Methods: AdamW · Stochastic Gradient Descent