On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm
Huan Li, Yiming Dong, Zhouchen Lin

TL;DR
This paper establishes the convergence rate of the AdamW optimizer in high-dimensional settings and shows that it is comparable to SGD's optimal rate, supported by theoretical analysis and empirical validation.
Contribution
The paper provides the first theoretical convergence rate of AdamW in terms of the $\ell_1$ norm, extends the analysis to NAdamW, and validates the result with experiments.
Findings
Convergence rate of AdamW is $O(\frac{\sqrt{d}}{K^{1/4}})$ in $\ell_1$ norm.
Empirical results show the $\ell_1$ norm scales as $\sqrt{d}$ times the $\ell_2$ norm.
NAdamW shares the same convergence rate as AdamW.
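The $\sqrt{d}$ scaling between the two norms can be checked numerically on synthetic data. The sketch below is not from the paper: it assumes a vector with i.i.d. standard Gaussian entries as a stand-in for a gradient, and the helper name `norm_ratio` is hypothetical. It uses only the Python standard library and compares $\|g\|_1 / (\sqrt{d}\,\|g\|_2)$ with $\sqrt{2/\pi}$.

```python
import math
import random


def norm_ratio(d, seed=0):
    """Return ||g||_1 / ||g||_2 for one vector g with i.i.d. N(0, 1) entries.

    Assumption: a Gaussian vector stands in for a real gradient here.
    """
    rng = random.Random(seed)
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    l1 = sum(abs(v) for v in g)            # l1 norm: sum of absolute values
    l2 = math.sqrt(sum(v * v for v in g))  # l2 norm: Euclidean length
    return l1 / l2


if __name__ == "__main__":
    target = math.sqrt(2.0 / math.pi)  # constant expected for Gaussian entries
    for d in (100, 1_000, 10_000):
        scaled = norm_ratio(d) / math.sqrt(d)
        print(f"d={d:>6}  ratio/sqrt(d)={scaled:.3f}  sqrt(2/pi)={target:.3f}")
```

For any vector the ratio lies between 1 and $\sqrt{d}$. For Gaussian entries it concentrates near $\sqrt{2d/\pi}$ as $d$ grows, which is why an $\ell_1$ bound carrying a $\sqrt{d}$ factor can be read as analogous to an $\ell_2$ bound without it.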
Abstract
As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not theoretically well-understood. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E[\|\nabla f(x^k)\|_1] \le O(\frac{\sqrt{d}C}{K^{1/4}})$ for AdamW measured by $\ell_1$ norm, where $K$ represents the iteration number, $d$ denotes the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $\|\nabla f(x)\|_2 \ll \|\nabla f(x)\|_1 \le \sqrt{d}\|\nabla f(x)\|_2$ for any high-dimensional vector $x$ and $E[\|\nabla f(x)\|_1] \ge \sqrt{\frac{2d}{\pi}} E[\|\nabla f(x)\|_2]$ when each element of $\nabla f(x)$ is generated from Gaussian distribution $\mathcal{N}(0,1)$. Empirically, our experimental results on real-world deep learning tasks…
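The Gaussian comparison between the $\ell_1$ and $\ell_2$ norms mentioned in the abstract follows from a short standard calculation. The derivation below is a sketch and is not quoted from the paper. The symbol $g$ is an assumption introduced here for a vector in $\mathbb{R}^d$ with i.i.d. $\mathcal{N}(0,1)$ entries, standing in for the gradient.

```latex
\begin{align*}
E\big[|g_i|\big] &= \sqrt{\tfrac{2}{\pi}}
  && \text{(mean of a half-normal variable)} \\
E\big[\|g\|_1\big] &= \sum_{i=1}^{d} E\big[|g_i|\big] = d\,\sqrt{\tfrac{2}{\pi}}
  && \text{(linearity of expectation)} \\
E\big[\|g\|_2\big] &\le \sqrt{E\big[\|g\|_2^2\big]} = \sqrt{d}
  && \text{(Jensen's inequality)} \\
E\big[\|g\|_1\big] &= \sqrt{\tfrac{2d}{\pi}}\cdot\sqrt{d}
  \;\ge\; \sqrt{\tfrac{2d}{\pi}}\; E\big[\|g\|_2\big]
\end{align*}
```

Combined with the deterministic bound $\|g\|_1 \le \sqrt{d}\,\|g\|_2$ (Cauchy-Schwarz), this places the expected $\ell_1$ norm within a constant factor of $\sqrt{d}$ times the expected $\ell_2$ norm in the Gaussian case.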
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Stochastic Gradient Optimization Techniques
Methods: AdamW · Stochastic Gradient Descent
