Clipping Improves Adam-Norm and AdaGrad-Norm when the Noise Is Heavy-Tailed
Savelii Chezhegov, Yaroslav Klyukin, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáč, Eduard Gorbunov

TL;DR
This paper demonstrates that gradient clipping significantly improves the high-probability convergence of Adam-Norm and AdaGrad-Norm algorithms in the presence of heavy-tailed noise, both theoretically and empirically.
Contribution
It provides the first theoretical analysis showing that gradient clipping fixes the poor high-probability convergence of AdaGrad/Adam-type methods under heavy-tailed noise, supported by new convergence bounds and empirical validation.
Findings
Clipping yields high-probability convergence bounds for Adam-Norm and AdaGrad-Norm with polylogarithmic dependence on the confidence level.
Without clipping, these methods can have poor high-probability convergence under heavy-tailed noise.
Empirical results confirm the superiority of clipped variants in heavy-tailed noise scenarios.
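The summary does not spell out what "heavy-tailed" means or how clipping is defined. A common formalization in the clipping literature, given here as an assumption rather than as the paper's exact statement, is that the stochastic gradient is unbiased and has a bounded central moment of order alpha, where alpha may be less than 2 so that the variance can be infinite. The symbols f_xi, sigma, alpha, and lambda below are illustrative notation:

```latex
\mathbb{E}_{\xi}\left[\nabla f_{\xi}(x)\right] = \nabla f(x),
\qquad
\mathbb{E}_{\xi}\left[\left\|\nabla f_{\xi}(x) - \nabla f(x)\right\|^{\alpha}\right] \le \sigma^{\alpha},
\qquad \alpha \in (1, 2],

\operatorname{clip}(g, \lambda) = \min\left\{1, \frac{\lambda}{\|g\|}\right\} g .
```

Under this reading, clipping caps the norm of every stochastic gradient at lambda, which is what keeps rare, very large noise realizations from dominating a single step.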
Abstract
Methods with adaptive stepsizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models, especially Large Language Models. Typically, the noise in the stochastic gradients is heavy-tailed for the latter. Gradient clipping provably helps to achieve good high-probability convergence for such noise. However, despite the similarity between AdaGrad/Adam and Clip-SGD, the current understanding of the high-probability convergence of AdaGrad/Adam-type methods is limited in this case. In this work, we prove that AdaGrad/Adam (and their delayed versions) can have provably bad high-probability convergence if the noise is heavy-tailed. We also show that gradient clipping fixes this issue, i.e., we derive new high-probability convergence bounds with polylogarithmic dependence on the confidence level for AdaGrad-Norm and Adam-Norm with clipping and with/without delay for…
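To make the idea of combining clipping with a norm-based adaptive stepsize concrete, here is a minimal sketch of a clipped AdaGrad-Norm loop in its textbook form: a single scalar accumulator of squared (clipped) gradient norms sets the stepsize. This is an illustration under stated assumptions, not the paper's exact algorithm; the parameter names gamma, b0, and lam, the toy quadratic objective, and the symmetric Pareto noise are all hypothetical choices, and the Adam-Norm and delayed variants are not shown.

```python
import math
import random


def clip(g, lam):
    """Rescale g so that its Euclidean norm is at most lam."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm <= lam:
        return list(g)
    scale = lam / norm
    return [scale * v for v in g]


def clipped_adagrad_norm(grad_fn, x0, steps, gamma=1.0, b0=1.0, lam=1.0):
    """AdaGrad-Norm with clipped stochastic gradients (illustrative sketch).

    gamma, b0, and lam are hypothetical names for the base stepsize,
    the initial accumulator value, and the clipping level.
    """
    x = list(x0)
    b_sq = b0 ** 2  # scalar accumulator of squared gradient norms
    for _ in range(steps):
        g = clip(grad_fn(x), lam)        # clip before accumulating
        b_sq += sum(v * v for v in g)    # one shared scalar, not per-coordinate
        step = gamma / math.sqrt(b_sq)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


_rng = random.Random(0)


def heavy_tailed_grad(x):
    """Gradient of 0.5 * ||x||^2 plus symmetric Pareto noise.

    Shape 1.5 gives a finite mean but infinite variance, a simple
    stand-in for heavy-tailed gradient noise (assumption for the demo).
    """
    return [xi + _rng.choice((-1.0, 1.0)) * _rng.paretovariate(1.5) for xi in x]


x_final = clipped_adagrad_norm(heavy_tailed_grad, [5.0] * 10, steps=2000)
print(math.sqrt(sum(v * v for v in x_final)))
```

Because the clipped gradient norm never exceeds lam, each update moves the iterate by at most gamma * lam / b0, regardless of how large an individual noise sample is.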
Taxonomy
Topics: Neural dynamics and brain function · Neural Networks and Applications
Methods: AdaGrad · Gradient Clipping · Adam
