Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad

Zijian Liu

arXiv:2605.18694·math.OC·May 19, 2026

TL;DR

This paper investigates whether adaptive gradient methods like AdaGrad can converge under heavy-tailed gradient noise without modifications, providing new theoretical convergence rates and bounds in non-convex optimization.

Contribution

The paper gives the first provable convergence rate for AdaGrad under heavy-tailed noise, without prior knowledge of the tail index p, and proves an algorithm-dependent lower bound showing that AdaGrad cannot attain the minimax rate.

Findings

01

AdaGrad converges under heavy-tailed noise with a rate depending on the tail index p.

02

An algorithm-dependent lower bound indicates AdaGrad cannot attain the minimax rate.

03

AdaGrad-Norm achieves an improved rate under mild additional assumptions.
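The two methods named in the findings differ only in how the step size is scaled, which a minimal sketch can make concrete. Everything below is an illustrative assumption rather than the paper's setup: the toy objective f(x) = 0.5 * ||x||^2, the Student-t noise (with 1 < df < 2 it has a finite mean but infinite variance), the step size, and the helper names `adagrad_step`, `adagrad_norm_step`, `heavy_tailed_noise`, and `run` are all hypothetical.

```python
import numpy as np


def adagrad_step(x, g, acc, lr=0.1, eps=1e-8):
    """Coordinate-wise AdaGrad: one accumulator entry per coordinate."""
    acc = acc + g * g
    x = x - lr * g / (np.sqrt(acc) + eps)
    return x, acc


def adagrad_norm_step(x, g, acc, lr=0.1, eps=1e-8):
    """AdaGrad-Norm: a single scalar accumulator of squared gradient norms."""
    acc = acc + float(np.dot(g, g))
    x = x - lr * g / (np.sqrt(acc) + eps)
    return x, acc


def heavy_tailed_noise(rng, shape, df=1.5):
    """Zero-mean Student-t noise; for 1 < df < 2 the variance is infinite."""
    return rng.standard_t(df, size=shape)


def run(step, acc0, steps=2000, dim=10, seed=0):
    """Run one method on f(x) = 0.5 * ||x||^2 with noisy gradients.

    Returns ||x||, which equals the true gradient norm for this toy objective.
    """
    rng = np.random.default_rng(seed)
    x = np.ones(dim)
    acc = acc0
    for _ in range(steps):
        g = x + heavy_tailed_noise(rng, dim)  # true gradient plus noise
        x, acc = step(x, g, acc)
    return float(np.linalg.norm(x))


if __name__ == "__main__":
    print("AdaGrad      final ||grad||:", run(adagrad_step, np.zeros(10)))
    print("AdaGrad-Norm final ||grad||:", run(adagrad_norm_step, 0.0))
```

Neither update clips or normalizes the stochastic gradient; the accumulator alone damps the effect of large noise samples. A toy run like this illustrates the mechanics only and says nothing about the rates proved in the paper.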

Abstract

Many tasks in modern machine learning are observed to involve heavy-tailed gradient noise during the optimization process. To manage this realistic and challenging setting, new mechanisms, such as gradient clipping and gradient normalization, have been introduced to ensure the convergence of first-order algorithms. However, adaptive gradient methods, a famous class of modern optimizers that includes popular Adam and AdamW, often perform well even without any extra operations mentioned above. It is therefore natural to ask whether adaptive gradient methods can converge under heavy-tailed noise without any algorithmic changes. In this work, we take the first step toward answering this question by investigating a special case, AdaGrad, the origin of adaptive gradient methods. We provide the first provable convergence rate for AdaGrad in…
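For orientation, heavy-tailed gradient noise is commonly formalized in this literature as a bounded p-th moment condition on the stochastic gradient $g_t$. The statement below is that standard formulation, given here as an assumption; the symbols $\sigma$, $\eta$, $v_t$, and $b_t$ are notation chosen for this sketch and may differ from the paper's.

$$
\mathbb{E}\left[\lVert g_t - \nabla f(x_t)\rVert^{p}\right] \le \sigma^{p}, \qquad p \in (1, 2],
$$

where $p = 2$ recovers the usual bounded-variance setting and $p < 2$ allows infinite variance. The textbook AdaGrad update (coordinate-wise, with $\odot$ the elementwise product and the division taken elementwise) and its scalar variant AdaGrad-Norm are

$$
v_t = v_{t-1} + g_t \odot g_t, \qquad x_{t+1} = x_t - \eta \, \frac{g_t}{\sqrt{v_t}},
$$

$$
b_t^{2} = b_{t-1}^{2} + \lVert g_t\rVert^{2}, \qquad x_{t+1} = x_t - \eta \, \frac{g_t}{b_t}.
$$

Neither update takes $p$ or $\sigma$ as an input, which is what "without any algorithmic changes" refers to here.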
