A Methodology Establishing Linear Convergence of Adaptive Gradient Methods under PL Inequality
Kushal Chakrabarti, Mayank Baranwal

TL;DR
This paper proves that AdaGrad and Adam, two widely used adaptive gradient methods, converge at a linear rate on objectives that satisfy the Polyak-Łojasiewicz (PL) inequality, a guarantee previously available only for gradient descent and its momentum variants.
Contribution
It establishes the first linear convergence proofs for AdaGrad and Adam under the PL inequality, unifying analysis for both batch and stochastic gradients.
Findings
AdaGrad and Adam converge linearly under PL inequality.
The framework applies to both batch and stochastic gradients.
The same framework could potentially be used to analyze other variants of Adam.
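The first finding can be illustrated numerically with a minimal sketch. It is not the paper's experiment or proof: the toy objective f(x) = 0.5 * (x[0] + x[1] - 1)^2, the starting point, the step size `lr`, and the helper names are all assumptions chosen for illustration. This objective is not strongly convex (its minimizers form a line), but it satisfies the PL inequality because ||grad f||^2 = 4 f. The code runs a textbook diagonal AdaGrad update, which may differ in detail from the variant analyzed in the paper, and records the objective value at each step.

```python
import math


def f(x):
    """Toy PL objective (an assumption for illustration); minimum value is 0."""
    r = x[0] + x[1] - 1.0
    return 0.5 * r * r


def grad(x):
    """Gradient of f: both partial derivatives equal the residual r."""
    r = x[0] + x[1] - 1.0
    return [r, r]


def adagrad(x0, lr=0.1, eps=1e-8, steps=100):
    """Textbook diagonal AdaGrad with full (batch) gradients.

    Returns the final iterate and the list of objective values.
    """
    x = list(x0)
    accum = [0.0] * len(x)  # running sum of squared gradients per coordinate
    history = [f(x)]
    for _ in range(steps):
        g = grad(x)
        for i in range(len(x)):
            accum[i] += g[i] * g[i]
            x[i] -= lr * g[i] / (math.sqrt(accum[i]) + eps)
        history.append(f(x))
    return x, history


x_final, history = adagrad([3.0, -1.0])

# Per-step contraction factors f(x_{k+1}) / f(x_k); linear convergence
# means these stay bounded below 1.
ratios = [b / a for a, b in zip(history, history[1:]) if a > 0.0]
print("final objective:", history[-1])
print("worst contraction factor:", max(ratios))
```

On this toy problem the per-step ratio stays bounded away from 1 because the accumulated squared gradients remain bounded, so the effective step size does not shrink to zero. That is only an observation about this example, not a restatement of the paper's argument.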
Abstract
Adaptive gradient-descent optimizers are the standard choice for training neural network models. Despite their faster convergence than gradient-descent and remarkable performance in practice, the adaptive optimizers are not as well understood as vanilla gradient-descent. A reason is that the dynamic update of the learning rate that helps in faster convergence of these methods also makes their analysis intricate. Particularly, the simple gradient-descent method converges at a linear rate for a class of optimization problems, whereas the practically faster adaptive gradient methods lack such a theoretical guarantee. The Polyak-Łojasiewicz (PL) inequality is the weakest known class, for which linear convergence of gradient-descent and its momentum variants has been proved. Therefore, in this paper, we prove that AdaGrad and Adam, two well-known adaptive gradient methods, converge…
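For reference, the PL inequality and the linear rate for gradient descent mentioned in the abstract can be written out in their standard form. The symbols below are notation assumed here rather than taken from this page: \mu is the PL constant, L is the smoothness (gradient Lipschitz) constant, f^* is the minimum value, and x_k is the k-th iterate. The corresponding constants in the paper's AdaGrad and Adam results are not reproduced.

```latex
% PL inequality for a differentiable f with minimum value f^*:
\frac{1}{2}\,\lVert \nabla f(x) \rVert^{2} \;\ge\; \mu\,\bigl(f(x) - f^{*}\bigr)
\quad \text{for all } x, \ \text{for some } \mu > 0.

% Gradient descent with step size 1/L on an L-smooth f satisfying the PL inequality:
f(x_k) - f^{*} \;\le\; \Bigl(1 - \frac{\mu}{L}\Bigr)^{k}\,\bigl(f(x_0) - f^{*}\bigr).
```

"Linear convergence" here means the optimality gap shrinks by at least a constant factor per iteration, as in the second line.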
Taxonomy
Topics: Advanced Numerical Analysis Techniques
Methods: Adam · AdaGrad
