Convergence of AdaGrad for Non-convex Objectives: Simple Proofs and Relaxed Assumptions
Bohan Wang, Huishuai Zhang, Zhi-Ming Ma, Wei Chen

TL;DR
This paper presents a simplified convergence proof for AdaGrad on non-convex functions, achieving tighter iteration bounds and extending analysis to local smoothness conditions, thereby broadening understanding of AdaGrad's effectiveness.
Contribution
The authors introduce a novel auxiliary function to simplify convergence proofs for AdaGrad, improving bounds and extending analysis to local smoothness assumptions.
Findings
AdaGrad converges in O(1/ε^2) iterations in over-parameterized regimes.
Tighter convergence rates than previous O(1/ε^4) bounds are established.
Under (L_0,L_1)-smoothness, convergence requires a condition on the learning rate, and this requirement is shown to be necessary.
Abstract
We provide a simple convergence proof for AdaGrad optimizing non-convex objectives under only affine noise variance and bounded smoothness assumptions. The proof is essentially based on a novel auxiliary function that helps eliminate the complexity of handling the correlation between the numerator and denominator of AdaGrad's update. Leveraging simple proofs, we are able to obtain tighter results than existing results (Faw et al., 2022) and extend the analysis to several new and important cases. Specifically, for the over-parameterized regime, we show that AdaGrad needs only O(1/ε^2) iterations to ensure the gradient norm is smaller than ε, which matches the rate of SGD and is significantly tighter than the existing O(1/ε^4) rates for AdaGrad. We then discard the bounded smoothness assumption and consider a realistic…
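To illustrate the "numerator and denominator" correlation the abstract refers to, here is a minimal sketch of the textbook coordinate-wise AdaGrad update. It is not taken from the paper: the function names (`adagrad_step`, `run_adagrad`, `toy_grad`), the step size, and the deterministic quadratic toy objective are all assumptions chosen for illustration, and the variant analyzed by the authors may differ (for example, a norm-based version with stochastic gradients).

```python
import math


def toy_grad(x):
    # Hypothetical toy objective f(x) = 0.5 * sum(x_i^2); its gradient is x.
    # Deterministic and convex, used only to show the update mechanics.
    return list(x)


def adagrad_step(x, grad, accum, lr=0.5, eps=1e-8):
    """One coordinate-wise AdaGrad step on plain Python lists.

    accum holds the running sum of squared gradients per coordinate.
    """
    # The current gradient is added to the accumulator first ...
    new_accum = [a + g * g for a, g in zip(accum, grad)]
    # ... so the same gradient g appears in the numerator (g) and in the
    # denominator (sqrt of the accumulator). With stochastic gradients this
    # coupling is what makes the expected update hard to analyze.
    new_x = [
        xi - lr * g / (math.sqrt(a) + eps)
        for xi, g, a in zip(x, grad, new_accum)
    ]
    return new_x, new_accum


def run_adagrad(x0, steps=200, lr=0.5):
    x = list(x0)
    accum = [0.0] * len(x)
    for _ in range(steps):
        x, accum = adagrad_step(x, toy_grad(x), accum, lr=lr)
    return x, accum


if __name__ == "__main__":
    x, accum = run_adagrad([1.0, -2.0])
    grad_norm = math.sqrt(sum(g * g for g in toy_grad(x)))
    print("final iterate:", x)
    print("final gradient norm:", grad_norm)
```

On this toy problem the effective step size `lr / sqrt(accum)` shrinks automatically as squared gradients accumulate, which is the adaptive behavior the convergence analysis has to account for.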
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Advanced Bandit Algorithms Research
Methods: Stochastic Gradient Descent · AdaGrad
