The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance
Matthew Faw, Isidoros Tziotis, Constantine Caramanis, Aryan Mokhtari, Sanjay Shakkottai, Rachel Ward

TL;DR
This paper proves that AdaGrad-Norm, an adaptive stochastic gradient method, achieves near-optimal convergence rates in non-convex optimization without parameter tuning, even with unbounded gradients and affine noise variance.
Contribution
It demonstrates that AdaGrad-Norm attains order-optimal convergence rates under broad conditions, removing the need for tuning and relaxing previous assumptions.
Findings
AdaGrad-Norm achieves a $\tilde{O}(1/\sqrt{T})$ convergence rate.
The method works with unbounded gradients and affine noise variance.
No parameter tuning is required for convergence guarantees.
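For reference, "affine noise variance" is commonly formalized as the bound below. The notation is an assumption introduced here, not taken from this page: $F$ is the objective, $g(x)$ a stochastic gradient at $x$, and $\sigma_0, \sigma_1 \ge 0$ are constants. Setting $\sigma_1 = 0$ recovers the usual uniformly bounded variance assumption.

```latex
\mathbb{E}\left[ \| g(x) - \nabla F(x) \|^2 \right] \le \sigma_0^2 + \sigma_1^2 \, \| \nabla F(x) \|^2
```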
Abstract
We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives. Despite their popularity, the analysis of adaptive SGD lags behind that of non-adaptive methods in this setting. Specifically, all prior works rely on some subset of the following assumptions: (i) uniformly-bounded gradient norms, (ii) uniformly-bounded stochastic gradient variance (or even noise support), (iii) conditional independence between the step size and stochastic gradient. In this work, we show that AdaGrad-Norm exhibits an order-optimal convergence rate of $\tilde{O}(1/\sqrt{T})$ after $T$ iterations under the same assumptions as optimally-tuned non-adaptive SGD (unbounded gradient norms and affine noise variance…
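To make "step sizes change based on observed stochastic gradients" concrete, here is a minimal sketch of the AdaGrad-Norm update in its commonly stated form: a single scalar accumulator of squared gradient norms sets the step size. This page does not spell out the update rule, so the form below, the function names (`adagrad_norm`, `noisy_quadratic_grad`), the defaults for `eta` and `b0`, and the toy objective are all assumptions for illustration, not the authors' code.

```python
import numpy as np


def adagrad_norm(grad_fn, x0, eta=1.0, b0=1e-2, num_steps=1000):
    """Sketch of AdaGrad-Norm (assumed standard form).

    b_t^2   = b_{t-1}^2 + ||g_t||^2
    x_{t+1} = x_t - (eta / b_t) * g_t
    """
    x = np.asarray(x0, dtype=float).copy()
    b_sq = b0 ** 2  # running sum of squared gradient norms
    for _ in range(num_steps):
        g = np.asarray(grad_fn(x), dtype=float)
        b_sq += float(np.dot(g, g))
        # The step size depends on the current gradient through b_sq,
        # so it is not independent of g.
        x = x - (eta / np.sqrt(b_sq)) * g
    return x


def noisy_quadratic_grad(x, rng, sigma0=0.1, sigma1=0.5):
    """Hypothetical stochastic gradient of F(x) = 0.5 * ||x||^2.

    The noise satisfies E||noise||^2 = sigma0^2 + sigma1^2 * ||grad F(x)||^2,
    i.e. its variance grows affinely with the squared gradient norm.
    """
    true_grad = x
    scale = np.sqrt(sigma0 ** 2 + sigma1 ** 2 * np.dot(true_grad, true_grad))
    noise = rng.standard_normal(x.shape) * scale / np.sqrt(x.size)
    return true_grad + noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_final = adagrad_norm(lambda x: noisy_quadratic_grad(x, rng), x0=[3.0, 4.0])
    print("final gradient norm:", np.linalg.norm(x_final))
```

The toy objective is convex and is only meant to exercise the update; the paper's guarantees concern non-convex, smooth objectives. Note that neither `eta` nor `b0` is chosen using knowledge of the smoothness constant or the noise level, which is the sense in which the method is untuned.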
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques
Methods: Stochastic Gradient Descent · Network On Network
