The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance

Matthew Faw; Isidoros Tziotis; Constantine Caramanis; Aryan Mokhtari; Sanjay Shakkottai; Rachel Ward

arXiv:2202.05791 · stat.ML · July 26, 2022 · 1 cites

Open Access

TL;DR

This paper proves that AdaGrad-Norm, an adaptive stochastic gradient method, achieves near-optimal convergence rates in non-convex optimization without parameter tuning, even with unbounded gradients and affine noise variance.

Contribution

It demonstrates that AdaGrad-Norm attains order-optimal convergence rates under broad conditions, removing the need for tuning and relaxing previous assumptions.

Findings

01

AdaGrad-Norm achieves a $\tilde{O}(1/\sqrt{T})$ convergence rate.

02

The method works with unbounded gradients and affine noise variance.

03

No parameter tuning is required for convergence guarantees.
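
For reference, the two ingredients named in these findings can be written out as below. This is a sketch in assumed notation ($w_t$ for the iterate, $g_t$ for the stochastic gradient, $b_t$ for the accumulator, $\eta$, $\sigma_0$, $\sigma_1$ for constants); the paper's own symbols and indexing may differ.

```latex
% AdaGrad-Norm: a single scalar step size driven by accumulated gradient norms
b_t^2 = b_{t-1}^2 + \|g_t\|^2, \qquad w_{t+1} = w_t - \frac{\eta}{b_t}\, g_t

% Affine noise variance: the noise level may grow with the true gradient norm
\mathbb{E}\big[\|g_t - \nabla F(w_t)\|^2 \mid w_t\big] \le \sigma_0^2 + \sigma_1^2\, \|\nabla F(w_t)\|^2
```

Setting $\sigma_1 = 0$ recovers the uniformly bounded variance assumption. Because $b_t$ includes the current gradient $g_t$, the step size and the stochastic gradient are not conditionally independent.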

Abstract

We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives. Despite their popularity, the analysis of adaptive SGD lags behind that of non-adaptive methods in this setting. Specifically, all prior works rely on some subset of the following assumptions: (i) uniformly-bounded gradient norms, (ii) uniformly-bounded stochastic gradient variance (or even noise support), (iii) conditional independence between the step size and stochastic gradient. In this work, we show that AdaGrad-Norm exhibits an order-optimal convergence rate of $\mathcal{O}\left(\frac{\mathrm{poly}\log(T)}{\sqrt{T}}\right)$ after $T$ iterations under the same assumptions as optimally-tuned non-adaptive SGD (unbounded gradient norms and affine noise variance…
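
To make the step-size rule in the abstract concrete, here is a minimal sketch of AdaGrad-Norm in NumPy. It is an illustration under stated assumptions, not the authors' code: the function names (`adagrad_norm`, `true_grad`, `noisy_grad`), the toy objective, and the constants `eta`, `b0`, `sigma0`, `sigma1` are all hypothetical choices made for this example.

```python
import numpy as np

def adagrad_norm(grad_fn, w0, eta=1.0, b0=0.1, num_steps=1000, rng=None):
    """AdaGrad-Norm sketch: one scalar step size eta / b_t for all coordinates."""
    rng = np.random.default_rng(0) if rng is None else rng
    w = np.asarray(w0, dtype=float).copy()
    b_sq = b0 ** 2
    for _ in range(num_steps):
        g = grad_fn(w, rng)
        b_sq += float(g @ g)                # accumulate squared gradient norms
        w = w - (eta / np.sqrt(b_sq)) * g   # step size depends on the current g
    return w, np.sqrt(b_sq)

def true_grad(w):
    # Gradient of the assumed toy objective F(w) = sum(0.5 * w**2 + 1.5 * cos(w)):
    # smooth, non-convex, and with gradient norm unbounded as |w| grows.
    return w - 1.5 * np.sin(w)

def noisy_grad(w, rng, sigma0=0.1, sigma1=0.5):
    # Stochastic gradient whose noise satisfies the affine variance condition
    # with equality: E||noise||^2 = sigma0^2 + sigma1^2 * ||grad F(w)||^2.
    g = true_grad(w)
    scale = np.sqrt((sigma0 ** 2 + sigma1 ** 2 * float(g @ g)) / w.size)
    return g + scale * rng.standard_normal(w.shape)

if __name__ == "__main__":
    w0 = np.array([3.0, -2.0, 4.0])
    w_final, b_final = adagrad_norm(noisy_grad, w0, num_steps=2000)
    print("initial gradient norm:", np.linalg.norm(true_grad(w0)))
    print("final gradient norm:  ", np.linalg.norm(true_grad(w_final)))
```

Note that `eta` and `b0` are left at arbitrary defaults rather than being set from the smoothness constant or the noise level, which is the sense in which the method is described as self-tuning. The update also uses the accumulator after adding the current gradient, so the step size is correlated with the gradient it multiplies.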

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques

Methods: Stochastic Gradient Descent · Network On Network