How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD
Zeyuan Allen-Zhu

TL;DR
This paper introduces two stochastic gradient algorithms, SGD3 for convex objectives and SGD5 for nonconvex objectives, that find points with small gradients faster than previously known SGD variants.
Contribution
The paper presents two algorithms: SGD3, which attains a near-optimal rate for reducing the gradient norm of convex objectives, and SGD5, which finds approximate local minima of nonconvex objectives faster than earlier SGD variants.
Findings
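SGD3 finds a point with gradient norm $\varepsilon$ at rate $\tilde{O}(\varepsilon^{-2})$ for convex objectives, improving on the previous best rate $O(\varepsilon^{-8/3})$ [18].
SGD5 finds an $\varepsilon$-approximate local minimum at rate $\tilde{O}(\varepsilon^{-3.5})$ for nonconvex objectives, improving on the $\tilde{O}(\varepsilon^{-4})$ achieved by earlier SGD variants [6, 15, 33].
SGD5 is no slower than the best known stochastic version of Newton's method in all parameter regimes [30].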
Abstract
Stochastic gradient descent (SGD) gives an optimal convergence rate when minimizing convex stochastic objectives $f(x)$. However, in terms of making the gradients small, the original SGD does not give an optimal rate, even when $f(x)$ is convex. If $f(x)$ is convex, to find a point with gradient norm $\varepsilon$, we design an algorithm SGD3 with a near-optimal rate $\tilde{O}(\varepsilon^{-2})$, improving the best known rate $O(\varepsilon^{-8/3})$ of [18]. If $f(x)$ is nonconvex, to find its $\varepsilon$-approximate local minimum, we design an algorithm SGD5 with rate $\tilde{O}(\varepsilon^{-3.5})$, where previously SGD variants only achieve $\tilde{O}(\varepsilon^{-4})$ [6, 15, 33]. This is no slower than the best known stochastic version of Newton's method in all parameter regimes [30].
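To give a rough sense of scale for the rates stated in the abstract, the comparison below plugs in an illustrative target accuracy of $\varepsilon = 10^{-3}$. This value is an assumption chosen only for the arithmetic, and the calculation ignores constants, logarithmic factors, and any dependence on smoothness or variance parameters, so the numbers indicate orders of magnitude rather than actual iteration counts.

$$
\begin{aligned}
\text{Convex (SGD3 vs. prior):}\quad & \varepsilon^{-2} = 10^{6}, \qquad \varepsilon^{-8/3} = 10^{8}, \qquad \frac{\varepsilon^{-8/3}}{\varepsilon^{-2}} = \varepsilon^{-2/3} = 10^{2} \\
\text{Nonconvex (SGD5 vs. prior):}\quad & \varepsilon^{-3.5} = 10^{10.5} \approx 3.2 \times 10^{10}, \qquad \varepsilon^{-4} = 10^{12}, \qquad \frac{\varepsilon^{-4}}{\varepsilon^{-3.5}} = \varepsilon^{-1/2} = 10^{1.5} \approx 31.6
\end{aligned}
$$

Under these simplifications, the improvement factor is $\varepsilon^{-2/3}$ in the convex case and $\varepsilon^{-1/2}$ in the nonconvex case, so both gaps widen as the target accuracy $\varepsilon$ shrinks.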
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Advanced Image Processing Techniques
Methods: Stochastic Gradient Descent
