Momentum-Based Variance Reduction in Non-Convex SGD
Ashok Cutkosky, Francesco Orabona

TL;DR
This paper introduces STORM, a momentum-based variance reduction algorithm for non-convex stochastic optimization that needs no large batches and less hyperparameter tuning, while achieving the optimal convergence rate.
Contribution
The paper proposes STORM, a variance reduction method for non-convex optimization that uses a variant of momentum in place of large batches, which simplifies implementation and reduces hyperparameter tuning.
Findings
Achieves the optimal convergence rate without large batches or knowledge of the gradient variance.
Uses adaptive learning rates for simpler implementation.
Matches the best known theoretical convergence bounds.
Abstract
Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the convergence rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and willingness to use excessively large "mega-batches" in order to achieve their improved results. We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses, STORM finds a point with in…
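The abstract describes the momentum-based estimator only in words. The following is a minimal sketch of that idea, written from the STORM update as it is commonly stated: each step draws one sample, evaluates the stochastic gradient at both the current and the previous iterate, and uses the difference to correct the running estimate. The update formulas are not given on this page, so treat them as an assumption. The toy loss, the noise model, the constants `k`, `w`, `c`, the clipping of the momentum weight, and the names `stoch_grad` and `storm` are all hypothetical choices for illustration.

```python
import random


def stoch_grad(x, xi):
    """Noisy gradient of the toy non-convex loss F(x) = x^2 / (1 + x^2).

    xi = (mult, add) is one random sample (hypothetical noise model);
    the expectation over xi equals F'(x) = 2x / (1 + x^2)^2.
    """
    mult, add = xi
    return (1.0 + mult) * 2.0 * x / (1.0 + x * x) ** 2 + add


def storm(x0, steps, k=0.1, w=0.1, c=100.0, seed=0):
    """Sketch of a STORM-style loop on a scalar problem (one sample per step)."""
    rng = random.Random(seed)

    def sample():
        # One fresh sample per step: zero-mean multiplicative and additive noise.
        return (rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1))

    x = x0
    d = stoch_grad(x, sample())   # initial estimate is a plain stochastic gradient
    sum_sq = d * d                # running sum of squared gradient magnitudes

    for _ in range(steps):
        # Adaptive learning rate: shrinks as observed gradients accumulate.
        eta = k / (w + sum_sq) ** (1.0 / 3.0)
        x_prev, x = x, x - eta * d

        # Momentum weight tied to the step size; clipping to 1 is a safeguard
        # added in this sketch, not something stated on this page.
        a = min(1.0, c * eta * eta)

        # Evaluate the SAME sample at the new and the old iterate.
        xi = sample()
        g = stoch_grad(x, xi)
        g_prev = stoch_grad(x_prev, xi)

        # Momentum plus correction term: this is the variance-reduction step.
        d = g + (1.0 - a) * (d - g_prev)
        sum_sq += g * g

    return x


if __name__ == "__main__":
    print(storm(x0=2.0, steps=2000))
```

With `a = 1` the update reduces to plain SGD, and with the correction term `g_prev` removed it reduces to ordinary momentum, so the sketch can be used to compare the three on the same toy problem.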
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Markov Chains and Monte Carlo Methods
