Momentum-Based Variance Reduction in Non-Convex SGD

Ashok Cutkosky; Francesco Orabona

arXiv:1905.10018·cs.LG·April 23, 2020·31 cites

TL;DR

This paper introduces STORM, a momentum-based variance reduction algorithm for non-convex stochastic gradient descent. It requires no large batches, uses adaptive learning rates to reduce hyperparameter tuning, and achieves the optimal convergence rate.

Contribution

The paper proposes STORM, a variance reduction method that uses a variant of momentum in place of large batches, which simplifies implementation and reduces hyperparameter tuning in non-convex optimization.

Findings

01

Achieves the optimal convergence rate without large batches and without knowledge of the gradient variance.

02

Uses adaptive learning rates for simpler implementation.

03

Matches the best known theoretical convergence bounds.

Abstract

Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the convergence rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and willingness to use excessively large "mega-batches" in order to achieve their improved results. We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses $F$, STORM finds a point $x$ with $\mathbb{E}[\|\nabla F(x)\|] \leq O(1/\sqrt{T} + \sigma^{1/3}/T^{1/3})$ in…
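The abstract describes the mechanism only in words: a momentum-like gradient estimate that is corrected using the same sample evaluated at two consecutive iterates, combined with an adaptive learning rate. The following is a minimal Python sketch of a STORM-style update under stated assumptions, not the authors' reference implementation. The toy objective $F(x) = x^2/(1+x^2)$, the additive Gaussian noise model, the constants `k`, `w`, `c`, the clipping of the momentum weight to 1, and the function names `grad` and `storm` are all illustrative choices that do not come from the text above.

```python
import math
import random


def grad(x, xi, sigma):
    # Noisy gradient of the toy non-convex objective F(x) = x^2 / (1 + x^2).
    # The additive noise sigma * xi is an assumption made for illustration.
    return 2.0 * x / (1.0 + x * x) ** 2 + sigma * xi


def storm(x0, steps, sigma=0.1, k=0.5, w=1.0, c=1.0, seed=0):
    """STORM-style loop on a scalar problem (hypothetical constants k, w, c)."""
    rng = random.Random(seed)
    x = x0
    xi = rng.gauss(0.0, 1.0)
    d = grad(x, xi, sigma)      # first estimate is a plain stochastic gradient
    sum_g2 = d * d              # running sum of squared gradient norms
    for _ in range(steps):
        # Adaptive step size: shrinks as observed gradient magnitudes accumulate.
        eta = k / (w + sum_g2) ** (1.0 / 3.0)
        x_prev, x = x, x - eta * d
        # Momentum weight tied to the step size; clipped to 1 as a safeguard
        # (the clipping is an assumption of this sketch).
        a = min(1.0, c * eta * eta)
        # One fresh sample, evaluated at BOTH the new and the previous point.
        xi = rng.gauss(0.0, 1.0)
        g_new = grad(x, xi, sigma)
        g_old = grad(x_prev, xi, sigma)
        # Momentum term plus a correction that cancels stale estimation error.
        d = g_new + (1.0 - a) * (d - g_old)
        sum_g2 += g_new * g_new
    return x


if __name__ == "__main__":
    x_final = storm(x0=1.0, steps=2000)
    print("final iterate:", x_final)
    print("true gradient magnitude:", abs(grad(x_final, 0.0, 0.0)))
```

Each iteration uses a single sample and no batch, at the cost of two gradient evaluations per step. With `a = 1` the update reduces to plain SGD, and smaller values of `a` carry over more of the previous estimate.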

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Markov Chains and Monte Carlo Methods