Momentum via Primal Averaging: Theoretical Insights and Learning Rate Schedules for Non-Convex Optimization

Aaron Defazio

arXiv:2010.00406 · cs.LG · June 2, 2021 · 6 cites

Open Access · 1 Repo

TL;DR

This paper provides a rigorous theoretical analysis of momentum methods in non-convex optimization, offering insights into their performance and effective learning rate schedules for training deep neural networks.

Contribution

It develops a tighter Lyapunov analysis of SGD with momentum using the stochastic primal averaging form, revealing conditions for improved performance and optimal hyper-parameter schedules.

Findings

01 Tighter theoretical bounds for SGD+M in non-convex settings

02 Insights into when momentum outperforms standard SGD

03 Guidance on hyper-parameter scheduling for momentum methods

Abstract

Momentum methods are now used pervasively within the machine learning community for training non-convex models such as deep neural networks. Empirically, they outperform traditional stochastic gradient descent (SGD) approaches. In this work we develop a Lyapunov analysis of SGD with momentum (SGD+M) by utilizing an equivalent rewriting of the method known as the stochastic primal averaging (SPA) form. This analysis is much tighter than previous theory in the non-convex case, and because of this we are able to give precise insights into when SGD+M may outperform SGD, and what hyper-parameter schedules will work and why.
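
The equivalence between SGD+M and the SPA form mentioned in the abstract can be checked numerically. The sketch below is illustrative only and is not taken from the paper or its repository. It assumes constant hyper-parameters, the common SGD+M form m ← βm + g, x ← x − αm, and an SPA form z ← z − ηg, x ← (1 − c)x + cz with η = α/(1 − β), c = 1 − β, z₀ = x₀ and m₀ = 0. The toy objective, the noise model, and all function names are hypothetical choices made for this example, and the paper's own notation may differ.

```python
import random


def grad(x, noise):
    # Noisy gradient of the toy objective f(x) = 0.5 * x**2 (illustrative choice).
    return x + noise


def sgd_momentum(x0, alpha, beta, noises):
    # Assumed SGD+M form: m <- beta * m + g ; x <- x - alpha * m, with m_0 = 0.
    x, m = x0, 0.0
    xs = [x]
    for n in noises:
        m = beta * m + grad(x, n)
        x = x - alpha * m
        xs.append(x)
    return xs


def spa(x0, eta, c, noises):
    # Assumed SPA form: z <- z - eta * g ; x <- (1 - c) * x + c * z, with z_0 = x_0.
    # The gradient is evaluated at the averaged point x, not at z.
    x, z = x0, x0
    xs = [x]
    for n in noises:
        z = z - eta * grad(x, n)
        x = (1.0 - c) * x + c * z
        xs.append(x)
    return xs


# Both methods see the same noise sequence so their iterates are comparable.
rng = random.Random(0)
noises = [rng.gauss(0.0, 0.1) for _ in range(50)]

alpha, beta = 0.05, 0.9
xs_m = sgd_momentum(1.0, alpha, beta, noises)
# Assumed constant-parameter mapping: eta = alpha / (1 - beta), c = 1 - beta.
xs_spa = spa(1.0, alpha / (1.0 - beta), 1.0 - beta, noises)

# Largest difference between the two x-sequences; should be at rounding level.
max_gap = max(abs(a - b) for a, b in zip(xs_m, xs_spa))
print(max_gap)
```

Under these assumptions the two iterate sequences agree up to floating-point rounding, which is what makes it possible to analyze momentum through the averaging form.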

Code & Models

Repositories

facebookresearch/madgrad
PyTorch · Official

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and Algorithms

Methods: SGD with Momentum · Stochastic Gradient Descent