Momentum via Primal Averaging: Theoretical Insights and Learning Rate Schedules for Non-Convex Optimization
Aaron Defazio

TL;DR
This paper provides a rigorous theoretical analysis of momentum methods in non-convex optimization, offering insights into their performance and effective learning rate schedules for training deep neural networks.
Contribution
It develops a tighter Lyapunov analysis of SGD with momentum using the stochastic primal averaging form, revealing when momentum may improve on SGD and which hyper-parameter schedules work and why.
Findings
Tighter theoretical bounds for SGD+M in non-convex settings
Insights into when momentum outperforms standard SGD
Guidance on hyper-parameter scheduling for momentum methods
Abstract
Momentum methods are now used pervasively within the machine learning community for training non-convex models such as deep neural networks. Empirically, they outperform traditional stochastic gradient descent (SGD) approaches. In this work we develop a Lyapunov analysis of SGD with momentum (SGD+M) by utilizing an equivalent rewriting of the method known as the stochastic primal averaging (SPA) form. This analysis is much tighter than previous theory in the non-convex case, and due to this we are able to give precise insights into when SGD+M may outperform SGD, and what hyper-parameter schedules will work and why.
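The equivalence between SGD+M and the SPA form mentioned in the abstract can be checked numerically. The sketch below is illustrative only: this page does not state the update rules, so the constant-parameter forms used here (momentum buffer m = beta*m + g, x = x - alpha*m; averaging form z = z - eta*g, x = (1 - c)*x + c*z) and the mapping c = 1 - beta, eta = alpha / (1 - beta) are assumptions based on the commonly used formulation. The toy objective, noise model, and all function names are hypothetical.

```python
import math
import random

def grad(x):
    # Gradient of a toy non-convex objective f(x) = x^2 + 3*sin(x)^2
    # (an illustrative choice, not taken from the paper).
    return 2.0 * x + 3.0 * math.sin(2.0 * x)

def sgd_momentum(x0, alpha, beta, noise):
    # Assumed SGD+M form: m <- beta*m + g, x <- x - alpha*m, with m starting at 0.
    x, m = x0, 0.0
    for xi in noise:
        g = grad(x) + xi          # noisy gradient evaluated at x
        m = beta * m + g
        x = x - alpha * m
    return x

def spa(x0, eta, c, noise):
    # Assumed SPA form: z <- z - eta*g, x <- (1 - c)*x + c*z, with z starting at x0.
    x, z = x0, x0
    for xi in noise:
        g = grad(x) + xi          # gradient is still evaluated at x, not z
        z = z - eta * g
        x = (1.0 - c) * x + c * z
    return x

random.seed(0)
noise = [random.gauss(0.0, 0.1) for _ in range(200)]  # shared noise sequence

alpha, beta = 0.01, 0.9
c = 1.0 - beta        # assumed mapping between the two parameterizations
eta = alpha / c

x_mom = sgd_momentum(1.5, alpha, beta, noise)
x_spa = spa(1.5, eta, c, noise)
gap = abs(x_mom - x_spa)
print("SGD+M iterate:", x_mom)
print("SPA iterate:  ", x_spa)
print("absolute gap: ", gap)
```

Under these assumptions the two runs consume the same noise sequence and should agree up to floating-point rounding, which is what makes it possible to analyze momentum through the averaging form instead of the momentum buffer.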
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and Algorithms
Methods: SGD with Momentum · Stochastic Gradient Descent
