On the Hyperparameters in Stochastic Gradient Descent with Momentum
Bin Shi

TL;DR
This paper provides a theoretical analysis of stochastic gradient descent with momentum, highlighting the combined influence of learning rate and momentum coefficient on convergence, and explaining why momentum enhances convergence speed and robustness.
Contribution
It introduces a hyperparameters-dependent SDE framework to analyze SGD with momentum, deriving explicit convergence rates and offering new insights into the role of momentum in optimization.
Findings
The convergence rate depends on both learning rate and momentum coefficient.
Momentum improves convergence speed and robustness over standard SGD.
Nesterov momentum behaves similarly to standard momentum under noise.
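To make the two hyperparameters concrete, here is a minimal sketch of the standard heavy-ball form of SGD with momentum. It is not taken from the paper: the function name `sgd_momentum`, the additive Gaussian gradient noise, and the quadratic test objective are all assumptions chosen for illustration. The learning rate `lr` scales each gradient step, and the momentum coefficient `momentum` controls how much of the previous displacement is carried forward.

```python
import numpy as np

def sgd_momentum(grad, x0, lr, momentum, noise_std, steps, seed=0):
    """Heavy-ball SGD sketch (illustrative; names and noise model are assumptions).

    grad      : callable returning the exact gradient at x
    lr        : learning rate
    momentum  : momentum coefficient in [0, 1)
    noise_std : std of Gaussian noise added to each gradient (stand-in for minibatch noise)
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)  # velocity: the previous displacement x_k - x_{k-1}
    for _ in range(steps):
        # Noisy gradient estimate at the current iterate.
        g = grad(x) + noise_std * rng.standard_normal(x.shape)
        # Carry over a fraction of the last displacement, then take a gradient step.
        v = momentum * v - lr * g
        x = x + v
    return x

# Hypothetical test problem: f(x) = 0.5 * x^T A x with A = diag(1, 10).
A = np.diag([1.0, 10.0])
quad_grad = lambda x: A @ x

x_plain = sgd_momentum(quad_grad, [1.0, 1.0], lr=0.05, momentum=0.0, noise_std=0.1, steps=200)
x_mom = sgd_momentum(quad_grad, [1.0, 1.0], lr=0.05, momentum=0.9, noise_std=0.1, steps=200)
```

Setting `momentum=0.0` recovers plain SGD, so the same function can be used to compare the two methods at a fixed learning rate. With noise present, the iterates settle into a neighborhood of the minimizer rather than converging exactly, and the size of that neighborhood depends on both hyperparameters.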
Abstract
Following the same routine as [SSJ20], this paper continues the theoretical analysis with stochastic gradient descent with momentum (SGD with momentum). In contrast, for SGD with momentum we show that it is the two hyperparameters together, the learning rate and the momentum coefficient, that play the significant role in the linear rate of convergence in non-convex optimization. Our analysis is based on a hyperparameters-dependent stochastic differential equation (hp-dependent SDE) that serves as a continuous surrogate for SGD with momentum. Similarly, we establish linear convergence for the continuous-time formulation of SGD with momentum and obtain an explicit expression for the optimal linear rate by analyzing the spectrum of the Kramers-Fokker-Planck operator. By comparison, we demonstrate how the optimal linear rate of convergence and the final gap…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Markov Chains and Monte Carlo Methods
Methods: SGD with Momentum · Stochastic Gradient Descent
