When and Why Momentum Accelerates SGD: An Empirical Study
Jingwen Fu, Bohan Wang, Huishuai Zhang, Zhizheng Zhang, Wei Chen, Nanning Zheng

TL;DR
This paper empirically investigates when and why momentum accelerates SGD, revealing that momentum's benefits are linked to effective learning rates and the prevention of abrupt sharpening, especially with larger batch sizes.
Contribution
It introduces a comparison framework based on effective learning rates and uncovers the role of abrupt sharpening in momentum acceleration during SGD training.
Findings
SGDM and SGD perform similarly at low effective learning rates.
SGDM outperforms SGD when effective learning rates exceed a threshold.
Momentum helps prevent abrupt sharpening, improving convergence especially with larger batch sizes.
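The threshold behaviour in the findings above can be illustrated on a toy problem. The sketch below is not the paper's experiment: it uses deterministic gradient descent on a one-dimensional quadratic with fixed curvature, and it assumes the common heavy-ball form of momentum (v = mu * v + g, w = w - lr * v), whose steady-state step size is lr / (1 - mu). The function names and default values are hypothetical. It shows only the classical stability threshold, not the sharpening mechanism the paper studies.

```python
def run_gd(eff_lr, curvature=1.0, steps=200, w0=1.0):
    """Plain gradient descent on f(w) = 0.5 * curvature * w**2; returns final loss."""
    w = w0
    for _ in range(steps):
        w -= eff_lr * curvature * w
    return 0.5 * curvature * w * w


def run_heavy_ball(eff_lr, momentum=0.9, curvature=1.0, steps=200, w0=1.0):
    """Heavy-ball momentum with the raw step chosen to match eff_lr (assumed form)."""
    lr = eff_lr * (1.0 - momentum)  # raw step so that lr / (1 - momentum) == eff_lr
    w, v = w0, 0.0
    for _ in range(steps):
        grad = curvature * w
        v = momentum * v + grad  # accumulate the gradient into the velocity
        w -= lr * v
    return 0.5 * curvature * w * w


# Small matched step: both runs decrease the loss at a comparable rate.
small = (run_gd(0.01), run_heavy_ball(0.01))

# Matched step above 2 / curvature: plain GD diverges, heavy-ball still converges.
large = (run_gd(2.5), run_heavy_ball(2.5))
```

On this quadratic, plain gradient descent is stable only for a step below 2 / curvature, while heavy-ball with coefficient mu tolerates a raw step up to 2 * (1 + mu) / curvature, so at a matched effective step the momentum run stays stable well past the point where the plain run blows up.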
Abstract
Momentum has become a crucial component in deep learning optimizers, necessitating a comprehensive understanding of when and why it accelerates stochastic gradient descent (SGD). To address the question of "when", we establish a meaningful comparison framework that examines the performance of SGD with Momentum (SGDM) under the effective learning rate, a notion unifying the influence of the momentum coefficient and batch size over the learning rate. In the comparison of SGDM and SGD with the same effective learning rate and the same batch size, we observe a consistent pattern: when the effective learning rate is small, SGDM and SGD experience almost the same empirical training losses; when it surpasses a certain threshold, SGDM begins to perform better than SGD. Furthermore, we observe that the advantage of SGDM over SGD becomes more pronounced with a larger batch…
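The abstract describes the effective learning rate only in words. A minimal sketch follows, assuming it takes the form lr / ((1 - momentum) * batch_size). The 1 / (1 - momentum) factor is the standard steady-state step of heavy-ball momentum; the division by batch size is an assumption made here to reflect "unifying the influence of batch size", so the paper itself should be checked for the exact definition. The helper names effective_lr and lr_for_effective are hypothetical.

```python
def effective_lr(lr: float, momentum: float, batch_size: int) -> float:
    """Assumed form of the effective learning rate: lr / ((1 - momentum) * batch_size)."""
    if not 0.0 <= momentum < 1.0:
        raise ValueError("momentum must be in [0, 1)")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return lr / ((1.0 - momentum) * batch_size)


def lr_for_effective(eff_lr: float, momentum: float, batch_size: int) -> float:
    """Inverse of effective_lr: the raw learning rate that yields eff_lr."""
    if not 0.0 <= momentum < 1.0:
        raise ValueError("momentum must be in [0, 1)")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return eff_lr * (1.0 - momentum) * batch_size


# Pick raw learning rates so SGD and SGDM share one effective learning rate
# at the same batch size (illustrative values, not taken from the paper).
batch_size = 128
target = 0.01
sgd_lr = lr_for_effective(target, momentum=0.0, batch_size=batch_size)
sgdm_lr = lr_for_effective(target, momentum=0.9, batch_size=batch_size)
```

Under this assumed form, a fair comparison holds the effective learning rate and the batch size fixed and lets the raw learning rate differ between the two optimizers, rather than giving both the same raw learning rate.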
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Gaussian Processes and Bayesian Inference
Methods: SGD with Momentum · Stochastic Gradient Descent
