When and Why Momentum Accelerates SGD: An Empirical Study

Jingwen Fu; Bohan Wang; Huishuai Zhang; Zhizheng Zhang; Wei Chen; Nanning Zheng

arXiv:2306.09000·cs.LG·June 16, 2023·1 cites

Open Access

TL;DR

This paper empirically investigates when and why momentum accelerates SGD, revealing that momentum's benefits are linked to effective learning rates and the prevention of abrupt sharpening, especially with larger batch sizes.

Contribution

It introduces a comparison framework based on effective learning rates and uncovers the role of abrupt sharpening in momentum acceleration during SGD training.
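The text here does not give the formula for the effective learning rate. A minimal sketch of how SGD and SGDM could be matched, assuming the common heavy-ball convention $η_{ef} = η / (1 - μ)$ with the batch size held fixed (the abstract says the paper's notion also accounts for batch size $b$; that dependence is left out here because the comparison keeps $b$ equal). The function names are hypothetical.

```python
def effective_lr(lr: float, momentum: float) -> float:
    """Effective learning rate under the ASSUMED convention eta_ef = eta / (1 - mu).

    Batch size is treated as fixed and therefore does not appear.
    """
    if not 0.0 <= momentum < 1.0:
        raise ValueError("momentum must lie in [0, 1)")
    return lr / (1.0 - momentum)


def matched_sgdm_lr(eta_ef: float, momentum: float) -> float:
    """Learning rate that gives SGDM the same assumed eta_ef as plain SGD."""
    if not 0.0 <= momentum < 1.0:
        raise ValueError("momentum must lie in [0, 1)")
    return eta_ef * (1.0 - momentum)


if __name__ == "__main__":
    # Plain SGD (mu = 0) uses eta_ef directly; SGDM scales it down by (1 - mu).
    target = 0.1
    for mu in (0.0, 0.5, 0.9):
        print(f"mu={mu}: lr={matched_sgdm_lr(target, mu):.4f}")
```

Under this assumption, comparing the two optimizers "at the same effective learning rate" means running SGD with `eta_ef` and SGDM with `eta_ef * (1 - mu)`.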

Findings

01 SGDM and SGD perform similarly at low effective learning rates.

02 SGDM outperforms SGD when effective learning rates exceed a threshold.

03 Momentum helps prevent abrupt sharpening, improving convergence especially with larger batch sizes.
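A toy illustration of how such a threshold can arise, not a reproduction of the paper's experiments: deterministic gradient descent on a one-dimensional quadratic, with and without heavy-ball momentum, at a matched effective learning rate. It assumes the convention $η_{ef} = η / (1 - μ)$ and the update `v = mu * v + grad; x = x - eta * v`; the function name, curvature, and step count are illustrative choices. On a quadratic with curvature $h$, plain gradient descent is stable only for $η h < 2$, while heavy-ball is stable for $η h < 2(1 + μ)$, so momentum tolerates a larger effective learning rate.

```python
def final_loss(eta_ef, momentum, curvature=1.0, steps=200, x0=1.0):
    """Run heavy-ball gradient descent on f(x) = 0.5 * curvature * x**2.

    ASSUMED convention: eta = eta_ef * (1 - mu), v <- mu * v + grad,
    x <- x - eta * v. Setting momentum=0 recovers plain gradient descent.
    Returns the loss after `steps` updates.
    """
    eta = eta_ef * (1.0 - momentum)
    x, v = x0, 0.0
    for _ in range(steps):
        grad = curvature * x          # gradient of the quadratic
        v = momentum * v + grad       # momentum buffer
        x = x - eta * v               # parameter update
    return 0.5 * curvature * x * x


if __name__ == "__main__":
    for eta_ef in (0.1, 3.0):
        gd = final_loss(eta_ef, momentum=0.0)
        hb = final_loss(eta_ef, momentum=0.9)
        print(f"eta_ef={eta_ef}: GD loss={gd:.3e}, momentum loss={hb:.3e}")
```

At the small effective learning rate both runs converge; at the large one, plain gradient descent diverges while the momentum run still converges. The paper's setting is stochastic and non-quadratic, so this only conveys the classical stability intuition.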

Abstract

Momentum has become a crucial component in deep learning optimizers, necessitating a comprehensive understanding of when and why it accelerates stochastic gradient descent (SGD). To address the question of "when", we establish a meaningful comparison framework that examines the performance of SGD with Momentum (SGDM) under the effective learning rate $η_{ef}$, a notion unifying the influence of momentum coefficient $μ$ and batch size $b$ over learning rate $η$. In the comparison of SGDM and SGD with the same effective learning rate and the same batch size, we observe a consistent pattern: when $η_{ef}$ is small, SGDM and SGD experience almost the same empirical training losses; when $η_{ef}$ surpasses a certain threshold, SGDM begins to perform better than SGD. Furthermore, we observe that the advantage of SGDM over SGD becomes more pronounced with a larger batch…

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Gaussian Processes and Bayesian Inference

Methods: SGD with Momentum · Stochastic Gradient Descent