Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent
Bao Wang, Tan M. Nguyen, Andrea L. Bertozzi, Richard G. Baraniuk, Stanley J. Osher

TL;DR
This paper introduces Scheduled Restart SGD (SRSGD), a novel optimization scheme that enhances convergence and generalization in deep neural network training by combining NAG-style momentum with periodic resets, outperforming standard SGD.
Contribution
SRSGD is a new NAG-inspired method that stabilizes increasing momentum through scheduled resets, leading to faster convergence and better accuracy in training deep neural networks.
Findings
SRSGD improves convergence speed over standard SGD.
SRSGD achieves lower error rates on ImageNet and CIFAR datasets.
SRSGD requires fewer epochs to reach comparable or better accuracy.
Abstract
Stochastic gradient descent (SGD) with constant momentum and its variants such as Adam are the optimization algorithms of choice for training deep neural networks (DNNs). Since DNN training is incredibly computationally expensive, there is great interest in speeding up the convergence. Nesterov accelerated gradient (NAG) improves the convergence rate of gradient descent (GD) for convex optimization using a specially designed momentum; however, it accumulates error when an inexact gradient is used (such as in SGD), slowing convergence at best and diverging at worst. In this paper, we propose Scheduled Restart SGD (SRSGD), a new NAG-style scheme for training DNNs. SRSGD replaces the constant momentum in SGD by the increasing momentum in NAG but stabilizes the iterations by resetting the momentum to zero according to a schedule. Using a variety of models and benchmarks for image…
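The update the abstract describes (an increasing NAG-style momentum that is reset to zero on a schedule) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the momentum coefficient t / (t + 3), the fixed restart interval, and the names `srsgd`, `momentum_schedule`, `grad_fn`, and `restart_every` are all assumptions introduced here for the example.

```python
import numpy as np


def momentum_schedule(n_steps, restart_every):
    """Momentum coefficient per step: grows within a cycle, drops to 0 at each restart.

    The form t / (t + 3) is an assumed NAG-style schedule, with t counting
    steps since the last restart.
    """
    return [(k % restart_every) / ((k % restart_every) + 3.0) for k in range(n_steps)]


def srsgd(grad_fn, w0, lr=0.1, restart_every=40, n_steps=200):
    """Illustrative scheduled-restart iteration.

    grad_fn is a hypothetical callable returning a (possibly stochastic)
    gradient estimate at w.
    """
    w = np.asarray(w0, dtype=float).copy()
    v_prev = w.copy()
    for mu in momentum_schedule(n_steps, restart_every):
        v = w - lr * grad_fn(w)      # plain gradient step
        w = v + mu * (v - v_prev)    # extrapolate with the current momentum
        v_prev = v
    return w


# Usage on a toy quadratic f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w_final = srsgd(lambda w: w, w0=[1.0, -2.0])
```

Because the coefficient is exactly zero on the first step of every cycle, each restart discards the accumulated momentum and the iteration momentarily reduces to a plain gradient step, which is the stabilizing mechanism the abstract refers to.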
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Adversarial Robustness in Machine Learning
Methods: Nesterov Accelerated Gradient · Adam · Stochastic Gradient Descent
