Accelerating SGD with momentum for over-parameterized learning
Chaoyue Liu, Mikhail Belkin

TL;DR
This paper introduces MaSS, a modification of Nesterov SGD that adds a compensation term. MaSS converges for the same step sizes as SGD, is proven to accelerate over SGD in the over-parameterized linear setting, and outperforms SGD, Nesterov SGD, and Adam in the reported deep network experiments.
Contribution
The paper proposes MaSS, a new variant of Nesterov SGD with a compensation term, providing convergence guarantees and acceleration in over-parameterized learning.
Findings
MaSS converges with the same step sizes as SGD.
MaSS achieves accelerated convergence rates in the linear setting.
Experimental results show MaSS outperforms SGD, Nesterov SGD, and Adam on deep networks.
Abstract
Nesterov SGD is widely used for training modern neural networks and other machine learning models. Yet, its advantages over SGD have not been theoretically clarified. Indeed, as we show in our paper, both theoretically and empirically, Nesterov SGD with any parameter selection does not in general provide acceleration over ordinary SGD. Furthermore, Nesterov SGD may diverge for step sizes that ensure convergence of ordinary SGD. This is in contrast to the classical results in the deterministic scenario, where the same step size ensures accelerated convergence of Nesterov's method over optimal gradient descent. To address the non-acceleration issue, we introduce a compensation term to Nesterov SGD. The resulting algorithm, which we call MaSS, converges for the same step sizes as SGD. We prove that MaSS obtains an accelerated convergence rate over SGD for any mini-batch size in the…
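To make the "compensation term" concrete, here is a minimal NumPy sketch of a MaSS-style update. The abstract does not spell out the update rule; the form used here (a descent step from u, then Nesterov extrapolation plus eta2 times the same stochastic gradient) is our reading of the algorithm and should be checked against the paper. The names mass_step, minibatch_grad, run, eta1, eta2, and gamma are illustrative, and the least-squares objective is a stand-in chosen only because it is a simple over-parameterized linear problem.

```python
import numpy as np


def mass_step(w, u, grad_u, eta1, eta2, gamma):
    """One MaSS-style update (assumed form, not quoted from the abstract).

    grad_u is a stochastic gradient evaluated at the auxiliary point u.
    eta2 = 0 gives plain Nesterov SGD; eta2 = 0 and gamma = 0 gives SGD.
    """
    w_next = u - eta1 * grad_u                      # descent step from u
    u_next = (1.0 + gamma) * w_next - gamma * w     # Nesterov extrapolation
    u_next = u_next + eta2 * grad_u                 # compensation term
    return w_next, u_next


def minibatch_grad(X, y, u, idx):
    """Gradient of 0.5 * mean squared error on the rows selected by idx."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ u - yb) / len(idx)


def run(X, y, eta1, eta2, gamma, steps=500, batch=4, seed=0):
    """Run the update on a least-squares problem and return the final loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    u = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(n, size=batch, replace=False)
        g = minibatch_grad(X, y, u, idx)
        w, u = mass_step(w, u, g, eta1, eta2, gamma)
    return 0.5 * np.mean((X @ w - y) ** 2)
```

Calling run(X, y, eta1, 0.0, 0.0) gives mini-batch SGD, run(X, y, eta1, 0.0, gamma) gives Nesterov SGD, and a nonzero eta2 switches on the compensation term, so the three methods compared in the paper can be tried side by side on synthetic data with more features than samples. The sketch makes no claim about which hyperparameters reproduce the paper's rates; those come from the paper's analysis, not from this code.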
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning
Methods: Average Pooling · Adam · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling
