Accelerating SGD with momentum for over-parameterized learning

Chaoyue Liu; Mikhail Belkin

arXiv:1810.13395 · cs.LG · September 30, 2019 · 36 cites

TL;DR

This paper introduces MaSS, a modified Nesterov SGD algorithm with a compensation term, which achieves accelerated convergence over standard SGD in over-parameterized neural network training, supported by theoretical analysis and empirical results.

Contribution

The paper proposes MaSS, a new variant of Nesterov SGD with a compensation term, providing convergence guarantees and acceleration in over-parameterized learning.

Findings

01

MaSS converges with the same step sizes as SGD.

02

MaSS achieves accelerated convergence rates in the linear setting.

03

Experimental results show MaSS outperforms SGD, Nesterov SGD, and Adam on deep networks.

Abstract

Nesterov SGD is widely used for training modern neural networks and other machine learning models. Yet, its advantages over SGD have not been theoretically clarified. Indeed, as we show in our paper, both theoretically and empirically, Nesterov SGD with any parameter selection does not in general provide acceleration over ordinary SGD. Furthermore, Nesterov SGD may diverge for step sizes that ensure convergence of ordinary SGD. This is in contrast to the classical results in the deterministic scenario, where the same step size ensures accelerated convergence of Nesterov's method over optimal gradient descent. To address the non-acceleration issue, we introduce a compensation term to Nesterov SGD. The resulting algorithm, which we call MaSS, converges for the same step sizes as SGD. We prove that MaSS obtains an accelerated convergence rate over SGD for any mini-batch size in the…
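
The compensation term described in the abstract can be illustrated with a minimal sketch. The update form below (a descent step from a look-ahead point `u`, followed by a momentum step plus an extra gradient term scaled by `eta2`) is an assumption based on the paper's description and is not spelled out on this page; the names `mass_step`, `run_least_squares`, `eta1`, `eta2`, and `gamma` are hypothetical, and no hyperparameter values from the paper are implied. Setting `eta2 = 0` leaves ordinary Nesterov momentum, and additionally setting `gamma = 0` leaves plain SGD.

```python
import numpy as np


def mass_step(w, u, grad, eta1, eta2, gamma):
    """One MaSS-style update (assumed form, for illustration only).

    w    : current iterate
    u    : look-ahead point where the stochastic gradient was evaluated
    grad : stochastic gradient at u
    """
    # Descent step taken from the look-ahead point.
    w_next = u - eta1 * grad
    # Nesterov momentum step plus the compensation term eta2 * grad.
    u_next = (1 + gamma) * w_next - gamma * w + eta2 * grad
    return w_next, u_next


def run_least_squares(X, y, eta1, eta2, gamma, steps=50, batch=2, seed=0):
    """Mini-batch least squares driven by mass_step (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    u = np.zeros(d)
    for _ in range(steps):
        # Sample a mini-batch without replacement.
        idx = rng.choice(n, size=batch, replace=False)
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean squared error (with a 1/2 factor) at u.
        grad = Xb.T @ (Xb @ u - yb) / batch
        w, u = mass_step(w, u, grad, eta1, eta2, gamma)
    return w
```

A call such as `run_least_squares(np.eye(2, 5), np.array([1.0, -1.0]), eta1=1.0, eta2=0.0, gamma=0.0)` runs the plain-SGD special case on a small over-parameterized problem (two samples, five features), where an interpolating solution exists.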

Code & Models

Repositories

ts66395/MaSS · tf · Official

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning

Methods: Average Pooling · Adam · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling