On the Generalization Benefit of Noise in Stochastic Gradient Descent

Samuel L. Smith; Erich Elsen; Soham De

arXiv:2006.15081·cs.LG·June 29, 2020·22 cites

TL;DR

This paper demonstrates that noise in stochastic gradient descent can improve neural network generalization, showing small to moderate batch sizes outperform large ones even with equal training iterations, supported by experiments and theory.

Contribution

It provides rigorous experiments and a theoretical framework confirming the generalization benefits of noise in SGD, countering recent skepticism.

Findings

01. Small or moderate batch sizes outperform very large batches on the test set.

02. Noise in SGD enhances generalization.

03. The optimal learning rate schedule varies with the epoch budget.

Abstract

It has long been argued that minibatch stochastic gradient descent can generalize better than large batch gradient descent in deep neural networks. However recent papers have questioned this claim, arguing that this effect is simply a consequence of suboptimal hyperparameter tuning or insufficient compute budgets when the batch size is large. In this paper, we perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set. This occurs even when both models are trained for the same number of iterations and large batches achieve smaller training losses. Our results confirm that the noise in stochastic gradients can enhance generalization. We study how the optimal learning rate schedule changes as the epoch budget grows, and we…
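The "noise in stochastic gradients" that the abstract refers to can be made concrete with a small sketch. The example below is not the paper's experimental setup: it assumes a toy linear regression problem with synthetic data, and the helper names (`minibatch_gradient`, `gradient_noise`) are hypothetical. It only illustrates that minibatch gradients deviate more from the full-batch gradient as the batch size shrinks; it says nothing about test performance.

```python
import numpy as np

def minibatch_gradient(w, X, y, idx):
    """Gradient of 0.5 * mean((X w - y)^2) computed over the rows in idx."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

def gradient_noise(w, X, y, batch_size, n_draws, rng):
    """Mean squared distance between minibatch and full-batch gradients."""
    full = minibatch_gradient(w, X, y, np.arange(len(y)))
    sq_dists = []
    for _ in range(n_draws):
        # Sample a minibatch without replacement, as in one SGD step.
        idx = rng.choice(len(y), size=batch_size, replace=False)
        diff = minibatch_gradient(w, X, y, idx) - full
        sq_dists.append(np.sum(diff ** 2))
    return float(np.mean(sq_dists))

# Synthetic regression data (an assumption made purely for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(4096, 10))
y = X @ rng.normal(size=10) + rng.normal(size=4096)
w = np.zeros(10)

noise = {b: gradient_noise(w, X, y, b, 500, rng) for b in (8, 64, 512)}
for b, v in noise.items():
    print(f"batch size {b:4d}: gradient noise {v:.4f}")
```

In this toy setting the measured noise falls as the batch size grows, which is the quantity that very large batches suppress.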

Videos

On the Generalization Benefit of Noise in Stochastic Gradient Descent · slideslive

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: Stochastic Gradient Descent