Stochastic Training is Not Necessary for Generalization

Jonas Geiping; Micah Goldblum; Phillip E. Pope; Michael Moeller; Tom; Goldstein

arXiv:2109.14119·cs.LG·April 21, 2022·6 cites

Stochastic Training is Not Necessary for Generalization

Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom, Goldstein

PDF

Open Access 1 Repo 2 Videos

TL;DR

This paper shows that full-batch training without stochasticity can match SGD's generalization performance in neural networks, challenging the belief that stochastic regularization is essential.

Contribution

It demonstrates that explicit regularization can replace implicit regularization of SGD, and that full-batch training can achieve similar results with proper tuning.

Findings

01

Full-batch training achieves comparable performance to SGD on CIFAR-10.

02

Implicit regularization of SGD can be replaced with explicit regularization.

03

Perceived difficulty of full-batch training may be due to optimization properties.

Abstract

It is widely believed that the implicit regularization of SGD is fundamental to the impressive generalization behavior we observe in neural networks. In this work, we demonstrate that non-stochastic full-batch training can achieve comparably strong performance to SGD on CIFAR-10 using modern architectures. To this end, we show that the implicit regularization of SGD can be completely replaced with explicit regularization even when comparing against a strong and well-researched baseline. Our observations indicate that the perceived difficulty of full-batch training may be the result of its optimization properties and the disproportionate time and effort spent by the ML community tuning optimizers and hyperparameters for small-batch training.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

jonasgeiping/fullbatchtraining
pytorchOfficial

Videos

[ML News] DeepMind does Nowcasting | The Guardian's shady reporting | AI finishes Beethoven's 10th· youtube

Stochastic Training is Not Necessary for Generalization· slideslive

Taxonomy

TopicsAdvanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning

MethodsStochastic Gradient Descent