Stochastic Training is Not Necessary for Generalization
Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom Goldstein

TL;DR
This paper shows that non-stochastic full-batch training can match SGD's generalization performance on CIFAR-10 with modern architectures, challenging the belief that stochastic regularization is essential.
Contribution
The paper demonstrates that explicit regularization can completely replace the implicit regularization of SGD, even against a strong and well-researched baseline, and that full-batch training can reach comparable results with proper tuning.
Findings
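Non-stochastic full-batch training achieves performance comparable to SGD on CIFAR-10 using modern architectures.
The implicit regularization of SGD can be completely replaced with explicit regularization.
The perceived difficulty of full-batch training may stem from its optimization properties and from the disproportionate time and effort the ML community has spent tuning optimizers and hyperparameters for small-batch training.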
Abstract
It is widely believed that the implicit regularization of SGD is fundamental to the impressive generalization behavior we observe in neural networks. In this work, we demonstrate that non-stochastic full-batch training can achieve comparably strong performance to SGD on CIFAR-10 using modern architectures. To this end, we show that the implicit regularization of SGD can be completely replaced with explicit regularization even when comparing against a strong and well-researched baseline. Our observations indicate that the perceived difficulty of full-batch training may be the result of its optimization properties and the disproportionate time and effort spent by the ML community tuning optimizers and hyperparameters for small-batch training.
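As a rough illustration of the idea in the abstract, the sketch below runs deterministic full-batch gradient descent on a toy logistic regression problem and adds an explicit regularizer to the objective. The choice of regularizer (a penalty on the squared norm of the full-batch gradient, differentiated with a finite-difference approximation), the toy data, and all names and hyperparameter values (`loss_and_grad`, `full_batch_gd`, `lr`, `lam`, `eps`) are assumptions made for illustration. The text above does not specify the regularizer or its settings, and this is not the authors' CIFAR-10 setup.

```python
import numpy as np


def loss_and_grad(w, X, y):
    """Mean logistic loss over the WHOLE dataset and its gradient.

    Labels y are expected in {-1, +1}.
    """
    margins = y * (X @ w)
    # log(1 + exp(-margin)), computed in a numerically stable way
    loss = np.mean(np.logaddexp(0.0, -margins))
    # sigmoid(-margin) written with tanh to avoid overflow
    sig_neg = 0.5 * (1.0 - np.tanh(0.5 * margins))
    grad = X.T @ (-y * sig_neg) / len(y)
    return loss, grad


def full_batch_gd(X, y, lr=0.5, lam=0.1, steps=200, eps=1e-3):
    """Full-batch gradient descent with an explicit gradient-norm penalty.

    Objective (assumed for illustration): L(w) + (lam / 2) * ||grad L(w)||^2.
    The penalty's gradient is lam * H @ g, approximated here by a
    finite difference of gradients, so no Hessian is ever formed.
    """
    w = np.zeros(X.shape[1])
    history = []
    for _ in range(steps):
        loss, g = loss_and_grad(w, X, y)           # uses every example
        _, g_shift = loss_and_grad(w + eps * g, X, y)
        hvp = (g_shift - g) / eps                  # approximates H @ g
        w = w - lr * (g + lam * hvp)
        history.append(loss)
    return w, history


# Toy data (hypothetical): randomness is used only to build the dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 1.5])
y = np.where(X @ w_true + 0.1 * rng.normal(size=200) > 0, 1.0, -1.0)

w, history = full_batch_gd(X, y)
print(f"initial loss {history[0]:.4f}, final loss {history[-1]:.4f}")
```

Every update uses the gradient of the entire dataset, so the training loop contains no sampling noise and repeated runs on the same data give identical weights. Any regularization effect therefore comes from the explicit penalty term rather than from mini-batch stochasticity.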
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning
Methods: Stochastic Gradient Descent
