A Bayesian Perspective on Generalization and Stochastic Gradient Descent
Samuel L. Smith, Quoc V. Le

TL;DR
This paper investigates why stochastic gradient descent finds minima that generalize well, revealing that the Bayesian evidence favors flat minima and that an optimal batch size exists proportional to learning rate and dataset size.
Contribution
It introduces a Bayesian explanation for generalization in neural networks and derives a formula for the optimal batch size based on noise scale, validated empirically.
Findings
Bayesian evidence favors flat minima for better generalization.
An optimal batch size proportional to learning rate and dataset size maximizes test accuracy.
Empirical validation confirms the theoretical predictions.
Abstract
We consider two questions at the heart of machine learning; how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? Our work responds to Zhang et al. (2016), who showed deep neural networks can easily memorize randomly labeled training data, despite generalizing well on real labels of the same inputs. We show that the same phenomenon occurs in small linear models. These observations are explained by the Bayesian evidence, which penalizes sharp minima but is invariant to model parameterization. We also demonstrate that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy. We propose that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large. Interpreting stochastic gradient descent as a stochastic…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsStochastic Gradient Optimization Techniques · Machine Learning and Algorithms · Domain Adaptation and Few-Shot Learning
