On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang

TL;DR
This paper investigates why large-batch training in deep learning often results in poorer generalization, linking it to convergence to sharp minima, and explores strategies to mitigate this issue.
Contribution
It provides numerical evidence that large-batch methods tend to find sharp minima, explaining the generalization gap, and discusses strategies to promote flat minima in large-batch training.
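To make the distinction between sharp and flat minima concrete, the sketch below compares two toy quadratic losses that share the same minimizer but differ in curvature. The function `sharpness_proxy`, the radius `eps`, and the random-sampling search are assumptions made for illustration only; this is a crude stand-in for a sharpness measure, not the metric or procedure used in the paper.

```python
import numpy as np

def sharpness_proxy(loss, x, eps=0.1, n_samples=1000, seed=0):
    """Hypothetical proxy: largest relative loss increase found among
    random perturbations of x drawn from the box [-eps, eps]^d."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    base = loss(x)
    # Sample perturbations uniformly inside the eps-box around x.
    deltas = rng.uniform(-eps, eps, size=(n_samples, x.size))
    worst = max(loss(x + d) for d in deltas)
    # Normalize by (1 + base) so the value is relative to the loss at x.
    return (worst - base) / (1.0 + base)

# Two toy losses with the same minimizer (the origin) but different curvature.
flat_loss = lambda w: 0.5 * 1.0 * float(np.sum(w ** 2))     # low curvature
sharp_loss = lambda w: 0.5 * 100.0 * float(np.sum(w ** 2))  # high curvature

x_star = np.zeros(2)
flat_value = sharpness_proxy(flat_loss, x_star)
sharp_value = sharpness_proxy(sharp_loss, x_star)
print("flat minimum proxy: ", flat_value)
print("sharp minimum proxy:", sharp_value)
```

In this toy setting the high-curvature loss yields the larger value: the same small perturbation of the parameters raises the loss far more near a sharp minimum than near a flat one.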
Findings
Large-batch methods converge to sharp minima associated with poorer generalization.
Small-batch methods tend to find flat minima, leading to better generalization.
Gradient noise plays a key role in enabling convergence to flat minima.
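The role of gradient noise in the findings above can be illustrated with a toy experiment: on a synthetic least-squares problem, the mini-batch gradient deviates from the full-batch gradient by an amount that shrinks as the batch grows. The problem sizes, the function name `minibatch_gradient_noise`, and the choice of batch sizes are all assumptions for this sketch; it does not reproduce the paper's experiments.

```python
import numpy as np

def minibatch_gradient_noise(batch_size, n_trials=500, seed=0):
    """Mean squared distance between mini-batch and full-batch gradients
    of a least-squares loss on synthetic data (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, d = 4096, 10
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + rng.normal(size=n)

    w = np.zeros(d)            # point at which gradients are evaluated
    residual = X @ w - y
    full_grad = X.T @ residual / n

    sq_errors = []
    for _ in range(n_trials):
        # Sample a mini-batch without replacement.
        idx = rng.choice(n, size=batch_size, replace=False)
        grad = X[idx].T @ residual[idx] / batch_size
        sq_errors.append(np.sum((grad - full_grad) ** 2))
    return float(np.mean(sq_errors))

noise_small = minibatch_gradient_noise(32)
noise_large = minibatch_gradient_noise(1024)
print("batch size 32:  ", noise_small)
print("batch size 1024:", noise_large)
```

The small batch produces a much noisier gradient estimate than the large one. The paper's experiments support the view that this kind of noise helps small-batch methods reach flat minimizers.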
Abstract
The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say 32-512 data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms · Domain Adaptation and Few-Shot Learning
