On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Nitish Shirish Keskar; Dheevatsa Mudigere; Jorge Nocedal; Mikhail Smelyanskiy; Ping Tak Peter Tang

arXiv:1609.04836·cs.LG·February 13, 2017·579 cites

TL;DR

This paper investigates why large-batch training in deep learning often results in poorer generalization, linking it to convergence to sharp minima, and explores strategies to mitigate this issue.

Contribution

It provides numerical evidence that large-batch methods tend to find sharp minima, explaining the generalization gap, and discusses strategies to promote flat minima in large-batch training.

Findings

01. Large-batch methods tend to converge to sharp minima, which are associated with poorer generalization.

02. Small-batch methods tend to find flat minima, which are associated with better generalization.

03. Gradient noise plays a key role in enabling convergence to flat minima.
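
The link between sharpness and generalization in the findings above can be illustrated with a toy sketch. It is not taken from the paper: the two quadratic losses, the `shift` between training and testing functions, and the `sharpness` helper are all hypothetical, and the helper is only a crude 1-D stand-in for a neighborhood-based sharpness measure. The idea is that the same small displacement of the minimizer costs far more in a narrow basin than in a wide one.

```python
def flat_loss(w):
    # Wide quadratic basin: low curvature (0.1) around its minimizer at w = 0.
    return 0.5 * 0.1 * w ** 2


def sharp_loss(w):
    # Narrow quadratic basin: high curvature (10.0) around its minimizer at w = 0.
    return 0.5 * 10.0 * w ** 2


def sharpness(loss, w_star, eps=0.1, steps=201):
    """Largest loss increase on a grid over [w_star - eps, w_star + eps].

    Hypothetical helper: a crude 1-D proxy for how quickly the loss
    rises in a small neighborhood of a minimizer.
    """
    base = loss(w_star)
    grid = [w_star - eps + 2 * eps * i / (steps - 1) for i in range(steps)]
    return max(loss(w) for w in grid) - base


# Assumption for illustration: the testing function is the training
# function with its minimizer displaced by `shift`, so the solution found
# on the training function sits `shift` away from the testing optimum.
shift = 0.2
flat_gap = flat_loss(shift) - flat_loss(0.0)
sharp_gap = sharp_loss(shift) - sharp_loss(0.0)

print("sharpness (flat, sharp):", sharpness(flat_loss, 0.0), sharpness(sharp_loss, 0.0))
print("loss increase under shift (flat, sharp):", flat_gap, sharp_gap)
```

In this toy setting both the sharpness proxy and the loss increase under the shift scale with the curvature, so the sharp basin is penalized 100 times more than the flat one.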

Abstract

The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$ - $512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the…
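
As a rough illustration of the small-batch regime described in the abstract, the sketch below estimates the gradient of a one-parameter least-squares loss from random mini-batches and measures how much the estimate fluctuates. The synthetic data, the model, and the helper names (`make_data`, `batch_gradient`, `gradient_std`) are assumptions made for this example, not the paper's experimental setup.

```python
import random
import statistics


def make_data(n, seed=0):
    # Synthetic 1-D regression data: y = 3x + noise (illustrative only).
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


def batch_gradient(w, xs, ys, idx):
    # Gradient of the mean of 0.5 * (w * x - y)^2 over the sampled indices.
    return sum((w * xs[i] - ys[i]) * xs[i] for i in idx) / len(idx)


def gradient_std(w, xs, ys, batch_size, trials=500, seed=1):
    # Standard deviation of the mini-batch gradient estimate across
    # repeated random batches drawn without replacement.
    rng = random.Random(seed)
    n = len(xs)
    grads = [
        batch_gradient(w, xs, ys, rng.sample(range(n), batch_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(grads)


xs, ys = make_data(2000)
for batch_size in (32, 512, 2000):
    print(batch_size, gradient_std(0.0, xs, ys, batch_size))
```

The spread of the estimate shrinks as the batch grows and vanishes, up to floating-point rounding, when the batch covers the whole dataset. That is the sense in which a large batch carries less gradient noise than a small one.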

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms · Domain Adaptation and Few-Shot Learning