Train longer, generalize better: closing the generalization gap in large batch training of neural networks
Elad Hoffer, Itay Hubara, Daniel Soudry

TL;DR
This paper investigates the causes of the generalization gap in large-batch neural network training. It proposes an adapted training regime that closes the gap, and a normalization technique, Ghost Batch Normalization, that reduces the gap without increasing the number of updates.
Contribution
It introduces a "random walk on random landscape" statistical model of the initial high learning rate training phase, shows empirically that the gap stems from the number of updates rather than the batch size, and proposes Ghost Batch Normalization to reduce the gap.
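Ghost Batch Normalization is commonly described as computing normalization statistics over small virtual ("ghost") batches inside each large batch, rather than over the whole batch at once. The NumPy sketch below illustrates that idea under stated assumptions: the function name `ghost_batch_norm`, the `ghost_size` parameter, and the 2-D input shape are hypothetical, and the learnable scale and shift and the running statistics used at inference are omitted. It is an illustration, not the authors' implementation.

```python
import numpy as np


def ghost_batch_norm(x, ghost_size, eps=1e-5):
    """Normalize each ghost batch of x (shape [batch, features]) separately.

    Assumption: the batch size is a multiple of ghost_size.
    """
    batch, features = x.shape
    if batch % ghost_size != 0:
        raise ValueError("batch size must be a multiple of ghost_size")

    # View the large batch as a stack of small virtual batches.
    chunks = x.reshape(batch // ghost_size, ghost_size, features)

    # Per-ghost-batch, per-feature statistics.
    mean = chunks.mean(axis=1, keepdims=True)
    var = chunks.var(axis=1, keepdims=True)

    normed = (chunks - mean) / np.sqrt(var + eps)
    return normed.reshape(batch, features)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    large_batch = rng.normal(loc=3.0, scale=2.0, size=(4096, 16))
    out = ghost_batch_norm(large_batch, ghost_size=128)
    print(out.shape)
```

Each ghost batch sees the noisier statistics of a small batch, even though the gradient is still computed over the full large batch.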
Findings
The generalization gap is primarily due to insufficient training updates.
Adapting training regimes can eliminate the gap.
Ghost Batch Normalization significantly reduces the gap.
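If the gap comes from too few updates, one way to adapt a training regime is to lengthen the large-batch run until it performs as many weight updates as the small-batch baseline. The sketch below only does that arithmetic. The function names and the example dataset size are hypothetical, and it assumes the last partial batch of each epoch still counts as one update. It does not cover learning-rate adjustments.

```python
import math


def updates_per_run(num_samples, batch_size, epochs):
    """Total weight updates for a run, counting a partial final batch as one update."""
    return math.ceil(num_samples / batch_size) * epochs


def epochs_to_match_updates(num_samples, small_batch, large_batch, small_epochs):
    """Epochs a large-batch run needs to reach the small-batch update count."""
    target = updates_per_run(num_samples, small_batch, small_epochs)
    updates_per_epoch = math.ceil(num_samples / large_batch)
    return math.ceil(target / updates_per_epoch)


if __name__ == "__main__":
    # Hypothetical numbers: 50,000 training samples, baseline of 100 epochs at batch size 128.
    print(updates_per_run(50_000, 128, 100))
    print(epochs_to_match_updates(50_000, 128, 4096, 100))
```

The epoch count grows roughly in proportion to the batch-size ratio, which is the "train longer" cost implied by the title.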
Abstract
Background: Deep learning models are typically trained using stochastic gradient descent or one of its variants. These methods update the weights using their gradient, estimated from a small fraction of the training data. It has been observed that when using large batch sizes there is a persistent degradation in generalization performance - known as the "generalization gap" phenomenon. Identifying the origin of this gap and closing it had remained an open problem.
Contributions: We examine the initial high learning rate training phase. We find that the weight distance from its initialization grows logarithmically with the number of weight updates. We therefore propose a "random walk on random landscape" statistical model which is known to exhibit similar "ultra-slow" diffusion behavior. Following this hypothesis we conducted experiments to show empirically that the "generalization gap"…
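The logarithmic growth mentioned in the abstract can be written compactly and contrasted with an ordinary random walk, whose typical displacement grows like the square root of the number of steps. The symbols are notation assumed here for illustration: $w_t$ is the weight vector after $t$ updates and $w_0$ is the initialization.

```latex
\| w_t - w_0 \| \sim \log t
\qquad \text{versus} \qquad
\| w_t - w_0 \| \sim \sqrt{t} \quad \text{(standard random walk)}
```

Because $\log t$ grows far more slowly than $\sqrt{t}$, the weights need many more updates to travel a given distance from initialization, which is the sense in which the diffusion is "ultra-slow".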
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Gaussian Processes and Bayesian Inference · Generative Adversarial Networks and Image Synthesis
