Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He

TL;DR
This paper demonstrates that, with a linear learning-rate scaling rule and a warmup scheme, large minibatch SGD can train ResNet-50 on ImageNet in one hour on 256 GPUs without accuracy loss.
Contribution
The authors introduce a hyper-parameter-free linear scaling rule and a new warmup scheme to enable large minibatch training without accuracy degradation.
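The two ideas named above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the reference learning rate of 0.1 at a minibatch of 256, the linear shape of the ramp, and the function names `scaled_lr` and `warmup_lr` are assumptions introduced here, since this summary does not state them.

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: when the minibatch grows by a factor k,
    multiply the learning rate by the same factor k."""
    return base_lr * batch / base_batch


def warmup_lr(step, warmup_steps, start_lr, target_lr):
    """Ramp linearly from start_lr to target_lr over warmup_steps,
    then hold target_lr. The linear ramp is an assumed shape."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps


# Assumed reference setting (not given in this summary).
BASE_LR = 0.1
BASE_BATCH = 256

# Learning rate for the 8192-image minibatch mentioned in the findings.
target = scaled_lr(BASE_LR, BASE_BATCH, 8192)

# Hypothetical 100-step warmup, sampled at the start, middle, and end.
schedule = [warmup_lr(s, 100, BASE_LR, target) for s in (0, 50, 100)]
print(target, schedule)
```

The warmup starts from the small-minibatch rate so that early training, when the network changes rapidly, is not destabilized by the large scaled rate.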
Findings
Large minibatches up to 8192 images do not reduce accuracy.
The proposed methods achieve roughly 90% scaling efficiency when moving from 8 to 256 GPUs.
ResNet-50 training on ImageNet is completed in one hour.
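To make the scaling-efficiency figure concrete, the sketch below uses one common definition: measured throughput divided by the throughput that perfect linear scaling would give. The definition, the function name `scaling_efficiency`, and the throughput numbers are all hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def scaling_efficiency(base_gpus, base_throughput, gpus, throughput):
    """Ratio of measured throughput to ideal linear-scaling throughput."""
    ideal = base_throughput * gpus / base_gpus
    return throughput / ideal


# Hypothetical throughputs in images per second (not from the paper).
base_gpus, base_tp = 8, 1000.0
big_gpus, big_tp = 256, 28800.0

eff = scaling_efficiency(base_gpus, base_tp, big_gpus, big_tp)
speedup = big_tp / base_tp
print(f"efficiency={eff:.2f}, speedup={speedup:.1f}x over {base_gpus} GPUs")
```

Under this definition, 90% efficiency at 32 times the GPU count corresponds to a speedup of about 28.8 times rather than the ideal 32 times.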
Abstract
Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size and…
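The idea of dividing a minibatch over parallel workers can be illustrated with a toy example. This is a single-process sketch under stated assumptions: the one-parameter squared-error model, the helper names, and the plain Python average standing in for a cross-worker allreduce are all illustrative and not taken from the paper.

```python
def grad(w, x, y):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x


def minibatch_grad(w, batch):
    """Mean gradient over a list of (x, y) examples."""
    return sum(grad(w, x, y) for x, y in batch) / len(batch)


def sync_sgd_grad(w, batch, num_workers):
    """Split the minibatch into equal shards, compute each shard's mean
    gradient, then average the results (a stand-in for an allreduce).
    Assumes len(batch) is divisible by num_workers."""
    shard = len(batch) // num_workers
    local = [
        minibatch_grad(w, batch[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    return sum(local) / num_workers


# Toy data: 8 examples split over 4 hypothetical workers.
batch = [(float(i), 2.0 * i + 1.0) for i in range(1, 9)]
full = minibatch_grad(0.5, batch)
dist = sync_sgd_grad(0.5, batch, 4)
print(full, dist)
```

With equal shard sizes the averaged per-worker gradients match the full-minibatch gradient, which is why adding workers at a fixed per-worker workload amounts to growing the SGD minibatch.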
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · AI in cancer detection
Methods: Stochastic Gradient Descent
