Massively Distributed SGD: ImageNet/ResNet-50 Training in a Flash
Hiroaki Mikami, Hisahiro Suganuma, Pongsakorn U-chupala, Yoshiki Tanaka, Yuichi Kageyama

TL;DR
This paper presents a scalable distributed training method for ImageNet/ResNet-50 that combines batch-size control, label smoothing, and 2D-Torus all-reduce to achieve rapid training times on large GPU clusters.
Contribution
The paper pairs two remedies for the main obstacles to large-scale distributed training: batch-size control and label smoothing to stabilize large mini-batch training, and 2D-Torus all-reduce to cut the overhead of gradient synchronization.
Findings
Trained ImageNet/ResNet-50 in 122 seconds on the ABCI cluster.
Achieved rapid training without significant accuracy loss.
Demonstrated effectiveness of 2D-Torus all-reduce in reducing synchronization overhead.
Abstract
Scaling distributed deep learning to a massive GPU cluster is challenging due to the instability of large mini-batch training and the overhead of gradient synchronization. We address the instability of large mini-batch training with batch-size control and label smoothing. We address the overhead of gradient synchronization with 2D-Torus all-reduce. Specifically, 2D-Torus all-reduce arranges GPUs in a logical 2D grid and performs a series of collective operations in different orientations. These two techniques are implemented with Neural Network Libraries (NNL). We have successfully trained ImageNet/ResNet-50 in 122 seconds without significant accuracy loss on the ABCI cluster.
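To make the "logical 2D grid" idea concrete, here is a minimal single-process sketch of a 2D-Torus all-reduce. It is not NNL or NCCL code, and the function name `torus_all_reduce_2d` is hypothetical. The abstract only says that collectives run "in different orientations"; the sketch assumes the three-step order usually attributed to this scheme (reduce-scatter along each row, all-reduce along each column, all-gather along each row) and simulates each collective with plain loops rather than real ring communication.

```python
def torus_all_reduce_2d(grads, rows, cols):
    """Simulate a 2D-Torus all-reduce on a rows x cols logical grid.

    grads[r][c] is the gradient (a list of floats) held by the GPU at
    grid position (r, c). Returns a grid of the same shape in which
    every GPU holds the element-wise sum over all GPUs.
    """
    n = len(grads[0][0])
    assert n % cols == 0, "gradient length must be divisible by cols"
    chunk = n // cols

    # Step 1 (assumed): reduce-scatter along each row.
    # GPU (r, c) ends up with the row-wise sum of chunk c only.
    partial = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            lo, hi = c * chunk, (c + 1) * chunk
            partial[r][c] = [
                sum(grads[r][k][i] for k in range(cols))
                for i in range(lo, hi)
            ]

    # Step 2 (assumed): all-reduce along each column.
    # Every GPU in column c now holds the global sum of chunk c.
    for c in range(cols):
        col_sum = [
            sum(partial[r][c][i] for r in range(rows))
            for i in range(chunk)
        ]
        for r in range(rows):
            partial[r][c] = list(col_sum)

    # Step 3 (assumed): all-gather along each row.
    # Each GPU collects all chunks from its row to rebuild the full vector.
    result = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        full = [x for c in range(cols) for x in partial[r][c]]
        for c in range(cols):
            result[r][c] = list(full)
    return result


# Toy example: a 2 x 3 grid where GPU number k (1..6) holds a vector of k's.
rows, cols = 2, 3
grads = [[[float(r * cols + c + 1)] * 6 for c in range(cols)]
         for r in range(rows)]
out = torus_all_reduce_2d(grads, rows, cols)
```

The point of the layout is that the column-wise step operates on only 1/cols of the gradient per GPU, and each collective spans a single row or column rather than the whole cluster, which is where the reduction in synchronization overhead would come from.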
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Machine Learning and ELM
