Massively Distributed SGD: ImageNet/ResNet-50 Training in a Flash

Hiroaki Mikami; Hisahiro Suganuma; Pongsakorn U-chupala; Yoshiki Tanaka; Yuichi Kageyama

arXiv:1811.05233 · cs.LG · March 6, 2019 · 73 cites

TL;DR

This paper presents a scalable distributed training method for ImageNet/ResNet-50 that combines batch-size control, label smoothing, and 2D-Torus all-reduce to achieve rapid training times on large GPU clusters.

Contribution

It introduces a novel combination of techniques, including 2D-Torus all-reduce and batch-size control, to enable fast, stable large-scale distributed training of deep neural networks.

Findings

1. Trained ImageNet/ResNet-50 in 122 seconds on a large GPU cluster.

2. Achieved rapid training without significant accuracy loss.

3. Demonstrated effectiveness of 2D-Torus all-reduce in reducing synchronization overhead.

Abstract

Scaling distributed deep learning to a massive GPU cluster is challenging because large mini-batch training is unstable and gradient synchronization adds overhead. We address the instability of large mini-batch training with batch-size control and label smoothing, and the overhead of gradient synchronization with 2D-Torus all-reduce. Specifically, 2D-Torus all-reduce arranges GPUs in a logical 2D grid and performs a series of collective operations in different orientations. These techniques are implemented with Neural Network Libraries (NNL). We trained ImageNet/ResNet-50 in 122 seconds without significant accuracy loss on the ABCI cluster.
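
To make the idea of collectives running in different orientations on a logical 2D grid more concrete, here is a minimal single-process sketch in plain Python. It assumes one common decomposition: a reduce-scatter along each row, an all-reduce along each column, then an all-gather along each row. The abstract does not spell out these steps, so treat the ordering as an assumption. The function name `torus_all_reduce_2d` and the nested-list layout are hypothetical and chosen only for illustration; no real GPU communication or NNL API is involved.

```python
def torus_all_reduce_2d(grads):
    """Simulate a 2D-grid all-reduce in one process.

    grads[r][c] is the gradient (a list of floats) held by the worker at
    row r, column c of the logical grid. Returns a grid of the same shape
    in which every worker holds the element-wise sum over all workers.
    """
    rows, cols = len(grads), len(grads[0])
    n = len(grads[0][0])
    if n % cols != 0:
        raise ValueError("gradient length must be divisible by the column count")
    chunk = n // cols

    # Step 1 (assumed): reduce-scatter within each row.
    # Afterwards worker (r, c) holds chunk c summed over the workers of row r.
    partial = [
        [
            [sum(grads[r][k][c * chunk + i] for k in range(cols)) for i in range(chunk)]
            for c in range(cols)
        ]
        for r in range(rows)
    ]

    # Step 2 (assumed): all-reduce within each column.
    # Afterwards worker (r, c) holds chunk c summed over every worker in the grid.
    reduced = [
        [
            [sum(partial[k][c][i] for k in range(rows)) for i in range(chunk)]
            for c in range(cols)
        ]
        for r in range(rows)
    ]

    # Step 3 (assumed): all-gather within each row.
    # Each worker concatenates the chunks held by the workers of its own row.
    return [
        [[x for c2 in range(cols) for x in reduced[r][c2]] for _ in range(cols)]
        for r in range(rows)
    ]


# Toy 2 x 3 grid with 6-element gradients (illustrative values only).
demo_grads = [
    [[float(r * 10 + c + i) for i in range(6)] for c in range(3)]
    for r in range(2)
]
demo_result = torus_all_reduce_2d(demo_grads)
```

In this sketch the column step operates only on a chunk of size n / cols rather than the full gradient, which is one plausible reason a grid arrangement could lower synchronization cost compared with a single large ring.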

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Machine Learning and ELM