Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

Xianyan Jia; Shutao Song; Wei He; Yangzihao Wang; Haidong Rong; Feihu Zhou; Liqiang Xie; Zhenyu Guo; Yuanzhou Yang; Liwei Yu; Tiegang Chen; Guangxiao Hu; Shaohuai Shi; Xiaowen Chu

arXiv:1807.11205 · cs.LG · July 31, 2018 · 313 cites

Open Access

TL;DR

This paper introduces a highly scalable deep learning training system that combines mixed-precision training, an optimization approach for very large mini-batches, and optimized all-reduce communication to train ImageNet models on dense GPU clusters in minutes.

Contribution

The paper presents a novel mixed-precision training method, an optimization approach for extremely large mini-batch sizes, and highly optimized all-reduce algorithms for scalable GPU clusters.
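
To make the mixed-precision idea concrete, here is a minimal sketch of one widely used ingredient, loss scaling with higher-precision master weights. This summary does not describe the paper's own method in detail, so this is a generic illustration and not the authors' implementation. The helper names `to_fp16` and `sgd_step` and the scale factor 1024 are assumptions chosen for the example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sgd_step(master_w: float, true_grad: float, lr: float, loss_scale: float) -> float:
    """One SGD step where the gradient passes through FP16 storage.

    Scaling the loss by `loss_scale` scales every gradient by the same
    factor (chain rule), which is emulated here by scaling `true_grad`.
    """
    # Gradient as it would be stored in FP16 after the backward pass.
    grad_fp16 = to_fp16(true_grad * loss_scale)
    # Unscale and update in higher precision. Python floats are 64-bit;
    # they stand in for the FP32 master weights in this sketch.
    grad = grad_fp16 / loss_scale
    return master_w - lr * grad


# A gradient this small is below the FP16 subnormal range and flushes to zero.
true_grad = 1e-8
w_unscaled = sgd_step(1.0, true_grad, lr=0.1, loss_scale=1.0)
w_scaled = sgd_step(1.0, true_grad, lr=0.1, loss_scale=1024.0)

print(to_fp16(true_grad))   # the raw gradient is lost in FP16
print(w_unscaled == 1.0)    # so the weight does not move without scaling
print(w_scaled < 1.0)       # with scaling, the update survives
```

Without scaling, the small gradient underflows in half precision and the weight never changes; with scaling, the same gradient is recovered to within FP16 rounding error before the update is applied.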

Findings

01

Trained ResNet-50 on ImageNet in 6.6 minutes with 2048 GPUs.

02

Improved training throughput without accuracy loss using mixed-precision.

03

Optimized all-reduce algorithms achieve up to 3x (AlexNet) and 11x (ResNet-50) speedup over an NCCL-based baseline.

Abstract

Synchronized stochastic gradient descent (SGD) optimizers with data parallelism are widely used in training large-scale deep neural networks. Although using larger mini-batch sizes can improve the system scalability by reducing the communication-to-computation ratio, it may hurt the generalization ability of the models. To this end, we build a highly scalable deep learning training system for dense GPU clusters with three main contributions: (1) We propose a mixed-precision training method that significantly improves the training throughput of a single GPU without losing accuracy. (2) We propose an optimization approach for extremely large mini-batch size (up to 64k) that can train CNN models on the ImageNet dataset without losing accuracy. (3) We propose highly optimized all-reduce algorithms that achieve up to 3x and 11x speedup on AlexNet and ResNet-50 respectively than NCCL-based…
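
The all-reduce step mentioned in the abstract is what keeps synchronized data-parallel SGD consistent: every worker must end up with the sum of all workers' gradients. The abstract does not describe the authors' optimized algorithms, so the sketch below simulates the textbook ring all-reduce instead, purely as background. The function name `ring_all_reduce` and the in-process "workers" are assumptions for illustration.

```python
def ring_all_reduce(grads):
    """Simulate ring all-reduce (sum) over equal-length per-worker gradient lists.

    Returns one buffer per worker; every buffer holds the element-wise sum.
    """
    n = len(grads)
    length = len(grads[0])
    # Split each vector into n contiguous chunks; chunk c is bounds[c]..bounds[c+1].
    bounds = [c * length // n for c in range(n + 1)]
    bufs = [list(g) for g in grads]

    def chunk(c):
        c %= n
        return range(bounds[c], bounds[c + 1])

    # Phase 1, reduce-scatter: after n-1 steps, worker r holds the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing data so all sends in a step happen "simultaneously".
        sends = [[bufs[r][i] for i in chunk(r - step)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for i, v in zip(chunk(r - step), sends[r]):
                bufs[dst][i] += v

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [[bufs[r][i] for i in chunk(r + 1 - step)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for i, v in zip(chunk(r + 1 - step), sends[r]):
                bufs[dst][i] = v

    return bufs


worker_grads = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
]
reduced = ring_all_reduce(worker_grads)
print(reduced[0])
```

In this scheme each worker only ever talks to its ring neighbor and sends roughly 2(n-1)/n of the vector in total, which is why it is a common baseline for bandwidth-bound gradient exchange.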

Taxonomy

Topics: Advanced Neural Network Applications · Artificial Intelligence in Healthcare and Education · Adversarial Robustness in Machine Learning

Methods: 1x1 Convolution · Convolution · Local Response Normalization · Grouped Convolution · Dropout · Dense Connections · Max Pooling · Softmax