Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

Shaohuai Shi; Xianhao Zhou; Shutao Song; Xingyao Wang; Zilin Zhu; Xue Huang; Xinan Jiang; Feihu Zhou; Zhenyu Guo; Liqiang Xie; Rui Lan; Xianbin Ouyang; Yan Zhang; Jieqian Wei; Jing Gong; Weiliang Lin; Ping Gao; Peng Meng; Xiaomin Xu; Chenyang Guo; Bo Yang; Zhibo Chen; Yongjian Wu; Xiaowen Chu

arXiv:2010.10458 · cs.DC · October 21, 2020 · 25 cites

Open Access

TL;DR

This paper introduces a new communication-efficient method and system optimizations for distributed deep learning training on public cloud clusters, achieving significant speedups and record-breaking results.

Contribution

It proposes a top-k sparsification communication library and system-level optimizations to improve scalability and efficiency of distributed training on public cloud GPU clusters.

Findings

01. Achieves 25%-40% faster training than existing systems.

02. Breaks DAWNBench record for ResNet-50 training on ImageNet.

03. Demonstrates effectiveness on Tencent Cloud GPU clusters.

Abstract

Distributed training techniques have been widely deployed for training large-scale deep neural networks (DNNs) on dense-GPU clusters. However, on public cloud clusters, because of the moderate interconnect bandwidth between instances, traditional state-of-the-art distributed training systems do not scale well when training large-scale models. In this paper, we propose a new computation- and communication-efficient top-k sparsification communication library for distributed training. To further improve system scalability, we optimize I/O with a simple yet efficient multi-level data caching mechanism and optimize the update operation with a novel parallel tensor operator. Experimental results on a 16-node Tencent Cloud cluster (each node with 8 Nvidia Tesla V100 GPUs) show that our system is 25%-40% faster than existing state-of-the-art systems on CNNs and…
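To make the idea of top-k sparsification concrete, here is a minimal sketch in plain Python. It is not the paper's library: the function names `topk_sparsify` and `sparse_allreduce_step` are hypothetical, the workers are simulated in a single process, and the residual accumulation (error feedback) shown here is a common companion technique that is assumed for illustration, not something the abstract describes.

```python
import heapq


def topk_sparsify(grad, k):
    """Return (indices, values) of the k largest-magnitude entries of grad."""
    k = min(k, len(grad))
    # Pick the k positions whose absolute value is largest.
    indices = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    indices.sort()
    return indices, [grad[i] for i in indices]


def sparse_allreduce_step(worker_grads, residuals, k):
    """Simulate one aggregation step over several workers (hypothetical helper).

    Each worker adds its local residual to its gradient, sends only the
    top-k entries, and keeps the unsent entries as the new residual.
    `residuals` is updated in place. Returns the averaged dense gradient.
    """
    n = len(worker_grads[0])
    aggregated = [0.0] * n
    for w, grad in enumerate(worker_grads):
        corrected = [g + r for g, r in zip(grad, residuals[w])]
        idx, vals = topk_sparsify(corrected, k)
        new_residual = corrected[:]
        for i, v in zip(idx, vals):
            aggregated[i] += v       # only these k values are "communicated"
            new_residual[i] = 0.0    # sent entries leave the residual
        residuals[w] = new_residual
    return [a / len(worker_grads) for a in aggregated]


if __name__ == "__main__":
    grads = [[0.1, -2.0, 0.3, 1.5], [1.0, 0.2, -3.0, 0.1]]
    residuals = [[0.0] * 4, [0.0] * 4]
    print(sparse_allreduce_step(grads, residuals, k=2))
    print(residuals)
```

With k much smaller than the gradient length, each worker transmits k index/value pairs instead of the full dense tensor, which is where the communication saving on bandwidth-limited cloud instances would come from. A real system would select the top-k on the GPU and exchange the sparse pairs over the network.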

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Advanced Neural Network Applications · IoT and Edge/Fog Computing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax · Adam · Layer Normalization · Dense Connections · Multi-Head Attention · Label Smoothing