On Scale-out Deep Learning Training for Cloud and HPC

Srinivas Sridharan; Karthikeyan Vaidyanathan; Dhiraj Kalamkar; Dipankar Das; Mikhail E. Smorkalov; Mikhail Shiryaev; Dheevatsa Mudigere; Naveen Mellempudi; Sasikanth Avancha; Bharat Kaul; Pradeep Dubey

arXiv:1801.08030·cs.DC·January 25, 2018·19 cites

TL;DR

This paper describes the design and implementation of the Intel Machine Learning Scalability Library (MLSL), which targets scalable distributed deep learning training on cloud and HPC systems and the challenges of scaling synchronous SGD.

Contribution

The paper introduces Intel MLSL and presents proof-points showing deep learning training scaled to hundreds to thousands of nodes.

Findings

01 Successful scaling of DL training on 100s to 1000s of nodes

02 Demonstrated efficiency across cloud and HPC systems

03 Addressed challenges in synchronous SGD scaling

Abstract

The exponential growth in use of large deep neural networks has accelerated the need for training these deep neural networks in hours or even minutes. This can only be achieved through scalable and efficient distributed training, since a single node/card cannot satisfy the compute, memory, and I/O requirements of today's state-of-the-art deep neural networks. However, scaling synchronous Stochastic Gradient Descent (SGD) is still a challenging problem and requires continued research/development. This entails innovations spanning algorithms, frameworks, communication libraries, and system design. In this paper, we describe the philosophy, design, and implementation of Intel Machine Learning Scalability Library (MLSL) and present proof-points demonstrating scaling DL training on 100s to 1000s of nodes across Cloud and HPC systems.
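
For readers unfamiliar with the synchronous SGD pattern the abstract refers to, the following is a minimal single-process sketch, not the MLSL API and not the authors' implementation. It assumes a toy one-parameter model, and the names `local_gradient`, `allreduce_mean`, and `sync_sgd` are hypothetical. Each simulated worker computes a gradient on its own data shard, an allreduce-style step averages the gradients, and every replica applies the same update.

```python
# Toy simulation of synchronous data-parallel SGD (no real networking).
# All names are illustrative assumptions; this is not the MLSL API.

def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)**2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an allreduce: every worker receives the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def sync_sgd(shards, steps=200, lr=0.05):
    weights = [0.0] * len(shards)  # one model replica per worker
    for _ in range(steps):
        # Each worker computes a gradient on its own shard.
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        # Synchronization point: all workers wait for the averaged gradient.
        grads = allreduce_mean(grads)
        # Identical updates keep the replicas in lockstep.
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights

# Data drawn from y = 3x, split evenly across 4 simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
final_weights = sync_sgd(shards)
```

In this sketch the averaging step is the only point where workers exchange data, so in a real distributed run it is where communication cost enters as node counts grow.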

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques