On Scale-out Deep Learning Training for Cloud and HPC
Srinivas Sridharan, Karthikeyan Vaidyanathan, Dhiraj Kalamkar, Dipankar Das, Mikhail E. Smorkalov, Mikhail Shiryaev, Dheevatsa Mudigere, Naveen Mellempudi, Sasikanth Avancha, Bharat Kaul, Pradeep Dubey

TL;DR
This paper describes the design and implementation of the Intel Machine Learning Scalability Library (MLSL), which enables scalable distributed deep learning training across cloud and HPC systems and addresses the challenges of scaling synchronous SGD.
Contribution
It introduces the MLSL library and demonstrates its effectiveness in scaling deep learning training on hundreds to thousands of nodes.
Findings
Successful scaling of DL training on 100s to 1000s of nodes
Demonstrated efficiency across cloud and HPC systems
Addressed challenges in synchronous SGD scaling
Abstract
The exponential growth in use of large deep neural networks has accelerated the need for training these deep neural networks in hours or even minutes. This can only be achieved through scalable and efficient distributed training, since a single node/card cannot satisfy the compute, memory, and I/O requirements of today's state-of-the-art deep neural networks. However, scaling synchronous Stochastic Gradient Descent (SGD) is still a challenging problem and requires continued research/development. This entails innovations spanning algorithms, frameworks, communication libraries, and system design. In this paper, we describe the philosophy, design, and implementation of Intel Machine Learning Scalability Library (MLSL) and present proof-points demonstrating scaling DL training on 100s to 1000s of nodes across Cloud and HPC systems.
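To make the term "synchronous SGD" concrete, the following is a minimal single-process sketch of the data-parallel pattern the abstract refers to: each worker computes a gradient on its own shard, the gradients are averaged in an allreduce step, and every worker applies the same update. It is an illustration under stated assumptions, not MLSL's API or the authors' implementation. The function names (`local_gradient`, `allreduce_mean`, `sync_sgd`), the toy least-squares problem, and all hyperparameters are hypothetical, and the allreduce is simulated with an in-process average rather than real network communication.

```python
# Simulated synchronous data-parallel SGD on a 1-D least-squares problem.
# Workers are plain Python list entries here; a real system would run them
# on separate nodes and use a communication library for the allreduce.

def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)^2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an allreduce: every worker receives the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def sync_sgd(data, num_workers, lr, steps, w0=0.0):
    """Run synchronous SGD and return the weight held by each worker."""
    # Split the data into equal-sized contiguous shards, one per worker.
    shard_size = len(data) // num_workers
    shards = [data[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    weights = [w0] * num_workers  # every worker starts from the same model
    for _ in range(steps):
        # 1. Each worker computes a gradient on its local shard.
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        # 2. Synchronization point: gradients are averaged across workers.
        reduced = allreduce_mean(grads)
        # 3. Every worker applies the identical update, so replicas stay in sync.
        weights = [w - lr * g for w, g in zip(weights, reduced)]
    return weights

# Toy data generated from y = 3x, so the optimum is w = 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
weights = sync_sgd(data, num_workers=4, lr=0.01, steps=200)
single = sync_sgd(data, num_workers=1, lr=0.01, steps=200)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, so the four-worker run follows the same trajectory as the single-worker run up to floating-point rounding. The cost that a scale-out library has to manage is step 2: every training step waits on a collective exchange whose cost grows with model size and node count.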
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques
