Analysis of Distributed Deep Learning in the Cloud

Aakash Sharma; Vivek M. Bhasi; Sonali Singh; Rishabh Jain; Jashwant Raj Gunasekaran; Subrata Mitra; Mahmut Taylan Kandemir; George Kesidis; Chita R. Das

arXiv:2208.14344·cs.LG·December 26, 2022·1 cites

TL;DR

This paper introduces a comprehensive profiler for distributed deep learning in the cloud, identifying communication stalls and providing insights into hardware performance and cost optimization.

Contribution

The paper extends an existing profiler to estimate two types of communication stalls (interconnect and network) and models the impact of DNN features, helping users optimize the performance and cost of distributed deep learning (DDL) in the cloud.

Findings

01. Communication overheads from the intra-machine interconnect can reach up to 90% of DNN training time.

02. Network-connected instances can be up to 5x slower than single-instance training.

03. More expensive GPU instances are not always the most performant for every DNN model.
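To make the two headline figures concrete, the sketch below shows how an overhead fraction and a slowdown factor could be computed from measured training times. It is an illustration only: the function names and the timing values are hypothetical and do not come from the paper, whose profiler estimates stalls by its own method.

```python
def overhead_fraction(total_time_s: float, compute_time_s: float) -> float:
    """Share of total training time that is not spent computing.

    Assumption (not from the paper): everything outside compute time
    is treated as communication overhead.
    """
    if total_time_s <= 0:
        raise ValueError("total_time_s must be positive")
    if not 0 <= compute_time_s <= total_time_s:
        raise ValueError("compute_time_s must lie in [0, total_time_s]")
    return (total_time_s - compute_time_s) / total_time_s


def slowdown(measured_time_s: float, baseline_time_s: float) -> float:
    """How many times slower a configuration is than a baseline."""
    if measured_time_s <= 0 or baseline_time_s <= 0:
        raise ValueError("times must be positive")
    return measured_time_s / baseline_time_s


# Hypothetical timings chosen to reproduce the reported upper bounds.
print(f"overhead: {overhead_fraction(10.0, 1.0):.0%}")
print(f"slowdown: {slowdown(5.0, 1.0):.1f}x")
```

With these made-up inputs, 9 of every 10 seconds go to overhead, and the slower configuration takes five times as long as the baseline.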

Abstract

We aim to resolve this problem by introducing a comprehensive distributed deep learning (DDL) profiler, which can determine the various execution "stalls" that DDL suffers from while running on a public cloud. We have implemented the profiler by extending prior work to additionally estimate two types of communication stalls - interconnect and network stalls. We train popular DNN models using the profiler to characterize various AWS GPU instances and list their advantages and shortcomings for users to make an informed decision. We observe that the more expensive GPU instances may not be the most performant for all DNN models and AWS may sub-optimally allocate hardware interconnect resources. Specifically, the intra-machine interconnect can introduce communication overheads up to 90% of DNN training time and network-connected instances can suffer from up to 5x slowdown compared to…

Taxonomy

Topics: Advanced Memory and Neural Computing · Stochastic Gradient Optimization Techniques · Ferroelectric and Negative Capacitance Devices