Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models

Saeed Rashidi; William Won; Sudarshan Srinivasan; Srinivas Sridharan; Tushar Krishna

arXiv:2110.04478·cs.DC·July 8, 2022·1 cites

Open Access

TL;DR

Themis is a novel collective scheduling policy that dynamically balances communication loads across heterogeneous network dimensions, significantly improving bandwidth utilization and training performance in distributed deep learning systems.

Contribution

Themis introduces a dynamic collective scheduling scheme that optimizes network bandwidth utilization across multiple dimensions in heterogeneous environments.

Findings

1. Improves network bandwidth utilization by 1.72X on average.

2. Enhances training iteration performance for various workloads by up to 2.25X.

3. Effectively balances communication loads across network dimensions.

Abstract

Distributed training reduces DNN training time by splitting the task across multiple NPUs (e.g., GPUs/TPUs). However, it adds communication overhead between the NPUs to synchronize gradients and/or activations, depending on the parallelization strategy. In next-generation platforms for training at scale, NPUs will be connected through multi-dimensional networks with diverse, heterogeneous bandwidths. This work identifies a looming challenge: with the collective-communication scheduling techniques used on today's systems, it is difficult to keep all network dimensions busy and maximize network BW in such a hybrid environment. We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance the communication loads across all dimensions, further improving the network BW utilization. Our…
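
The idea of balancing chunk loads across network dimensions can be illustrated with a toy model. The sketch below is not the paper's algorithm or implementation; it is a minimal illustration under stated assumptions. It assumes a hierarchical reduce-scatter in which a chunk shrinks by a factor of the dimension size after each dimension it visits, that dimensions work concurrently so the busiest one bounds completion time, and that latency and the all-gather phase can be ignored. The function names (`stage_times`, `schedule`), the two-dimensional topology, and the bandwidth and chunk-size values are all hypothetical.

```python
from typing import List, Sequence


def stage_times(chunk_size: float, order: Sequence[int],
                dim_sizes: Sequence[int],
                bandwidths: Sequence[float]) -> List[float]:
    """Time one chunk spends on each dimension when visiting them in `order`.

    Toy model (assumption): a reduce-scatter over a dimension with p NPUs
    sends (p - 1) / p of the current data, then leaves 1 / p of it.
    """
    times = [0.0] * len(dim_sizes)
    size = chunk_size
    for d in order:
        p = dim_sizes[d]
        times[d] = size * (p - 1) / p / bandwidths[d]
        size /= p  # data shrinks before reaching the next dimension
    return times


def schedule(num_chunks: int, chunk_size: float,
             dim_sizes: Sequence[int], bandwidths: Sequence[float],
             balance: bool) -> List[float]:
    """Return the accumulated busy time of each dimension.

    balance=False: every chunk uses the same fixed order (dim 0 first).
    balance=True:  each chunk starts on the currently least-loaded
                   dimension (a simple greedy heuristic, assumed here).
    """
    loads = [0.0] * len(dim_sizes)
    for _ in range(num_chunks):
        if balance:
            order = sorted(range(len(loads)), key=lambda d: loads[d])
        else:
            order = list(range(len(loads)))
        per_dim = stage_times(chunk_size, order, dim_sizes, bandwidths)
        for d, t in enumerate(per_dim):
            loads[d] += t
    return loads


# Hypothetical 2D network: 4 NPUs per dimension, 200 and 100 units of bandwidth.
baseline = schedule(8, 64.0, [4, 4], [200.0, 100.0], balance=False)
greedy = schedule(8, 64.0, [4, 4], [200.0, 100.0], balance=True)
print("fixed order :", baseline, "busiest =", max(baseline))
print("load-aware  :", greedy, "busiest =", max(greedy))
```

In this toy setup the fixed order leaves the second dimension idle for much of the time, while the load-aware order brings the two busy times closer together and lowers the busiest dimension's time. The total work across dimensions can rise when more data is routed over the slower dimension, so the benefit in this model comes only from the improved balance.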

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques