On Scheduling Ring-All-Reduce Learning Jobs in Multi-Tenant GPU Clusters with Communication Contention

Menglu Yu; Bo Ji; Hridesh Rajan; Jia Liu

arXiv:2207.07817 · cs.DC · August 16, 2022

Open Access

TL;DR

This paper introduces a theoretical framework and scheduling algorithm for efficiently managing multiple ring-all-reduce deep learning jobs in GPU clusters, reducing communication contention and improving training efficiency.

Contribution

It develops a new analytical model for communication overhead and contention, formulates a contention-aware scheduling problem, and proposes an approximation algorithm with proven effectiveness.

Findings

01

SJF-BCO (Smallest Job First with Balanced Contention and Overhead), the proposed scheduler, achieves a shorter makespan than existing schedulers.

02

The analytical model accurately predicts communication overhead and contention.

03

The proposed algorithm is effective in multi-tenant GPU cluster environments.

Abstract

Powered by advances in deep learning (DL) techniques, machine learning and artificial intelligence have achieved astonishing successes. However, the rapidly growing need for DL has also led to communication- and resource-intensive distributed training jobs, which are typically deployed over GPU clusters. To sustain the ever-increasing demand for DL training, "ring-all-reduce" (RAR) technologies have recently emerged as a favorable computing architecture for efficiently handling network communication and computation load in GPU clusters. The most salient feature of RAR is that it removes the need for dedicated parameter servers, thus alleviating a potential communication bottleneck. However, when multiple RAR-based DL training jobs are deployed over GPU clusters, communication bottlenecks can still occur due to contention between DL training…
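To make the ring-all-reduce idea concrete, here is a minimal in-process sketch of the generic RAR pattern: a reduce-scatter phase followed by an all-gather phase, with each worker talking only to its ring neighbor and no parameter server involved. It is an illustration under stated assumptions, not the paper's implementation: the function name `ring_all_reduce` is hypothetical, workers are simulated as Python lists, and neither network links nor contention between jobs are modeled.

```python
from typing import List


def ring_all_reduce(worker_grads: List[List[float]]) -> List[List[float]]:
    """Simulate a sum ring all-reduce over n workers (hypothetical helper).

    Each worker holds a gradient vector of the same length, which must be
    divisible by the number of workers so it can be split into n chunks.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    if length % n != 0:
        raise ValueError("vector length must be divisible by worker count")
    size = length // n

    # chunks[w][c] is worker w's local copy of chunk c.
    chunks = [
        [list(g[c * size:(c + 1) * size]) for c in range(n)]
        for g in worker_grads
    ]

    # Phase 1: reduce-scatter. In step s, worker w sends chunk (w - s) mod n
    # to its right neighbor, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        # Snapshot all outgoing messages first so sends are simultaneous.
        sends = [(w, (w - s) % n, chunks[w][(w - s) % n][:]) for w in range(n)]
        for w, c, data in sends:
            dst = (w + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], data)]
    # Now worker w holds the fully reduced chunk (w + 1) mod n.

    # Phase 2: all-gather. In step s, worker w forwards chunk (w + 1 - s) mod n
    # to its right neighbor, which overwrites its stale copy.
    for s in range(n - 1):
        sends = [
            (w, (w + 1 - s) % n, chunks[w][(w + 1 - s) % n][:])
            for w in range(n)
        ]
        for w, c, data in sends:
            chunks[(w + 1) % n][c] = data

    # Flatten each worker's chunks back into a single vector.
    return [[x for chunk in worker for x in chunk] for worker in chunks]
```

In this sketch every worker sends and receives one chunk per step for 2(n - 1) steps, so per-worker traffic stays roughly constant as workers are added. When several such rings share links in a cluster, their transfers can compete for bandwidth, which is the contention the abstract refers to.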

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · IoT and Edge/Fog Computing · Age of Information Optimization