GPU Cluster Scheduling for Network-Sensitive Deep Learning

Aakash Sharma; Vivek M. Bhasi; Sonali Singh; George Kesidis; Mahmut T. Kandemir; Chita R. Das

arXiv:2401.16492·cs.PF·November 11, 2025·1 cites

TL;DR

This paper introduces a GPU-cluster scheduler for distributed deep learning workloads that places jobs according to their network sensitivity, improving training efficiency and reducing communication overhead.

Contribution

The paper presents a novel GPU-cluster scheduler with proximity-based consolidation, network-sensitive preemption, and auto-tuning, validated through a data-driven simulation platform.

Findings

01. Up to 69% improvement in end-to-end makespan.
02. Up to 83% reduction in average job completion time.
03. Up to 98% decrease in communication overheads.

Abstract

We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity-based consolidation of GPU resources according to the DDL jobs' sensitivities to the anticipated communication-network delays. Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism to optimize delay timers for effective delay scheduling. Additionally, to enable a cost-effective methodology for large-scale experiments, we develop a data-driven DDL cluster simulation platform. Employing the simulation platform, we compare against several state-of-the-art alternatives on real-world workload traces to demonstrate the benefits of our design. Our scheduler can provide improvement of up to 69% in end-to-end makespan for training all…
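To make component (i) concrete, here is a minimal sketch of classical delay scheduling applied to GPU consolidation. It is not the paper's implementation. The three locality levels (machine, rack, cluster), the names `Job`, `try_place`, `machine_timer`, and `rack_timer`, and the greedy allocation helper are all assumptions made for illustration. The idea shown is only that a job accepts a less consolidated placement after it has waited longer than the timer for the tighter level.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int            # GPUs requested
    waited: float = 0.0  # seconds spent waiting so far (hypothetical field)


def _take(free, nodes, need):
    """Greedily take `need` GPUs from `nodes`, fullest node first."""
    alloc = {}
    for n in sorted(nodes, key=lambda n: -free[n]):
        if need == 0:
            break
        k = min(free[n], need)
        if k > 0:
            alloc[n] = k
            need -= k
    return alloc


def try_place(job, free, racks, machine_timer, rack_timer):
    """Return (level, allocation), or None if the job should keep waiting.

    free:  {node: free GPU count}
    racks: {rack: [node, ...]}
    Timers are in seconds and are illustrative assumptions.
    """
    # Level 1: the whole job fits on one machine (always acceptable).
    for node, n_free in free.items():
        if n_free >= job.gpus:
            return "machine", {node: job.gpus}
    # Level 2: spread within one rack, only after the machine timer expires.
    if job.waited >= machine_timer:
        for nodes in racks.values():
            if sum(free[n] for n in nodes) >= job.gpus:
                return "rack", _take(free, nodes, job.gpus)
    # Level 3: spread across the cluster, only after the rack timer expires.
    if job.waited >= rack_timer and sum(free.values()) >= job.gpus:
        return "cluster", _take(free, list(free), job.gpus)
    return None


racks = {"A": ["a1", "a2"], "B": ["b1"]}
free = {"a1": 2, "a2": 2, "b1": 1}
fresh = try_place(Job("j1", 4, waited=0.0), free, racks, 10.0, 30.0)
relaxed = try_place(Job("j1", 4, waited=10.0), free, racks, 10.0, 30.0)
```

In this example `fresh` is `None` because no single machine has four free GPUs and the machine timer has not expired, while `relaxed` is a rack-level placement across `a1` and `a2`. Because the timers are plain arguments, a caller could pass longer timers for jobs it considers more network-sensitive; how the paper's auto-tuner actually sets them is not described in this excerpt.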
