GPU Cluster Scheduling for Network-Sensitive Deep Learning
Aakash Sharma, Vivek M. Bhasi, Sonali Singh, George Kesidis, Mahmut T. Kandemir, Chita R. Das

TL;DR
This paper introduces a GPU-cluster scheduler for distributed deep learning (DDL) workloads that places each job according to its sensitivity to network delay, shortening training time and cutting communication overhead.
Contribution
The paper presents a novel GPU-cluster scheduler with proximity-based consolidation, network-sensitive preemption, and auto-tuning, validated through a data-driven simulation platform.
Findings
Up to 69% improvement in end-to-end makespan.
Up to 83% reduction in average job completion time.
Up to 98% decrease in communication overheads.
Abstract
We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity-based consolidation of GPU resources according to the DDL jobs' sensitivities to the anticipated communication-network delays. Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism to optimize delay timers for effective delay scheduling. Additionally, to enable a cost-effective methodology for large-scale experiments, we develop a data-driven DDL cluster simulation platform. Employing the simulation platform, we compare against several state-of-the-art alternatives on real-world workload traces to demonstrate the benefits of our design. Our scheduler can provide improvement of up to 69% in end-to-end makespan for training all…
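To make the first component concrete, here is a minimal sketch of classical delay scheduling applied to GPU placement. It is not the paper's implementation. The three locality tiers (machine, rack, cluster), the `Job` fields, the fixed timer values, and every function name are assumptions chosen for illustration. The idea shown is only that a job holds out for a consolidated placement until its delay timer expires, then accepts a more spread-out one.

```python
from dataclasses import dataclass

# Locality tiers from most to least consolidated (hypothetical names).
MACHINE, RACK, CLUSTER = 0, 1, 2


@dataclass
class Job:
    name: str
    gpus: int
    arrival: float
    # Hypothetical delay timers: how long the job must wait before it may
    # accept a rack-level or a cluster-level placement.
    rack_delay: float = 10.0
    cluster_delay: float = 30.0


def allowed_tier(job: Job, now: float) -> int:
    """Loosest locality tier the job may accept after waiting until `now`."""
    waited = now - job.arrival
    if waited >= job.cluster_delay:
        return CLUSTER
    if waited >= job.rack_delay:
        return RACK
    return MACHINE


def best_available_tier(gpus_needed: int, free: dict):
    """Tightest tier that can host the job, or None if nothing fits.

    `free` maps rack name -> {machine name -> free GPU count}.
    """
    if any(n >= gpus_needed for ms in free.values() for n in ms.values()):
        return MACHINE
    if any(sum(ms.values()) >= gpus_needed for ms in free.values()):
        return RACK
    if sum(n for ms in free.values() for n in ms.values()) >= gpus_needed:
        return CLUSTER
    return None


def try_place(job: Job, now: float, free: dict):
    """Return the tier the job is placed at, or None if it keeps waiting."""
    tier = best_available_tier(job.gpus, free)
    if tier is not None and tier <= allowed_tier(job, now):
        return tier
    return None


free = {"r0": {"m0": 2, "m1": 2}, "r1": {"m2": 1}}
job = Job("resnet", gpus=4, arrival=0.0)
early = try_place(job, now=5.0, free=free)   # timer not expired: keep waiting
late = try_place(job, now=12.0, free=free)   # rack-level placement now allowed
```

In this sketch the timers are fixed constants. The abstract's "auto-tuner" component is described as optimizing such delay timers, and how it does so is not specified in the text above.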
Taxonomy
Topics: Brain Tumor Detection and Classification
