Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing
Yizhou Luo, Qiang Wang, Shaohuai Shi, Jiaxin Lai, Shuhan Qi, Jiajia Zhang, Xuan Wang

TL;DR
This paper proposes a GPU job scheduling method that enables multiple deep learning jobs to share GPUs efficiently, reducing job completion times and improving resource utilization without preemption.
Contribution
It introduces a scheduling model and a heuristic algorithm, SJF-BSBF, that decides when jobs should share GPUs. Because job training settings are left unchanged, training accuracy is maintained.
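The kind of decision such a scheduler faces can be illustrated with a toy sketch. This is not the paper's SJF-BSBF algorithm, whose details are not given on this page. It assumes, purely for illustration, that pending jobs are considered shortest first, that a job sharing GPUs runs slower by a fixed hypothetical factor `slowdown`, and that sharing is chosen only when it finishes the job earlier than waiting for the host job to complete. The names `Job`, `place`, and `slowdown` are invented for this example, and the slowdown imposed on the host job is ignored.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int        # number of GPUs requested
    duration: float  # estimated exclusive run time in hours (assumed known)


def place(pending, free_gpus, running, slowdown=1.5):
    """Toy, non-preemptive placement pass (illustrative only).

    running: list of (Job, remaining_hours) pairs that hold GPUs exclusively.
    Returns a list of (job_name, decision) pairs, where decision is
    "exclusive", "share:<host name>", or "wait".
    """
    decisions = []
    shared_hosts = set()  # each running job accepts at most one co-runner
    # Assumption: consider the shortest pending job first.
    for job in sorted(pending, key=lambda j: j.duration):
        if job.gpus <= free_gpus:
            free_gpus -= job.gpus
            decisions.append((job.name, "exclusive"))
            continue
        best = None  # (benefit in hours, host name)
        for host, remaining in running:
            if host.name in shared_hosts or host.gpus < job.gpus:
                continue
            finish_if_shared = job.duration * slowdown  # slowed for the whole run
            finish_if_wait = remaining + job.duration   # wait for the host, then run alone
            benefit = finish_if_wait - finish_if_shared
            if benefit > 0 and (best is None or benefit > best[0]):
                best = (benefit, host.name)
        if best is not None:
            shared_hosts.add(best[1])
            decisions.append((job.name, "share:" + best[1]))
        else:
            decisions.append((job.name, "wait"))
    return decisions


pending = [Job("a", 2, 1.0), Job("b", 4, 4.0), Job("c", 2, 10.0)]
running = [(Job("x", 4, 8.0), 6.0)]
print(place(pending, free_gpus=2, running=running))
```

In this example job `a` fits on the free GPUs, job `b` shares with `x` because a slowed run (6 hours) beats waiting (10 hours), and job `c` waits because `x` already has a co-runner. No running job is ever preempted, which matches the non-preemptive setting described in the TL;DR.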
Findings
Reduces average job completion time by 27-33% compared to state-of-the-art schedulers.
Outperforms an aggressive GPU-sharing baseline by up to 17% on large-scale traces.
Effectively balances resource sharing benefits with deep learning convergence requirements.
Abstract
Deep learning (DL) has demonstrated significant success across diverse fields, leading to the construction of dedicated GPU accelerators within GPU clusters for high-quality training services. Efficient scheduler designs for such clusters are vital to reduce operational costs and enhance resource utilization. While recent schedulers have shown impressive performance in optimizing DL job performance and cluster utilization through periodic reallocation or selection of GPU resources, they also encounter challenges such as preemption and migration overhead, along with potential DL accuracy degradation. Nonetheless, few explore the potential benefits of GPU sharing to improve resource utilization and reduce job queuing times. Motivated by these insights, we present a job scheduling model allowing multiple jobs to share the same set of GPUs without altering job training settings. We…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Scheduling and Optimization Algorithms · IoT and Edge/Fog Computing
