Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision
Wei Gao, Qinghao Hu, Zhisheng Ye, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen

TL;DR
This paper surveys the landscape of GPU datacenter scheduling for deep learning workloads, highlighting challenges, existing solutions, and future research directions to optimize resource utilization and reduce operational costs.
Contribution
It provides a comprehensive taxonomy and analysis of current scheduling strategies tailored for deep learning in GPU datacenters, addressing unique workload characteristics.
Findings
Traditional schedulers are inadequate for DL workloads.
Recent specialized schedulers improve resource utilization.
Future research directions include adaptive and intelligent scheduling.
Abstract
Deep learning (DL) has flourished in a wide variety of fields. Developing a DL model is a time-consuming and resource-intensive procedure, so dedicated GPU accelerators have been pooled into GPU datacenters. An efficient scheduler design for such GPU datacenters is crucial to reducing operational cost and improving resource utilization. However, traditional approaches designed for big data or high-performance computing workloads cannot help DL workloads fully utilize GPU resources. Recently, a substantial number of schedulers tailored to DL workloads in GPU datacenters have been proposed. This paper surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers facilitate the respective workloads in terms of scheduling objectives and resource consumption features. Finally, we prospect…
Taxonomy
Topics: Cloud Computing and Resource Management · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques
