DL2: A Deep Learning-driven Scheduler for Deep Learning Clusters
Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, Chen Meng, Wei Lin

TL;DR
DL2 is a deep learning-driven scheduler for DL clusters that combines offline supervised learning with reinforcement learning to resize job resources dynamically and reduce average training completion time.
Contribution
The paper introduces DL2, a generic DL-driven scheduler that combines supervised learning and reinforcement learning to schedule resources more efficiently in DL clusters.
Findings
DL2 reduces average training completion time by up to 44.1%.
DL2 outperforms existing schedulers such as DRF and Optimus.
DL2 dynamically resizes the resources allocated to DL jobs.
Abstract
More and more companies have deployed machine learning (ML) clusters, where deep learning (DL) models are trained to provide various AI-driven services. Efficient resource scheduling is essential for maximal utilization of expensive DL clusters. Existing cluster schedulers are either agnostic to ML workload characteristics, or use scheduling heuristics based on operators' understanding of particular ML frameworks and workloads, which are less efficient or not general enough. In this paper, we show that DL techniques can be adopted to design a generic and efficient scheduler. DL2 is a DL-driven scheduler for DL clusters, targeting global training job expedition by dynamically resizing resources allocated to jobs. DL2 advocates a joint supervised learning and reinforcement learning approach: a neural network is warmed up via offline supervised learning based on job traces produced by the…
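The two-phase idea in the abstract (supervised warm-up on logged decisions, followed by reinforcement learning) can be illustrated with a toy sketch. This is not the authors' implementation: the linear softmax policy, the synthetic states, the imitation labels, the reward function, and every name below (LinearPolicy, supervised_warmup, reinforce_step) are assumptions made up for illustration, and the reinforcement step is plain REINFORCE with a mean-reward baseline rather than whatever algorithm the paper uses.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class LinearPolicy:
    """Hypothetical stand-in for the scheduler network: state -> action probabilities."""

    def __init__(self, n_features, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_features, n_actions))

    def probs(self, states):
        return softmax(states @ self.W)


def supervised_warmup(policy, states, actions, lr=0.5, epochs=200):
    """Phase 1 (assumed form): imitate logged decisions via cross-entropy gradient descent."""
    onehot = np.eye(policy.W.shape[1])[actions]
    for _ in range(epochs):
        # Gradient of mean cross-entropy w.r.t. W for a linear softmax model.
        grad = states.T @ (policy.probs(states) - onehot) / len(states)
        policy.W -= lr * grad


def reinforce_step(policy, states, reward_fn, rng, lr=0.1):
    """Phase 2 (assumed form): one REINFORCE update with a mean-reward baseline."""
    p = policy.probs(states)
    actions = np.array([rng.choice(len(row), p=row) for row in p])
    rewards = reward_fn(states, actions)
    advantage = rewards - rewards.mean()
    onehot = np.eye(p.shape[1])[actions]
    # d log pi(a|s) / d logits = onehot - p; ascend the advantage-weighted gradient.
    grad = states.T @ ((onehot - p) * advantage[:, None]) / len(states)
    policy.W += lr * grad
    return float(rewards.mean())


# Synthetic data: 4 state features, 3 possible allocation choices (all made up).
rng = np.random.default_rng(0)
states = rng.random((500, 4))
logged_actions = states[:, :3].argmax(axis=1)  # pretend decisions of a baseline scheduler

policy = LinearPolicy(n_features=4, n_actions=3)
supervised_warmup(policy, states, logged_actions)


def toy_reward(s, a):
    """Made-up reward: the feature value associated with the chosen action."""
    return s[np.arange(len(a)), a]


for _ in range(20):
    mean_reward = reinforce_step(policy, states, toy_reward, rng)
```

The warm-up gives the policy a reasonable starting point before any reward signal is used, so the reinforcement phase starts from imitation of the baseline rather than from random decisions.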
Taxonomy
Topics: IoT and Edge/Fog Computing · Cloud Computing and Resource Management · Stochastic Gradient Optimization Techniques
