Aryl: An Elastic Cluster Scheduler for Deep Learning
Jiamin Li, Hong Xu, Yibo Zhu, Zherui Liu, Chuanxiong Guo, Cong Wang

TL;DR
Aryl is an elastic cluster scheduler that improves GPU utilization and cuts queuing and completion time for training jobs. It loans idle inference servers to training jobs (capacity loaning) and resizes training jobs' GPU allocations (elastic scaling), using heuristics to minimize preemptions and job completion time.
Contribution
Aryl combines capacity loaning and elastic scaling in a single scheduler, with heuristics that raise GPU cluster utilization and shorten training-job queuing and completion time for deep learning workloads.
Findings
Reduces average queuing time by 1.53x
Decreases job completion time by 1.50x
Increases cluster usage by up to 26.9%
Abstract
Companies build separate training and inference GPU clusters for deep learning, and use separate schedulers to manage them. This leads to problems for both training and inference: inference clusters have low GPU utilization when the traffic load is low; training jobs often experience long queueing time due to lack of resources. We introduce Aryl, a new cluster scheduler to address these problems. Aryl introduces capacity loaning to loan idle inference GPU servers for training jobs. It further exploits elastic scaling that scales a training job's GPU allocation to better utilize loaned resources. Capacity loaning and elastic scaling create new challenges to cluster management. When the loaned servers need to be returned, we need to minimize the number of job preemptions; when more GPUs become available, we need to allocate them to elastic jobs and minimize the job completion time (JCT).…
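The reclaiming problem described in the abstract (return loaned servers while preempting as few training jobs as possible) can be illustrated with a small sketch. This is not Aryl's actual heuristic, which the truncated abstract does not spell out. It is a simple greedy stand-in, and the function name `pick_servers_to_return`, the server-to-jobs mapping, and the rule that a job is preempted as soon as any of its servers is returned are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def pick_servers_to_return(
    jobs_on_server: Dict[str, Set[str]],
    num_to_return: int,
) -> Tuple[List[str], Set[str]]:
    """Greedy sketch (hypothetical, not the paper's algorithm).

    jobs_on_server maps each loaned server to the training jobs that
    currently hold GPUs on it. A job is assumed to be preempted as soon
    as any one of its servers is returned.
    """
    if num_to_return < 0 or num_to_return > len(jobs_on_server):
        raise ValueError("num_to_return must be between 0 and the number of loaned servers")

    remaining = dict(jobs_on_server)
    preempted: Set[str] = set()
    chosen: List[str] = []

    for _ in range(num_to_return):
        # Pick the server that adds the fewest newly preempted jobs.
        # Ties are broken by server name so the result is deterministic.
        best = min(remaining, key=lambda s: (len(remaining[s] - preempted), s))
        chosen.append(best)
        preempted |= remaining.pop(best)

    return chosen, preempted


if __name__ == "__main__":
    loaned = {
        "s1": {"a", "b"},
        "s2": {"a"},
        "s3": set(),
        "s4": {"c", "d", "e"},
    }
    print(pick_servers_to_return(loaned, 2))  # (['s3', 's2'], {'a'})
```

In this toy input the idle server `s3` is returned first at no cost, then `s2`, which preempts only job `a`. A greedy choice like this is not guaranteed to be optimal in general. It only shows why the order in which servers are returned affects the number of preemptions.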
Taxonomy
Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Distributed and Parallel Computing Systems
