Aryl: An Elastic Cluster Scheduler for Deep Learning

Jiamin Li; Hong Xu; Yibo Zhu; Zherui Liu; Chuanxiong Guo; Cong Wang

arXiv:2202.07896 · cs.DC · November 12, 2024 · 5 cites

TL;DR

Aryl is an elastic cluster scheduler that improves GPU utilization and reduces job queuing and completion times. It loans idle inference GPU servers to training jobs (capacity loaning) and adjusts each training job's GPU allocation to use them (elastic scaling), with heuristics that minimize preemptions and job completion time.

Contribution

The paper presents a scheduler that combines capacity loaning with elastic scaling, and uses heuristics to raise GPU cluster utilization and reduce job latency for deep learning workloads.

Findings

01 Reduces average queuing time by 1.53x

02 Decreases job completion time by 1.50x

03 Increases cluster usage by up to 26.9%

Abstract

Companies build separate training and inference GPU clusters for deep learning, and use separate schedulers to manage them. This leads to problems for both training and inference: inference clusters have low GPU utilization when the traffic load is low; training jobs often experience long queueing time due to lack of resources. We introduce Aryl, a new cluster scheduler to address these problems. Aryl introduces capacity loaning to loan idle inference GPU servers for training jobs. It further exploits elastic scaling that scales a training job's GPU allocation to better utilize loaned resources. Capacity loaning and elastic scaling create new challenges to cluster management. When the loaned servers need to be returned, we need to minimize the number of job preemptions; when more GPUs become available, we need to allocate them to elastic jobs and minimize the job completion time (JCT).…
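The reclaiming problem in the abstract (return loaned servers while preempting as few training jobs as possible) can be illustrated with a small greedy sketch. This is not Aryl's actual heuristic, which the truncated abstract does not describe. The function name `pick_servers_to_return`, the server and job names, and the rule that returning a server preempts every job with a worker on it are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def pick_servers_to_return(
    jobs_on_server: Dict[str, Set[str]], num_to_return: int
) -> Tuple[List[str], Set[str]]:
    """Greedily choose loaned servers to hand back, preempting few jobs.

    jobs_on_server maps a loaned server name (hypothetical) to the training
    jobs that have at least one worker on it. Assumption: returning a server
    preempts every job running there.
    """
    if num_to_return > len(jobs_on_server):
        raise ValueError("cannot return more servers than were loaned")

    remaining = dict(jobs_on_server)
    preempted: Set[str] = set()
    chosen: List[str] = []
    for _ in range(num_to_return):
        # Cost of a server = jobs it would newly preempt. Jobs already
        # preempted by an earlier choice add no further cost.
        # Iterating in sorted order makes tie-breaking deterministic.
        best = min(sorted(remaining), key=lambda s: len(remaining[s] - preempted))
        preempted |= remaining.pop(best)
        chosen.append(best)
    return chosen, preempted


# Hypothetical cluster state: job "c" spans two loaned servers.
loaned = {"s1": {"a", "b"}, "s2": {"c"}, "s3": {"c"}, "s4": set()}
servers, victims = pick_servers_to_return(loaned, 3)
print(servers, victims)  # ['s4', 's2', 's3'] {'c'}
```

The greedy choice favors empty servers first and then servers whose jobs are already preempted, so a job spread across several loaned servers is evicted only once. A greedy pass like this is not guaranteed to find the minimum number of preemptions in general; it only shows the shape of the trade-off.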

Taxonomy

Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Distributed and Parallel Computing Systems