Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling

Xinyi Zhang; Hanyu Zhao; Wencong Xiao; Xianyan Jia; Fei Xu; Yong Li; Wei Lin; Fangming Liu

arXiv:2408.08586·cs.DC·August 19, 2024

TL;DR

Rubick is a cluster scheduling system that dynamically reconfigures the execution plans of deep learning training jobs and tunes their resource allocations together, reducing job completion time by up to 3.2x and makespan by up to 1.4x on GPU clusters.

Contribution

Rubick introduces a scheduling approach that combines execution plan reconfiguration with joint tuning of multiple resource types, in contrast to existing schedulers that treat jobs as black boxes with statically chosen, user-specified plans.

Findings

01 Up to 3.2x reduction in job completion time

02 Up to 1.4x reduction in makespan

03 Effective performance guarantees for jobs

Abstract

The era of large deep learning models has given rise to advanced training strategies such as 3D parallelism and the ZeRO series. These strategies enable various (re-)configurable execution plans for a training job, which exhibit remarkably different requirements of multiple resource types. Existing cluster scheduling systems, however, treat such reconfigurable training jobs as black boxes: they rely on users to choose execution plans statically, and then make resource allocations without awareness of the chosen plans and their resource requirements. This approach results in mismatches between execution plans and resources, making both training performance and cluster utilization far from optimal. We introduce Rubick, a cluster scheduling system for deep learning training that exploits the reconfigurability to improve job performance and cluster efficiency. Rubick incorporates the job…

Taxonomy

Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management