Hybrid Workload Scheduling on HPC Systems

Yuping Fan, Paul Rich, William Allcock, Michael Papka, and Zhiling Lan

arXiv:2109.05412·cs.DC·September 14, 2021

TL;DR

This paper proposes and evaluates new scheduling mechanisms for efficiently managing hybrid workloads, including on-demand, rigid, and malleable jobs, on a single HPC system to improve responsiveness and overall system performance.

Contribution

It introduces novel scheduling strategies for co-scheduling diverse HPC workloads on a single system, addressing responsiveness, malleability incentives, and performance tradeoffs.

Findings

1. The proposed mechanisms reduce delays for on-demand requests.
2. Incentives for declaring jobs as malleable are increased.
3. Overall system performance improves under various workloads.

Abstract

Traditionally, on-demand, rigid, and malleable applications have been scheduled and executed on separate systems. Ever-growing workload demands and rapidly developing HPC infrastructure have sparked interest in converging these applications on a single HPC system. Although allocating hybrid workloads within one system could potentially improve system efficiency, it is difficult to balance the tradeoff between the responsiveness of on-demand requests, the incentive for malleable jobs, and the performance of rigid applications. In this study, we present several scheduling mechanisms to address the issues involved in co-scheduling on-demand, rigid, and malleable jobs on a single HPC system. We extensively evaluate and compare their performance under various configurations and workloads. Our experimental results show that our proposed mechanisms are capable of serving on-demand…

Taxonomy

Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques