Evaluating Malleable Job Scheduling in HPC Clusters using Real-World Workloads

Patrick Zojer; Jonas Posner; Taylan Özden

arXiv:2602.17318·cs.DC·February 20, 2026

Open Access

TL;DR

This paper evaluates the impact of resource-malleable job scheduling in HPC clusters using real workloads, demonstrating significant efficiency improvements even with partial adoption of malleability strategies.

Contribution

The paper presents a simulation study of malleable job scheduling strategies in HPC, identifying their benefits and the best-performing configurations on real-world workload traces.

Findings

1. Job turnaround times decrease by up to 67%.

2. Job wait times decrease by up to 99%.

3. Node utilization improves by up to 52%.
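The three metrics above can be made concrete with a small sketch. The definitions used here are the conventional ones (wait = start minus submit, turnaround = end minus submit, utilization = busy node-seconds over available node-seconds); the paper's exact definitions may differ. The `JobRecord` type and function names are hypothetical, and each job is assumed to hold a constant number of nodes, which a real malleable job would not.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """One finished job; hypothetical record layout for illustration."""
    submit: float  # submission time in seconds
    start: float   # time the job began running
    end: float     # time the job finished
    nodes: int     # nodes held (assumed constant here for simplicity)


def mean_wait(jobs):
    # Wait time: how long a job sat in the queue before starting.
    return sum(j.start - j.submit for j in jobs) / len(jobs)


def mean_turnaround(jobs):
    # Turnaround time: submission to completion (wait plus runtime).
    return sum(j.end - j.submit for j in jobs) / len(jobs)


def node_utilization(jobs, total_nodes):
    # Fraction of available node-seconds that were actually occupied
    # between the first submission and the last completion.
    makespan = max(j.end for j in jobs) - min(j.submit for j in jobs)
    busy = sum((j.end - j.start) * j.nodes for j in jobs)
    return busy / (total_nodes * makespan)


jobs = [
    JobRecord(submit=0, start=0, end=10, nodes=2),
    JobRecord(submit=0, start=10, end=20, nodes=4),
]
print(mean_wait(jobs), mean_turnaround(jobs), node_utilization(jobs, 4))
```

In this toy trace the second job waits because the first one occupies part of the 4-node machine, which is the kind of idle capacity that resizing running jobs is meant to reclaim.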

Abstract

Optimizing resource utilization in high-performance computing (HPC) clusters is essential for maximizing both system efficiency and user satisfaction. However, traditional rigid job scheduling often results in underutilized resources and increased job waiting times. This work evaluates the benefits of resource elasticity, where the job scheduler dynamically adjusts the resource allocation of malleable jobs at runtime. Using real workload traces from the Cori, Eagle, and Theta supercomputers, we simulate varying proportions (0-100%) of malleable jobs with the ElastiSim software. We evaluate five job scheduling strategies, including a novel one that maintains malleable jobs at their preferred resource allocation when possible. Results show that, compared to fully rigid workloads, malleable jobs yield significant improvements across all key metrics. Considering the best-performing…
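As a rough illustration of resource elasticity, the sketch below distributes a fixed pool of nodes among malleable jobs that each declare a minimum, preferred, and maximum size. This is a toy policy written for this summary, not one of the five strategies the paper evaluates and not the ElastiSim API; the `MalleableJob` type, its fields, and the three-pass ordering are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class MalleableJob:
    """Hypothetical job description; field names are assumptions."""
    name: str
    min_nodes: int
    preferred_nodes: int
    max_nodes: int


def allocate(jobs, total_nodes):
    """Toy allocation: admit at minimum, grow to preferred, then to maximum."""
    alloc = {}
    free = total_nodes

    # Pass 1: admit jobs in queue order if their minimum fits.
    for j in jobs:
        if j.min_nodes <= free:
            alloc[j.name] = j.min_nodes
            free -= j.min_nodes

    # Pass 2: grow admitted jobs toward their preferred size.
    for j in jobs:
        if j.name in alloc:
            grow = min(j.preferred_nodes - alloc[j.name], free)
            alloc[j.name] += grow
            free -= grow

    # Pass 3: hand any leftover nodes out up to each job's maximum.
    for j in jobs:
        if j.name in alloc:
            grow = min(j.max_nodes - alloc[j.name], free)
            alloc[j.name] += grow
            free -= grow

    return alloc


queue = [
    MalleableJob("a", min_nodes=2, preferred_nodes=4, max_nodes=8),
    MalleableJob("b", min_nodes=1, preferred_nodes=2, max_nodes=4),
]
print(allocate(queue, 8))
print(allocate(queue, 5))
```

A rigid scheduler would correspond to the case where minimum, preferred, and maximum are equal, so a job either fits as requested or waits.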

Taxonomy

Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques