TL;DR
This paper addresses the challenge of efficiently scheduling jobs with burst buffer requirements in HPC systems. It proposes a plan-based algorithm that reduces mean waiting time by over 20% and mean bounded slowdown by 27%.
Contribution
It introduces a burst-buffer-aware plan-based scheduling algorithm with simulated annealing optimization, enhancing job scheduling efficiency in HPC environments.
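To make the idea of plan-based scheduling with simulated annealing concrete, here is a minimal sketch. It is not the paper's implementation. It assumes that all jobs are submitted at time 0, that the objective is total waiting time, and that a plan is built by placing jobs one by one at the earliest start where both processors and burst buffer fit. The names Job, build_plan, plan_cost and anneal, the swap-two-jobs move, and the cooling parameters are all hypothetical choices made for illustration.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    procs: int    # requested processors
    bb: int       # requested burst buffer, arbitrary units
    runtime: int  # walltime estimate, must be > 0


def fits(job, t, placed, total_procs, total_bb):
    """True if job can run in [t, t + runtime) next to the already placed jobs."""
    end = t + job.runtime
    # Usage only rises at start times, so checking t and later starts is enough.
    points = [t] + [s for s, _, _ in placed if t < s < end]
    for p in points:
        used_procs = sum(j.procs for s, e, j in placed if s <= p < e)
        used_bb = sum(j.bb for s, e, j in placed if s <= p < e)
        if used_procs + job.procs > total_procs or used_bb + job.bb > total_bb:
            return False
    return True


def build_plan(order, total_procs, total_bb):
    """Place jobs in the given order; assumes each job fits an empty machine."""
    placed, starts = [], {}
    for job in order:
        # Candidate starts: time 0 and every end time of a placed job.
        for t in sorted({0} | {e for _, e, _ in placed}):
            if fits(job, t, placed, total_procs, total_bb):
                placed.append((t, t + job.runtime, job))
                starts[job.name] = t
                break
    return starts


def plan_cost(order, total_procs, total_bb):
    """Total waiting time, assuming every job was submitted at time 0."""
    return sum(build_plan(order, total_procs, total_bb).values())


def anneal(jobs, total_procs, total_bb, steps=2000, t0=10.0, cooling=0.995, seed=0):
    """Search over queue permutations with simulated annealing."""
    rng = random.Random(seed)
    current = list(jobs)
    cur_cost = plan_cost(current, total_procs, total_bb)
    best, best_cost = list(current), cur_cost
    if len(current) < 2:
        return best, best_cost
    temp = t0
    for _ in range(steps):
        i, j = rng.sample(range(len(current)), 2)
        cand = list(current)
        cand[i], cand[j] = cand[j], cand[i]  # neighbour: swap two jobs
        cand_cost = plan_cost(cand, total_procs, total_bb)
        delta = cand_cost - cur_cost
        # Always accept improvements; accept worse plans with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = list(current), cur_cost
        temp *= cooling
    return best, best_cost


# Hypothetical workload on a machine with 8 processors and 10 burst buffer units.
jobs = [Job("A", 4, 8, 10), Job("B", 4, 8, 2), Job("C", 4, 2, 2)]
best_order, best_cost = anneal(jobs, total_procs=8, total_bb=10)
```

In this toy workload, jobs A and B cannot overlap because together they exceed the burst buffer capacity, so the order in which they are planned decides how long the other one waits. A real scheduler would also handle arrival times, re-planning on each event, and I/O contention, none of which is modelled here.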
Findings
Burst buffer reservations are crucial for scheduling efficiency.
The proposed algorithm reduces mean waiting time by over 20%.
The algorithm decreases mean bounded slowdown by 27%.
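For readers unfamiliar with the metric, the sketch below computes bounded slowdown using the definition common in the job scheduling literature: turnaround time divided by run time, with the run time floored at a threshold so that very short jobs do not dominate the mean. The threshold value of 10 seconds is an assumption; the summary does not state which threshold the paper uses.

```python
def bounded_slowdown(wait_time, run_time, tau=10.0):
    """Bounded slowdown of a single job.

    tau is the run-time floor in seconds (an assumed value, not taken
    from the paper). The result is never below 1.
    """
    return max((wait_time + run_time) / max(run_time, tau), 1.0)


def mean_bounded_slowdown(jobs, tau=10.0):
    """Mean over an iterable of (wait_time, run_time) pairs."""
    jobs = list(jobs)
    return sum(bounded_slowdown(w, r, tau) for w, r in jobs) / len(jobs)
```

For example, a job that waits 90 s and runs 10 s has a bounded slowdown of 10, while a 1 s job that starts immediately has a bounded slowdown of 1 rather than 0.1.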
Abstract
The ever-increasing gap between compute and I/O performance in HPC platforms, together with the development of novel NVMe storage devices (NVRAM), led to the emergence of the burst buffer concept - an intermediate persistent storage layer logically positioned between random-access main memory and a parallel file system. Despite the development of real-world architectures as well as research concepts, resource and job management systems, such as Slurm, provide only marginal support for scheduling jobs with burst buffer requirements, in particular ignoring burst buffers when backfilling. We investigate the impact of burst buffer reservations on the overall efficiency of online job scheduling for common algorithms: First-Come-First-Served (FCFS) and Shortest-Job-First (SJF) EASY-backfilling. We evaluate the algorithms in a detailed simulation with I/O side effects. Our results indicate…
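The role of burst buffer reservations in backfilling can be illustrated with a minimal sketch of the EASY backfilling test extended to two resources. This is an illustration under stated assumptions, not the paper's or Slurm's code: the names Job, head_reservation and can_backfill are hypothetical, running jobs are assumed to end exactly at their walltime, and the head job is assumed to fit on an empty machine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    procs: int     # requested processors
    bb: int        # requested burst buffer, arbitrary units
    walltime: int  # user-provided runtime estimate


def head_reservation(head, now, free_procs, free_bb, running):
    """Earliest time the head job fits in BOTH resources.

    running is a list of (end_time, Job). Returns the reservation time
    and the processors and burst buffer left over once the head starts.
    """
    procs, bb, t = free_procs, free_bb, now
    for end, job in sorted(running, key=lambda r: r[0]):
        if procs >= head.procs and bb >= head.bb:
            break
        # Release the resources of the next job to finish.
        procs += job.procs
        bb += job.bb
        t = end
    return t, procs - head.procs, bb - head.bb


def can_backfill(cand, now, free_procs, free_bb, shadow, spare_procs, spare_bb):
    """EASY rule: a candidate may jump ahead only if it cannot delay the head job."""
    if cand.procs > free_procs or cand.bb > free_bb:
        return False  # does not fit right now
    if now + cand.walltime <= shadow:
        return True   # finishes before the head job's reservation
    # Otherwise it must fit in what the head job leaves free, in both resources.
    return cand.procs <= spare_procs and cand.bb <= spare_bb


# Hypothetical state: 8 processors and 10 burst buffer units in total;
# one running job holds 4 processors and 8 units until t = 100.
running = [(100, Job(4, 8, 100))]
head = Job(4, 9, 50)  # enough processors are free now, but not enough burst buffer
shadow, spare_procs, spare_bb = head_reservation(head, 0, 4, 2, running)
long_job = Job(2, 2, 500)
allowed = can_backfill(long_job, 0, 4, 2, shadow, spare_procs, spare_bb)
```

Here the long candidate is rejected only because of the burst buffer check at the reservation time. A test that looked at processors alone would admit it and push the head job past its reservation, which is the kind of effect the abstract attributes to ignoring burst buffers when backfilling.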
