Plan-based Job Scheduling for Supercomputers with Shared Burst Buffers

Jan Kopanski; Krzysztof Rzadca

arXiv:2109.00082·cs.DC·January 11, 2022


TL;DR

This paper studies how to schedule jobs with burst buffer requirements in HPC systems and proposes a plan-based algorithm that reduces mean waiting time and mean bounded slowdown.

Contribution

The paper introduces a burst-buffer-aware, plan-based scheduling algorithm whose plans are optimized with simulated annealing.
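To make the idea of optimizing a plan with simulated annealing concrete, here is a minimal sketch. It is not the authors' implementation. It assumes a heavily simplified model: all jobs are submitted at time 0, a plan is a permutation of the queue, jobs start in plan order as soon as both processors and burst buffer capacity are free, and the objective is total waiting time. The names `total_wait` and `anneal`, the swap neighbourhood, and the cooling parameters are all hypothetical.

```python
import math
import random


def total_wait(order, jobs, procs, bb):
    """Total waiting time of a plan (a permutation of job indices).

    jobs[i] = (procs_needed, bb_needed, runtime). Assumes every job fits
    on the empty machine and that jobs start strictly in plan order.
    """
    running = []  # (end_time, procs, bb) of jobs already placed
    t = 0         # start time of the most recently placed job
    wait = 0
    for j in order:
        p, b, run = jobs[j]
        # Resources only change when a placed job ends, so it is enough
        # to test the current time and later end times.
        candidates = sorted({t} | {e for e, _, _ in running if e > t})
        for c in candidates:
            used_p = sum(rp for e, rp, _ in running if e > c)
            used_b = sum(rb for e, _, rb in running if e > c)
            if used_p + p <= procs and used_b + b <= bb:
                t = c
                break
        running.append((t + run, p, b))
        wait += t
    return wait


def anneal(jobs, procs, bb, steps=2000, t0=100.0, cooling=0.995, seed=0):
    """Simulated annealing over permutations using random swaps."""
    rng = random.Random(seed)
    order = list(range(len(jobs)))
    cost = total_wait(order, jobs, procs, bb)
    best, best_cost = order[:], cost
    temp = t0
    for _ in range(steps):
        i, j = rng.sample(range(len(jobs)), 2)
        order[i], order[j] = order[j], order[i]
        new_cost = total_wait(order, jobs, procs, bb)
        # Always accept improvements; accept worse plans with a
        # probability that shrinks as the temperature drops.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = order[:], cost
        else:
            order[i], order[j] = order[j], order[i]  # undo the swap
        temp *= cooling
    return best, best_cost


jobs = [(2, 8, 100), (2, 9, 50), (1, 2, 200), (1, 1, 10)]
plan, cost = anneal(jobs, procs=4, bb=10)
```

Because the burst buffer is checked alongside processors when each plan is evaluated, a plan that looks good on processors alone but stalls on storage is penalized.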

Findings

01

Backfilling that ignores burst buffer reservations can significantly degrade scheduling efficiency.

02

The proposed algorithm reduces mean waiting time by over 20% compared with burst-buffer-aware SJF EASY-backfilling.

03

The proposed algorithm reduces mean bounded slowdown by 27% against the same baseline.
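For readers unfamiliar with the metric, bounded slowdown is conventionally the ratio of a job's turnaround time to its runtime, with very short runtimes clamped to a threshold so that they do not dominate the mean. The sketch below follows that common definition. The threshold `tau=10.0` seconds is an assumption, since this page does not state the value used in the paper.

```python
def bounded_slowdown(wait, run, tau=10.0):
    """Bounded slowdown of one job; tau is an assumed threshold in seconds."""
    return max(1.0, (wait + run) / max(run, tau))


def mean_bounded_slowdown(jobs, tau=10.0):
    """Mean over an iterable of (wait, run) pairs."""
    values = [bounded_slowdown(w, r, tau) for w, r in jobs]
    return sum(values) / len(values)


# A job that waits 90 s and runs 10 s has slowdown (90 + 10) / 10 = 10.
# A 1 s job with no wait is clamped to 1 rather than reported as 0.1.
example = mean_bounded_slowdown([(90, 10), (0, 1)])
```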

Abstract

The ever-increasing gap between compute and I/O performance in HPC platforms, together with the development of novel NVMe storage devices (NVRAM), led to the emergence of the burst buffer concept - an intermediate persistent storage layer logically positioned between random-access main memory and a parallel file system. Despite the development of real-world architectures as well as research concepts, resource and job management systems, such as Slurm, provide only marginal support for scheduling jobs with burst buffer requirements, in particular ignoring burst buffers when backfilling. We investigate the impact of burst buffer reservations on the overall efficiency of online job scheduling for common algorithms: First-Come-First-Served (FCFS) and Shortest-Job-First (SJF) EASY-backfilling. We evaluate the algorithms in a detailed simulation with I/O side effects. Our results indicate…
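The backfilling issue described in the abstract can be illustrated with a small sketch of an EASY-style admission check that treats burst buffer capacity as a second reserved resource. This is an illustration under stated assumptions, not Slurm's code or the paper's simulator: the classes `Job` and `Running`, the functions `head_reservation` and `can_backfill`, and the use of user-supplied walltimes as exact end times are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Job:
    procs: int
    bb: int        # requested burst buffer capacity
    walltime: int  # user-estimated runtime


@dataclass
class Running:
    procs: int
    bb: int
    end: int       # expected completion time


def head_reservation(head, running, free_procs, free_bb, now):
    """Earliest start of the queue head, plus the resources left over then."""
    procs, bb = free_procs, free_bb
    if head.procs <= procs and head.bb <= bb:
        return now, procs - head.procs, bb - head.bb
    for r in sorted(running, key=lambda r: r.end):
        procs += r.procs
        bb += r.bb
        if head.procs <= procs and head.bb <= bb:
            return r.end, procs - head.procs, bb - head.bb
    raise ValueError("head job does not fit even on an empty machine")


def can_backfill(cand, head, running, free_procs, free_bb, now):
    """True if cand can start now without delaying the head's reservation."""
    if cand.procs > free_procs or cand.bb > free_bb:
        return False
    start, spare_procs, spare_bb = head_reservation(
        head, running, free_procs, free_bb, now)
    if now + cand.walltime <= start:
        return True  # finishes before the head needs the resources
    # Otherwise it must fit into what the head leaves unused, on BOTH resources.
    return cand.procs <= spare_procs and cand.bb <= spare_bb


# Machine with 4 processors and 10 units of burst buffer.
running = [Running(procs=2, bb=8, end=100)]
head = Job(procs=2, bb=9, walltime=50)  # blocked on burst buffer until t=100
long_job = Job(procs=1, bb=2, walltime=200)
short_job = Job(procs=1, bb=2, walltime=50)
```

In this example `long_job` fits on processors at every point, so a check that considers processors alone would admit it, yet it would hold burst buffer capacity that the head job needs at time 100. The two-resource check rejects it and admits `short_job`, which finishes before the reservation.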


Code & Models

Repositories

jankopanski/burst-buffer-scheduling
Official
