Extending SLURM for Dynamic Resource-Aware Adaptive Batch Scheduling
Mohak Chadha, Jophin John, Michael Gerndt

TL;DR
This paper extends the SLURM batch system to support malleable jobs using Invasive MPI, enabling dynamic resource management for improved performance and power efficiency in HPC systems.
Contribution
It introduces a new adaptive parallel paradigm called Invasive MPI and implements two malleable job scheduling strategies in SLURM for performance and power optimization.
Findings
Performance-aware scheduling improves makespan and system utilization.
Power-aware strategy enables dynamic power corridor management.
Enhanced resource adaptivity benefits HPC system efficiency.
Abstract
With growing constraints on power budgets and increasing hardware failure rates, the operation of future exascale systems faces several challenges. To address them, resource awareness and adaptivity through malleable jobs have been actively researched in the HPC community. Malleable jobs can change their computing resources at runtime and can significantly improve HPC system performance. However, due to the rigid nature of popular parallel programming paradigms such as MPI and the lack of support for dynamic resource management in batch systems, malleable jobs have remained largely unrealized. In this paper, we extend the SLURM batch system to support the execution and batch scheduling of malleable jobs. The malleable applications are written using a new adaptive parallel paradigm called Invasive MPI, which extends the MPI standard to support resource adaptivity at runtime. We propose two…
