MPI Malleability Validation under Replayed Real-World HPC Conditions
S. Iserte, M. Madon, G. Da, J. Pierson, A. J. Peña

TL;DR
This paper presents a methodology to validate MPI malleability in real HPC environments by replaying workload logs, demonstrating a 27% workload time reduction without delaying baseline jobs.
Contribution
It introduces a novel approach for validating Dynamic Resource Management (DRM) techniques, such as malleability, by replaying real workload logs on HPC systems.
Findings
Validated the methodology on a 125-node HPC cluster.
Achieved 27% reduction in workload time with malleability.
Maintained resource utilization rate despite queueing delays.
Abstract
Dynamic Resource Management (DRM) techniques can be leveraged to maximize throughput and resource utilization in computational clusters. Although DRM has been extensively studied through analytical workloads and simulations, skepticism persists among administrators and end users regarding the feasibility of these techniques under real-world conditions. To address this problem, we propose a novel methodology for validating DRM techniques, such as malleability, in realistic scenarios that reproduce actual cluster conditions of jobs and users by replaying workload logs on a High-Performance Computing (HPC) infrastructure. Our methodology is capable of adapting the workload to the target cluster. We evaluate our methodology in a malleability-enabled 125-node partition of the MareNostrum 5 supercomputer. Our results validate the proposed method and assess the benefits of MPI malleability on a novel use case…
