Resource Optimization with MPI Process Malleability for Dynamic Workloads in HPC Clusters
Sergio Iserte, Iker Martín-Álvarez, Krzysztof Rojek, José I. Aliaga, Maribel Castillo, Weronika Folwarska, and Antonio J. Peña

TL;DR
This paper improves resource management in HPC clusters by making MPI jobs malleable, so that their process count can be reconfigured at run time, which raises resource utilization and shortens workload completion time.
Contribution
It introduces new malleability strategies and integrates them into the DMR framework, advancing dynamic resource management in HPC environments.
Findings
Reduced workload completion time by 40%
Increased resource utilization by over 20%
Demonstrated effectiveness on a supercomputer
Abstract
Dynamic resource management is essential for optimizing computational efficiency in modern high-performance computing (HPC) environments, particularly as systems scale. While research has demonstrated the benefits of malleability in resource management systems (RMS), the adoption of such techniques in production environments remains limited due to challenges in standardization, interoperability, and usability. Addressing these gaps, this paper extends our prior work on the Dynamic Management of Resources (DMR) framework, which provides a modular and user-friendly approach to dynamic resource allocation. Building upon the original DMRlib reconfiguration runtime, this work integrates new methodology from the Malleability Module (MaM) of the Proteo framework, further enhancing reconfiguration capabilities with new spawning strategies and data redistribution methods. In this paper, we…
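The abstract mentions data redistribution during reconfiguration without detailing the methods. As a rough illustration only, the sketch below computes which index ranges would have to move when a block-distributed array is resized from one process count to another. The function names `block_ranges` and `redistribution_plan` are hypothetical and are not part of DMR, DMRlib, MaM, or MPI; the balanced block layout is an assumption chosen for simplicity.

```python
def block_ranges(n_items, n_procs):
    """Return one [start, end) range per rank for a balanced block layout."""
    base, extra = divmod(n_items, n_procs)
    ranges = []
    start = 0
    for rank in range(n_procs):
        # The first `extra` ranks hold one additional item.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def redistribution_plan(n_items, old_procs, new_procs):
    """List (src_rank, dst_rank, start, end) transfers needed for a resize.

    Hypothetical helper: intersects every old block with every new block
    and keeps the non-empty overlaps.
    """
    old = block_ranges(n_items, old_procs)
    new = block_ranges(n_items, new_procs)
    plan = []
    for src, (old_start, old_end) in enumerate(old):
        for dst, (new_start, new_end) in enumerate(new):
            lo = max(old_start, new_start)
            hi = min(old_end, new_end)
            if lo < hi:
                plan.append((src, dst, lo, hi))
    return plan


if __name__ == "__main__":
    # Expanding a 10-element array from 2 to 4 processes.
    for transfer in redistribution_plan(10, 2, 4):
        print(transfer)
```

Entries with the same source and destination rank describe data that stays in place; in a real runtime the remaining entries would map to point-to-point or collective transfers between the old and new process groups.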
