Parallel Spawning Strategies for Dynamic-Aware MPI Applications
Iker Martín-Álvarez, José I. Aliaga, Maribel Castillo

TL;DR
This paper introduces a novel parallel spawning strategy for MPI applications that efficiently manages dynamic resource reallocation by reusing processes and terminating unneeded ones, significantly reducing reconfiguration costs.
Contribution
It proposes a cooperative parallel spawning approach that overcomes existing limitations in MPI malleability, enabling efficient expansion and shrinking of resources during runtime.
Findings
The strategy maintains competitive expansion times with minimal overhead.
It enables shrink operations that are at least 1387 times faster.
Its effectiveness is validated on systems with equal and different core counts.
Abstract
Dynamic resource management is an increasingly important capability of High-Performance Computing systems, as it enables jobs to adjust their resource allocation at runtime. This capability can reduce workload makespan, substantially decreasing job waiting times and optimizing resource allocation. In this context, malleability refers to the ability of applications to adapt to new resource allocations during execution. Although beneficial, malleability incurs significant reconfiguration costs, making the reduction of these costs an important research topic. Some existing solutions for MPI applications respawn the entire application, which is expensive because the original processes are not reused. Other MPI solutions reuse them, but fail to fully release unneeded processes when shrinking, since some ranks within the same communicator remain active across nodes, preventing the…
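The shrinking limitation mentioned in the abstract can be pictured with a small toy model. The sketch below is not the paper's algorithm: it assumes, purely for illustration, that all processes sharing a spawned communicator (a "group") can only be terminated together, so a node stays occupied while any group with ranks on it is still needed elsewhere. The function name `releasable_nodes` and the node names are hypothetical.

```python
def releasable_nodes(groups, keep):
    """Return the nodes that hold no live processes after a shrink.

    groups: list of sets of node names, one set per spawned group
            (ASSUMPTION: a group can only be terminated as a whole).
    keep:   set of nodes the job retains after shrinking.
    """
    all_nodes = set().union(*groups) if groups else set()
    busy = set()
    for nodes in groups:
        if nodes & keep:   # the group is still needed on a kept node
            busy |= nodes  # so every node it spans stays occupied
    return all_nodes - busy


# One group spanning four nodes versus one independent group per node.
one_group = [{"n0", "n1", "n2", "n3"}]
per_node = [{"n0"}, {"n1"}, {"n2"}, {"n3"}]
keep = {"n0", "n1"}

print(sorted(releasable_nodes(one_group, keep)))  # []
print(sorted(releasable_nodes(per_node, keep)))   # ['n2', 'n3']
```

Under this assumption, a single group that spans every node frees nothing when the job shrinks to two nodes, whereas groups confined to one node each let the two unneeded nodes be vacated.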
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Cloud Computing and Resource Management
