Implicit Actions and Non-blocking Failure Recovery with MPI
Aurelien Bouteiller, George Bosilca

TL;DR
This paper advances MPI fault tolerance by enabling asynchronous, non-blocking failure recovery, allowing multiple components to recover simultaneously and overlap recovery with ongoing computations, thus improving resilience and efficiency.
Contribution
It introduces mechanisms for consistent fault reporting, scoped recovery, and overlapping recovery activities, enhancing MPI ULFM's ability to support asynchronous, non-blocking failure recovery.
Findings
Enables applications to assess the success of a computational phase without incurring an unacceptable performance hit.
Allows independent recovery of application components and groups.
Overlaps system and application recovery activities for efficiency.
Abstract
Scientific applications have long embraced MPI as the environment of choice to execute on large distributed systems. The User-Level Failure Mitigation (ULFM) specification extends the MPI standard to address resilience and enable MPI applications to restore their communication capability after a failure. This work builds upon the wide body of experience gained in the field to eliminate a gap between current practice and the ideal, more asynchronous, recovery model in which the fault tolerance activities of multiple components can be carried out simultaneously and overlap. This work proposes to: (1) provide the required consistency in fault reporting to applications (i.e., enable an application to assess the success of a computational phase without incurring an unacceptable performance hit); (2) bring forward the building blocks that permit the effective scoping of fault recovery in…
Taxonomy
Topics: Distributed systems and fault tolerance · Cloud Computing and Resource Management · Software System Performance and Reliability
