TL;DR
This paper introduces an extension to C++ MPI programming that enables exception propagation to handle errors and faults more gracefully, improving robustness and supporting future fault-tolerant MPI features.
Contribution
It presents a novel approach for propagating exceptions in C++ MPI programs, integrating with MPI-ULFM for fault tolerance, and demonstrates a prototype implementation using MPI-3.0 features.
Findings
Enables exception propagation across MPI processes to prevent deadlocks.
Maps MPI failures to local exceptions for better error handling.
Supports asynchronous local failure recovery and future fault-tolerant MPI extensions.
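The deadlock-avoidance idea in the first finding can be sketched without an MPI installation. The following is a single-process toy model, not the paper's implementation and not MPI code: `ToyComm`, `PropagatedError`, `propagate`, `receive`, and `run_demo` are all hypothetical names introduced here for illustration. It assumes only that a rank which catches a local exception can mark the shared communicator as failed, so that a peer waiting on a message sees a remote exception instead of blocking forever.

```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a communicator shared by all ranks (NOT an MPI type).
struct ToyComm {
    std::optional<int> failed_rank;           // set once some rank reports an exception
    std::vector<std::optional<int>> mailbox;  // one message slot per destination rank
    explicit ToyComm(int size) : mailbox(size) {}
};

// Hypothetical exception type raised on a rank whose peer has failed.
struct PropagatedError : std::runtime_error {
    int origin;  // rank on which the original exception occurred
    explicit PropagatedError(int rank)
        : std::runtime_error("remote exception on rank " + std::to_string(rank)),
          origin(rank) {}
};

// Called from a catch handler: record the failure so that peers can observe it.
void propagate(ToyComm& comm, int rank) {
    if (!comm.failed_rank) comm.failed_rank = rank;
}

// Toy receive: deliver the message if present, otherwise surface a peer failure.
// A real blocking receive would wait here; the toy model reports that case as an error.
int receive(ToyComm& comm, int rank) {
    if (comm.mailbox[rank]) return *comm.mailbox[rank];
    if (comm.failed_rank) throw PropagatedError(*comm.failed_rank);
    throw std::logic_error("would block: no message and no failure reported");
}

// Runs "rank 0" and then "rank 1" sequentially and reports what rank 1 observed.
std::string run_demo(bool rank0_fails) {
    ToyComm comm(2);

    // Rank 0: either fails before sending or sends 42 to rank 1.
    try {
        if (rank0_fails) throw std::runtime_error("local failure before send");
        comm.mailbox[1] = 42;
    } catch (const std::runtime_error&) {
        propagate(comm, 0);  // without this step, rank 1 would wait forever
    }

    // Rank 1: waits for the message from rank 0.
    try {
        return "received " + std::to_string(receive(comm, 1));
    } catch (const PropagatedError& e) {
        return e.what();
    }
}
```

In this sketch, `run_demo(true)` yields the propagated error message on rank 1, while `run_demo(false)` yields the delivered value. In a real MPI program the shared flag would have to be replaced by actual communication between processes, which is the part the paper addresses.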
Abstract
C++ advocates exceptions as the preferred way to handle unexpected behaviour of an implementation in the code. This does not integrate well with the error handling of MPI, which more or less always results in program termination in case of MPI failures. In particular, a local C++ exception can currently lead to a deadlock due to unfinished communication requests on remote hosts. At the same time, future MPI implementations are expected to include an API to continue computations even after a hard fault (node loss), i.e. the worst possible unexpected behaviour. In this paper we present an approach that adds extended exception propagation support to C++ MPI programs. Our technique allows local exceptions to be propagated to remote hosts to avoid deadlocks, and MPI failures on remote hosts to be mapped to local exceptions. A use case of particular interest is asynchronous 'local failure local…
