TL;DR
Legio simplifies adding fault resiliency to embarrassingly parallel MPI applications by transparently integrating ULFM, enabling continued execution after node failures with minimal overhead and scalability impact.
Contribution
Legio provides a transparent, non-intrusive framework that hides ULFM complexity, facilitating fault resiliency in MPI applications without requiring code modifications.
Findings
Negligible overhead introduced by Legio on large-scale MPI runs
Legio maintains MPI scalability despite fault recovery features
Successful integration and fault injection testing on real-world applications
Abstract
As HPC machines grow in size, faults are becoming an eventuality that applications must face. Natively, MPI provides no support for continuing execution after a fault is detected, and this limitation is becoming more and more constraining. ULFM (User-Level Failure Mitigation) offers a way to survive a fault during application execution, at the cost of code modifications. ULFM is intrusive in the application and also requires a deep understanding of its recovery procedures. In this paper we propose Legio, a framework that lowers the complexity of introducing resiliency in an embarrassingly parallel MPI application. By hiding ULFM behind the MPI calls, the library is able to expose resiliency features to the application transparently, thus removing any integration effort. Upon fault, the failed nodes…
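The idea of hiding fault handling behind the usual communication calls can be illustrated with a toy model. The sketch below uses no real MPI and is not Legio's or ULFM's API: `ToyComm`, `ResilientComm`, `RankFailure`, and `demo` are hypothetical names invented for illustration. It also assumes that recovery means dropping the failed ranks and retrying on the survivors, in the spirit of a ULFM-style communicator shrink; the abstract's own description of what happens upon a fault is truncated, so that part is an assumption.

```python
"""Toy model of transparent fault handling behind a collective call (no real MPI)."""


class RankFailure(Exception):
    """Raised by the toy communicator when a participating rank is dead."""

    def __init__(self, failed):
        super().__init__(f"failed ranks: {sorted(failed)}")
        self.failed = set(failed)


class ToyComm:
    """Hypothetical stand-in for a communicator: a list of ranks, some possibly dead."""

    def __init__(self, ranks, dead=()):
        self.ranks = list(ranks)
        self.dead = set(dead)

    def allreduce_sum(self, values):
        # A plain communicator errors out as soon as any member rank is dead.
        failed = self.dead.intersection(self.ranks)
        if failed:
            raise RankFailure(failed)
        return sum(values[r] for r in self.ranks)

    def shrink(self):
        # Assumed recovery step: build a new communicator from the survivors only.
        return ToyComm([r for r in self.ranks if r not in self.dead], self.dead)


class ResilientComm:
    """Wrapper exposing the same call; the caller never sees the failure."""

    def __init__(self, comm):
        self._comm = comm

    @property
    def ranks(self):
        return list(self._comm.ranks)

    def allreduce_sum(self, values):
        while True:
            try:
                return self._comm.allreduce_sum(values)
            except RankFailure:
                # Swap in the shrunken communicator and retry the same operation.
                self._comm = self._comm.shrink()


def demo():
    values = {0: 10, 1: 20, 2: 30, 3: 40}
    comm = ResilientComm(ToyComm(range(4), dead={2}))
    total = comm.allreduce_sum(values)  # rank 2 is dead; the call still returns
    return total, comm.ranks


if __name__ == "__main__":
    print(demo())
```

The application-side code calls `allreduce_sum` exactly as it would on the plain communicator. The result simply omits the contribution of the lost rank, which is only acceptable when the work items are independent, as in an embarrassingly parallel application.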
