Legio: Fault Resiliency for Embarrassingly Parallel MPI Applications

Roberto Rocco; Davide Gadioli; Gianluca Palermo

arXiv:2104.14246·cs.DC·June 22, 2021

TL;DR

Legio simplifies adding fault resiliency to embarrassingly parallel MPI applications by transparently integrating ULFM, enabling continued execution after node failures with minimal overhead and scalability impact.

Contribution

Legio provides a transparent, non-intrusive framework that hides ULFM complexity, facilitating fault resiliency in MPI applications without requiring code modifications.

Findings

01. Negligible overhead introduced by Legio on large-scale MPI runs

02. Legio maintains MPI scalability despite fault recovery features

03. Successful integration and fault injection testing on real-world applications

Abstract

As HPC machines grow in size, faults are becoming an eventuality that applications must face. Natively, MPI provides no support for continuing execution after a fault is detected, and this limitation is becoming more and more constraining. ULFM (User Level Failure Mitigation) gives MPI a possible way to get past a fault during application execution, at the cost of code modifications. ULFM is intrusive in the application and also requires a deep understanding of its recovery procedures. In this paper we propose Legio, a framework that lowers the complexity of introducing resiliency in an embarrassingly parallel MPI application. By hiding ULFM behind the MPI calls, the library can expose resiliency features to the application transparently, thus removing any integration effort. Upon fault, the failed nodes…
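The idea of hiding recovery behind an unchanged call interface can be illustrated with a toy model. The sketch below is not Legio's code and does not use MPI or the ULFM API: `ToyComm`, `ResilientComm`, `ProcessFailed`, `reduce_sum`, and `shrink` are all hypothetical names invented for this illustration, and the recovery policy (drop the crashed ranks and retry) is an assumption chosen because it suits embarrassingly parallel work, where the surviving ranks do not depend on the failed ones.

```python
class ProcessFailed(Exception):
    """Hypothetical stand-in for the error a fault-aware runtime would report."""

    def __init__(self, rank):
        super().__init__(f"rank {rank} failed")
        self.rank = rank


class ToyComm:
    """Toy communicator: an ordered list of ranks plus the set of crashed ranks."""

    def __init__(self, ranks, crashed=()):
        self.ranks = list(ranks)
        self.crashed = set(crashed)

    def reduce_sum(self, values):
        # A crashed rank that is still a member makes the collective fail.
        for r in self.ranks:
            if r in self.crashed:
                raise ProcessFailed(r)
        return sum(values[r] for r in self.ranks)

    def shrink(self):
        # Build a new communicator that contains only the surviving ranks.
        survivors = [r for r in self.ranks if r not in self.crashed]
        return ToyComm(survivors, self.crashed)


class ResilientComm:
    """Wrapper with the same call signature; recovery happens inside it."""

    def __init__(self, comm):
        self._comm = comm

    @property
    def ranks(self):
        return list(self._comm.ranks)

    def reduce_sum(self, values):
        while True:
            try:
                return self._comm.reduce_sum(values)
            except ProcessFailed:
                # Assumed policy: discard failed ranks, then retry the call.
                self._comm = self._comm.shrink()


# The caller's code is identical with or without the wrapper.
values = {0: 10, 1: 20, 2: 30, 3: 40}
comm = ResilientComm(ToyComm(range(4), crashed={2}))
total = comm.reduce_sum(values)
print(total)       # 70: rank 2's contribution is lost, the rest is kept
print(comm.ranks)  # [0, 1, 3]
```

The point of the design is that the application only ever calls `reduce_sum`; the failure handling never appears in its code. The price, in this toy model, is that the result silently omits the failed rank's contribution, which is acceptable only when partial results are still useful.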

Code & Models

Repositories

Robyroc/Legio (official)
