Building a fault tolerant application using the GASPI communication layer
Faisal Shahzad, Moritz Kreutzer, Thomas Zeiser, Rui Machado, Andreas Pieper, Georg Hager, Gerhard Wellein

TL;DR
This paper demonstrates how to build fault-tolerant applications using the GASPI communication layer, focusing on low-cost failure detection and recovery mechanisms suitable for exascale computing.
Contribution
It introduces a fault-tolerant application framework based on GASPI, extending checkpointing with efficient failure detection and recovery, and analyzes its overhead and scalability.
Findings
Failure detection causes no overhead in failure-free runs
Recovery overhead is acceptable and scales well
The approach effectively handles process failures in exascale environments
Abstract
It is commonly agreed that highly parallel software on Exascale computers will suffer from many more runtime failures due to the decreasing trend in the mean time to failure (MTTF). Therefore, it is not surprising that a lot of research is going on in the area of fault tolerance and fault mitigation. Applications should survive a failure and/or be able to recover with minimal cost. MPI is not yet very mature in handling failures; the User-Level Failure Mitigation (ULFM) proposal, currently the most promising approach, is still in its prototype phase. In our work we use GASPI, which is a relatively new communication library based on the PGAS model. It provides the missing features to allow the design of fault-tolerant applications. Instead of introducing algorithm-based fault tolerance in its true sense, we demonstrate how we can build on (existing) clever checkpointing and extend…
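To make the checkpoint-plus-detection idea in the abstract concrete, here is a minimal single-process sketch of rollback recovery. It is a toy simulation, not the paper's implementation and not GASPI code: the function `run_with_checkpoints`, the exception `WorkerFailure`, and the parameter `fail_at_steps` are all hypothetical names introduced for illustration, and a failure is "detected" simply by looking up the current step in a set.

```python
import copy


class WorkerFailure(Exception):
    """Raised by the simulated health check (hypothetical, for illustration)."""


def run_with_checkpoints(n_steps, checkpoint_every, fail_at_steps):
    """Run a toy iteration, rolling back to the last checkpoint on failure.

    fail_at_steps: step indices at which one simulated failure is detected.
    Returns (final_state, number_of_recoveries).
    """
    state = {"step": 0, "value": 0}
    checkpoint = copy.deepcopy(state)      # initial checkpoint
    pending_failures = set(fail_at_steps)  # each failure fires only once
    recoveries = 0

    while state["step"] < n_steps:
        try:
            # Simulated failure detection before doing the step's work.
            if state["step"] in pending_failures:
                pending_failures.discard(state["step"])
                raise WorkerFailure(state["step"])

            # The "computation": accumulate the step index.
            state["value"] += state["step"]
            state["step"] += 1

            # Periodic checkpoint of the full application state.
            if state["step"] % checkpoint_every == 0:
                checkpoint = copy.deepcopy(state)
        except WorkerFailure:
            # Recovery: discard work since the last checkpoint and redo it.
            state = copy.deepcopy(checkpoint)
            recoveries += 1

    return state, recoveries


if __name__ == "__main__":
    final_state, recoveries = run_with_checkpoints(10, 3, {4, 7})
    print(final_state, recoveries)
```

In this sketch the result is the same with and without injected failures; only the amount of redone work changes, which is the cost that the checkpoint interval trades against checkpointing overhead.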
Taxonomy
Topics: Distributed systems and fault tolerance · Parallel Computing and Optimization Techniques · Age of Information Optimization
