Soft Errors Detection and Automatic Recovery based on Replication combined with different Levels of Checkpointing
Diego Montezanti, Enzo Rucci, Armando De Giusti, Marcelo Naiouf, Dolores Rexachs, Emilio Luque

TL;DR
The paper introduces SEDAR, a multi-level fault detection and recovery methodology for HPC systems, combining process replication with checkpointing to improve reliability against transient errors in scientific applications.
Contribution
It presents a novel multi-level approach integrating detection and recovery techniques tailored for HPC fault tolerance, with a comprehensive model and overhead analysis.
Findings
SEDAR effectively detects and recovers from transient faults.
Different levels of checkpointing offer trade-offs between overhead and fault coverage.
The methodology is adaptable to various HPC system needs.
Abstract
Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent undetected errors will occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR, which is a methodology that improves system reliability against transient faults when running parallel message-passing applications. Our approach, based on process replication for detection, combined with different levels of checkpointing for automatic recovery, has the goal of helping users of scientific applications to obtain executions with correct results. SEDAR is structured in three levels: (1) only detection and safe-stop with notification; (2) recovery based on multiple system-level checkpoints; and (3) recovery based on a single valid user-level checkpoint. As each of these variants supplies a particular coverage but involves limitations and…
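The core idea in the abstract (compare a replica's result against the primary to detect a silent error, then roll back to a checkpoint and re-execute) can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not SEDAR's implementation: the names `run_replicated`, `accumulate`, `ckpt_every`, and `fault_at` are hypothetical, the "replica" is a second call in the same process rather than a separate process of a message-passing application, and the fault is injected artificially. It loosely resembles recovery from a single validated checkpoint, since a checkpoint is stored only after both copies agree.

```python
import copy


def run_replicated(step, state, n_steps, ckpt_every=2, fault_at=None):
    """Toy detect-and-rollback loop (hypothetical helper, not the paper's code).

    Each step is computed twice. A mismatch is treated as a transient fault,
    and execution resumes from the last checkpoint, which was taken only
    after both copies agreed on the state.
    """
    ckpt = (0, copy.deepcopy(state))  # (step index, validated state)
    detections = 0
    i = 0
    while i < n_steps:
        primary = step(copy.deepcopy(state))
        replica = step(copy.deepcopy(state))
        if fault_at == i:
            replica += 1      # simulate a one-off silent corruption (numeric state assumed)
            fault_at = None   # transient: it does not recur on re-execution
        if primary != replica:
            detections += 1
            i, state = ckpt[0], copy.deepcopy(ckpt[1])  # roll back and redo
            continue
        state = primary
        i += 1
        if i % ckpt_every == 0:
            ckpt = (i, copy.deepcopy(state))  # replicas agreed, so this state is trusted
    return state, detections


def accumulate(x):
    """Stand-in for one step of application work."""
    return x + 3


clean, clean_hits = run_replicated(accumulate, 0, 5)
faulty, faulty_hits = run_replicated(accumulate, 0, 5, fault_at=3)
print(clean, clean_hits, faulty, faulty_hits)
```

In this sketch the faulty run reaches the same final value as the clean run, at the cost of re-executing the steps since the last checkpoint; that re-execution cost is the kind of overhead the checkpointing levels trade against coverage.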
