Algorithm-Based Fault Tolerance for Parallel Stencil Computations
Aurélien Cavelan, Florina M. Ciorba

TL;DR
This paper introduces a novel algorithm-based fault tolerance (ABFT) method for parallel stencil computations in HPC systems that detects and corrects silent data corruptions with less than 8% overhead in the protected applications.
Contribution
It presents a new ABFT approach tailored for 2D and 3D stencil computations, including formal proofs and experimental validation on real applications.
Findings
Achieves less than 8% overhead in protected applications
Accurately detects and corrects silent data corruptions
Offline ABFT offers higher correction accuracy with slight overhead
Abstract
The increase in HPC system size and complexity, together with increasing on-chip transistor density, power limitations, and number of components, renders modern HPC systems subject to soft errors. Silent data corruptions (SDCs) are typically caused by such soft errors in the form of bit-flips in the memory subsystem and hinder the correctness of scientific applications. This work addresses the problem of protecting a class of iterative computational kernels, called stencils, against SDCs when executing on parallel HPC systems. Existing SDC detection and correction methods are in general either inaccurate, inefficient, or target specific application classes that do not include stencils. This work proposes a novel algorithm-based fault tolerance (ABFT) method to protect scientific applications that contain arbitrary stencil computations against SDCs. The ABFT method can be applied both…
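To make the idea of checksum-based protection for stencils concrete, here is a minimal sketch in Python. It is not the authors' scheme: it assumes a 2D 5-point stencil with constant weights and periodic boundaries, and every function name (`stencil_sweep`, `predicted_sums`, `detect_and_correct`) is hypothetical. Under those assumptions, the row and column sums after one sweep can be predicted from the sums before it, so a single corrupted cell shows up as exactly one mismatching row and one mismatching column.

```python
# Illustrative sketch only: assumes a 5-point stencil with constant weights
# and periodic (wrap-around) boundaries. Not the method from the paper.

def stencil_sweep(grid, w_center, w_neighbor):
    """One Jacobi-style sweep of a 5-point stencil with periodic boundaries."""
    n, m = len(grid), len(grid[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            out[i][j] = w_center * grid[i][j] + w_neighbor * (
                grid[(i - 1) % n][j] + grid[(i + 1) % n][j]
                + grid[i][(j - 1) % m] + grid[i][(j + 1) % m]
            )
    return out


def predicted_sums(sums, w_center, w_neighbor):
    """Predict row (or column) sums after one sweep from the sums before it.

    Each line receives its own sum weighted by the center and by the two
    in-line neighbors, plus the sums of the two adjacent lines.
    """
    n = len(sums)
    return [
        (w_center + 2 * w_neighbor) * sums[k]
        + w_neighbor * (sums[(k - 1) % n] + sums[(k + 1) % n])
        for k in range(n)
    ]


def detect_and_correct(before, after, w_center, w_neighbor, tol=1e-9):
    """Compare actual and predicted checksums; fix a single corrupted cell.

    Returns None if no error is detected, or the (row, col) that was
    corrected in place. Raises ValueError if the mismatch pattern does not
    correspond to exactly one corrupted cell.
    """
    row_exp = predicted_sums([sum(r) for r in before], w_center, w_neighbor)
    col_exp = predicted_sums([sum(c) for c in zip(*before)], w_center, w_neighbor)
    row_act = [sum(r) for r in after]
    col_act = [sum(c) for c in zip(*after)]

    bad_rows = [i for i in range(len(row_act)) if abs(row_act[i] - row_exp[i]) > tol]
    bad_cols = [j for j in range(len(col_act)) if abs(col_act[j] - col_exp[j]) > tol]

    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # The row checksum error equals the error in the corrupted cell.
        after[i][j] -= row_act[i] - row_exp[i]
        return (i, j)
    raise ValueError("mismatch pattern is not a single corrupted cell")
```

In use, one would call `stencil_sweep` on the current grid and then `detect_and_correct(before, after, w_center, w_neighbor)` before accepting the new grid. The tolerance `tol` is an assumption as well: with floating-point data it has to be chosen relative to the magnitude of the checksums, and very small corruptions can fall below it.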
