Algorithmic Based Fault Tolerance Applied to High Performance Computing
George Bosilca, Remi Delmas, Jack Dongarra, and Julien Langou

TL;DR
This paper introduces a scalable, algorithm-based fault tolerance method for high-performance computing that detects and corrects errors during computation. It is demonstrated with a fault-tolerant matrix-matrix multiplication that reaches 1.4 TFLOPS on 484 processors with less than 12% overhead.
Contribution
It adapts Algorithmic Based Fault Tolerance to parallel systems, providing a scalable fault tolerance mechanism with error correction capabilities for HPC.
Findings
Achieved 1.4 TFLOPS on 484 processors with fault tolerance
Fault tolerance overhead is less than 12%
Overhead decreases as processor count increases
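As a rough back-of-envelope check, the 1.4 TFLOPS figure can be combined with the "65% of the machine peak" reported in the abstract to see what peak rate those two numbers imply. This is a sketch built only on the rounded values quoted on this page. The variable names are illustrative and the derived figures are approximations, not numbers taken from the paper.

```python
# Rounded figures quoted in the findings and abstract.
sustained_tflops = 1.4      # fault-tolerant matrix-matrix multiplication
processors = 484
fraction_of_peak = 0.65     # "65% of the machine peak"

# Implied aggregate peak, assuming the 65% refers to the same 484 processors.
implied_peak_tflops = sustained_tflops / fraction_of_peak

# Per-processor rates in GFLOPS (1 TFLOPS = 1000 GFLOPS).
sustained_gflops_per_proc = sustained_tflops * 1000 / processors
implied_peak_gflops_per_proc = implied_peak_tflops * 1000 / processors

print(f"implied peak: {implied_peak_tflops:.2f} TFLOPS")
print(f"sustained per processor: {sustained_gflops_per_proc:.2f} GFLOPS")
print(f"implied peak per processor: {implied_peak_gflops_per_proc:.2f} GFLOPS")
```

Because the inputs are rounded, the results are only good to about two significant digits.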
Abstract
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flips) on the fly during a computation. To assess the viability of our approach, we have developed a fault-tolerant matrix-matrix multiplication subroutine, and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication reaches 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result even though one process failure occurred during the run. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We…
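The checksum idea that the abstract builds on (Huang and Abraham, 1984) can be illustrated with a small sequential sketch: append a row of column sums to A and a column of row sums to B, multiply, and the product carries its own row and column checksums, which together locate and repair a single corrupted entry. This is not the authors' parallel routine. The function names are hypothetical, the matrices are plain Python lists, and integer entries are assumed so that checksums can be compared exactly (floating-point data would need a tolerance).

```python
def matmul(a, b):
    """Plain product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append to each row of b one extra entry holding that row's sum."""
    return [row + [sum(row)] for row in b]

def locate_and_correct(cf):
    """Repair at most one corrupted data entry of a full-checksum matrix.

    cf has its row sums in the last column and its column sums in the
    last row. Returns the (row, column) that was fixed, or None if all
    checksums already agree.
    """
    n, m = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n) if sum(cf[i][:m]) != cf[i][m]]
    bad_cols = [j for j in range(m)
                if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("not a single data-entry error; cannot correct")
    i, j = bad_rows[0], bad_cols[0]
    # Row checksum minus the other entries of that row gives the true value.
    cf[i][j] = cf[i][m] - (sum(cf[i][:m]) - cf[i][j])
    return (i, j)

A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9], [10, 11, 12]]
expected = matmul(A, B)

# The product of the two checksum matrices is a full-checksum matrix.
C_full = matmul(with_column_checksum(A), with_row_checksum(B))

C_full[1][2] ^= 1 << 4            # simulate a bit flip in one data entry
fixed_at = locate_and_correct(C_full)
recovered = [row[:-1] for row in C_full[:-1]]
print(fixed_at, recovered == expected)
```

The intersection of the one inconsistent row and the one inconsistent column pinpoints the corrupted entry, so no recomputation of the product is needed to repair it.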
Taxonomy
Topics: Distributed systems and fault tolerance · Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques
