A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

Michael Treaster

arXiv:cs/0501002·cs.DC·May 23, 2007·81 cites

TL;DR

This survey reviews various fault-tolerance and fault-recovery techniques used in supercomputing systems to enhance their robustness against hardware failures.

Contribution

It provides a comprehensive overview of existing fault-tolerance methods in parallel systems, highlighting their approaches and effectiveness.

Findings

01 Various fault-tolerance techniques are categorized and analyzed.
02 The survey identifies strengths and limitations of different methods.
03 It offers insights into future directions for resilient supercomputing.

Abstract

Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques.

Taxonomy

Topics: Distributed systems and fault tolerance · Software System Performance and Reliability · Interconnection Networks and Systems