Erasure coding for fault oblivious linear system solvers
David F. Gleich, Ananth Grama, Yao Zhu

TL;DR
This paper introduces an input augmentation method, modeled on erasure-coded storage, that lets a linear system solver run oblivious to faults and recover the true solution from the augmented output afterward, with the aim of lowering fault-tolerance overhead in large-scale parallel systems.
Contribution
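It proposes a fault-tolerance approach for linear solvers that augments the input, runs a minimally modified solver on it, and recovers the solution from the augmented output, with lower resource overhead than checkpoint-restart, active replicas, or deterministic replay.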
Findings
Fault correction with less than 10% overhead for single faults
Effective handling of up to 20% fault rates with reasonable overhead
Significant improvement over existing fault-tolerance techniques
Abstract
Dealing with hardware and software faults is an important problem as parallel and distributed systems scale to millions of processing cores and wide area networks. Traditional methods for dealing with faults include checkpoint-restart, active replicas, and deterministic replay. Each of these techniques has associated resource overheads and constraints. In this paper, we propose an alternate approach to dealing with faults, based on input augmentation. This approach, which is an algorithmic analog of erasure coded storage, applies a minimally modified algorithm on the augmented input to produce an augmented output. The execution of such an algorithm proceeds completely oblivious to faults in the system. In the event of one or more faults, the real solution is recovered using a rapid reconstruction method from the augmented output. We demonstrate this approach on the problem of solving…
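The augment, solve, recover pattern described in the abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the symmetric positive definite test matrix `A`, the encoding matrix `E`, the construction `Z = [I | E]`, the helper names, and the choice to model a fault as the permanent loss of one solution component. The underlying algebra is standard: if the surviving columns `Z_S` of `Z` still span the whole space, any solution `w` of `Z_S^T A Z_S w = Z_S^T b` satisfies `A (Z_S w) = b`, so `x = Z_S w`.

```python
# Illustrative sketch only: A, E, Z = [I | E] and the erasure model are
# assumptions for this example, not the paper's stated construction.

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(P, Q):
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

def transpose(M):
    return [list(col) for col in zip(*M)]

def conjugate_gradient(M, rhs, tol=1e-12, max_iter=500):
    """Plain CG from a zero start; M may be singular if the system is consistent."""
    x = [0.0] * len(rhs)
    r = list(rhs)
    p = list(r)
    rr = sum(v * v for v in r)
    stop = (tol ** 2) * rr
    for _ in range(max_iter):
        if rr <= stop:
            break
        Mp = matvec(M, p)
        pMp = sum(a * c for a, c in zip(p, Mp))
        if pMp <= 0.0:  # guard against breakdown in the singular case
            break
        alpha = rr / pMp
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * mi for ri, mi in zip(r, Mp)]
        rr_new = sum(v * v for v in r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

def solve_with_erasures(A, b, E, erased):
    """Solve A x = b while ignoring the solution components listed in `erased`."""
    n, k = len(A), len(E[0])
    erased_set = set(erased)
    # Augmentation operator Z = [I | E]: n original columns plus k coded columns.
    Z = [[1.0 if i == j else 0.0 for j in range(n)] + [float(e) for e in E[i]]
         for i in range(n)]
    keep = [j for j in range(n + k) if j not in erased_set]
    Zs = [[row[j] for j in keep] for row in Z]   # surviving columns of Z
    Zs_t = transpose(Zs)
    M = matmul(Zs_t, matmul(A, Zs))              # augmented, then reduced, matrix
    rhs = matvec(Zs_t, b)
    w = conjugate_gradient(M, rhs)
    return matvec(Zs, w)                         # recovery step: x = Z_S w

A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
E = [[1, 1], [1, 2], [1, 3], [1, 4]]  # any two rows are linearly independent
x_true = [1.0, 2.0, 3.0, 4.0]
b = matvec(A, x_true)

x_rec = solve_with_erasures(A, b, E, erased=[1, 3])
print([round(v, 6) for v in x_rec])
```

With two coded columns, this toy setup tolerates the loss of any two components, because every pair of rows of `E` is linearly independent; the recovered vector should agree with `x_true` up to the solver tolerance. Conjugate gradient is used here only because it copes with the singular but consistent systems that arise when fewer components are lost than coded columns were added.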
Taxonomy
Topics: Interconnection Networks and Systems · Distributed systems and fault tolerance · Distributed and Parallel Computing Systems
