Near-zero Downtime Recovery from Transient-error-induced Crashes
Chao Chen, Greg Eisenhauer, and Santosh Pande

TL;DR
IterPro is a lightweight, compiler-assisted resilience technique that enables near-zero downtime recovery from transient errors in HPC systems by repairing process states on-the-fly with minimal overhead.
Contribution
The paper introduces IterPro, a compiler-based method that uses new code transformations for fast, accurate recovery from transient-error-induced crashes in high-performance computing.
Findings
Recovers 83.55% of crash errors within dozens of milliseconds
Incurs almost zero runtime overhead during normal execution
Uses a small, fixed memory overhead of 27 MB
Abstract
Due to system scaling, transient errors caused by external noise, e.g., heat fluxes and particle strikes, have become a growing concern for current and upcoming extreme-scale high-performance computing (HPC) systems. However, since such errors are still quite rare compared to no-fault cases, desirable solutions call for low- or no-overhead systems that do not compromise performance under no-fault conditions and also allow very fast fault recovery to minimize downtime. In this paper, we present IterPro, a lightweight compiler-assisted resilience technique to quickly and accurately recover processes from transient-error-induced crashes. IterPro repairs corrupted process states on-the-fly when errors occur, enabling applications to continue executing instead of being terminated. IterPro also exploits side effects introduced by induction variable based code…
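The abstract is cut off where it mentions induction-variable-based code, so the exact mechanism is not stated here. As a rough illustration of the general idea of repairing state instead of terminating, the Python sketch below simulates a loop that keeps a counter `i` and a derived offset `off` in lockstep. The function `sum_with_repair`, its parameters, and the use of `IndexError` as a stand-in for a crash are all assumptions made for this example, not IterPro's implementation or API.

```python
# Illustrative simulation only: every name here is hypothetical and
# is not taken from IterPro or the paper.

def sum_with_repair(data, stride, corrupt_at=None):
    """Sum every `stride`-th element of `data`.

    The loop tracks a counter `i` and a derived offset `off`. In a
    fault-free run `off == i * stride` always holds, so a corrupted
    `off` can be rebuilt from `i`.
    """
    total = 0
    repairs = 0
    n_iters = len(data) // stride
    i = 0
    off = 0
    while i < n_iters:
        if i == corrupt_at:
            off ^= 1 << 40      # simulate a bit flip in the derived offset
            corrupt_at = None   # inject the fault only once
        try:
            value = data[off]   # out-of-range access stands in for a crash
        except IndexError:
            off = i * stride    # repair from the redundant induction variable
            repairs += 1
            continue            # retry the failed iteration instead of exiting
        total += value
        i += 1
        off += stride
    return total, repairs


if __name__ == "__main__":
    print(sum_with_repair(list(range(12)), 3))                # fault-free run
    print(sum_with_repair(list(range(12)), 3, corrupt_at=2))  # one injected fault
```

Both calls produce the same sum; the second also reports one repair. The sketch only covers faults that trigger an out-of-range access. A bit flip that leaves `off` inside the valid range would go unnoticed here, which matches the focus on crash-causing errors rather than silent data corruption.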
Taxonomy
Topics: Radiation Effects in Electronics · Distributed systems and fault tolerance · Parallel Computing and Optimization Techniques
