RedThreads: An Interface for Application-level Fault Detection/Correction through Adaptive Redundant Multithreading
Saurabh Hukerikar, Keita Teranishi, Pedro C. Diniz, Robert F. Lucas

TL;DR
RedThreads introduces an adaptive interface for application-level fault detection and correction using redundant multithreading, allowing dynamic control over redundancy to balance resilience and performance in high-performance computing applications.
Contribution
The paper presents a novel interface and runtime system for adaptive, application-level redundant multithreading, aiming to improve fault tolerance with minimal performance impact.
Findings
Adaptive RMT effectively balances resilience and overhead.
Runtime system dynamically enables/disables redundancy.
Experimental results show improved fault coverage with controlled overhead.
Abstract
In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. The redundant multithreading (RMT) approach offers lightweight replicated execution streams of program instructions within the context of a single application process. However, the use of complete redundancy incurs significant overhead to the application performance. In this paper we present RedThreads, an interface that provides application-level fault detection and…
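To make the idea of redundant execution with selective enabling concrete, the following is a minimal sketch in Python. It is not the RedThreads interface, whose syntax is not shown in this summary. The names `run_redundant`, `FaultyOnce`, and `dot` are hypothetical, and the sketch assumes the replicated computation is deterministic and returns a hashable value. It runs the same function in several threads, compares the results to detect a mismatch, and takes a majority vote to correct it. The `enabled` flag stands in for the application's ability to switch redundancy off where the overhead is not worthwhile.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_redundant(fn, args, replicas=3, enabled=True):
    """Run fn(*args) in `replicas` threads and majority-vote the results.

    Returns (value, fault_detected). With enabled=False the function
    runs once and its result is returned unchecked.
    """
    if not enabled or replicas < 2:
        return fn(*args), False

    with ThreadPoolExecutor(max_workers=replicas) as pool:
        futures = [pool.submit(fn, *args) for _ in range(replicas)]
        results = [f.result() for f in futures]

    votes = Counter(results)  # assumes results are hashable
    value, count = votes.most_common(1)[0]
    fault_detected = len(votes) > 1
    if count <= replicas // 2:
        # A mismatch was seen, but no strict majority exists to correct it.
        raise RuntimeError("replicas disagree and no majority exists")
    return value, fault_detected


class FaultyOnce:
    """Hypothetical test helper: corrupts the result of exactly one call."""

    def __init__(self, fn):
        self._fn = fn
        self._lock = threading.Lock()
        self._fired = False

    def __call__(self, *args):
        result = self._fn(*args)
        with self._lock:
            if not self._fired:
                self._fired = True
                return result + 1  # simulated silent data corruption
        return result


def dot(xs, ys):
    """Deterministic kernel used as the protected computation."""
    return sum(x * y for x, y in zip(xs, ys))


if __name__ == "__main__":
    vectors = ((1, 2, 3), (4, 5, 6))
    print(run_redundant(dot, vectors))                             # fault-free run
    print(run_redundant(FaultyOnce(dot), vectors))                 # one replica corrupted
    print(run_redundant(FaultyOnce(dot), vectors, enabled=False))  # redundancy off
```

With three replicas, a single corrupted result is outvoted by the two correct ones. With two replicas, a mismatch can only be detected, not corrected, so the sketch raises an error. Because of CPython's global interpreter lock, the threads here illustrate the control flow only; they do not reproduce the performance behavior of native redundant threads.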
Taxonomy
Topics: Radiation Effects in Electronics · Distributed systems and fault tolerance · Parallel Computing and Optimization Techniques
