Implementing Software Resiliency in HPX for Extreme Scale Computing
Nikunj Gupta, Jackson R. Mayo, Adrian S. Lemoine, Hartmut Kaiser

TL;DR
This paper presents the implementation of software resiliency features in the HPX runtime system, enabling applications to handle hardware failures more reliably with minimal overhead, especially for larger tasks.
Contribution
The paper introduces two new resiliency APIs in HPX, task replication and task replay, along with a verification API, enhancing fault tolerance for extreme-scale computing.
Findings
The resiliency APIs incur only minor overheads for tasks longer than 200 μs.
When replay or replication is used, the repeated execution of the tasks themselves, rather than the API machinery, dominates execution time.
APIs effectively improve fault tolerance with minimal performance impact.
Abstract
Exceptions and errors occurring within mission-critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures will invariably increase. Therefore, designing our applications to be resilient is a critical concern in order to retain the reliability of results while meeting the constraints on power budgets. In this paper, we implement software resilience in HPX, an Asynchronous Many-Task Runtime system. We implement two resiliency APIs that we expose to application developers, namely task replication and task replay. Task replication creates n copies of a task and executes them asynchronously. Task replay reschedules a task up to n times until a valid output is returned. Furthermore, we introduce an API that allows the application to verify the returned result with a user-provided predicate. We test the…
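The replay and replication semantics described in the abstract can be illustrated with a minimal sketch in standard C++. This is not the HPX API: it uses std::async instead of HPX tasks, and the names replay_validate and replicate_validate are hypothetical, chosen only for this illustration. The rule that replication returns the first copy whose result passes the predicate is also an assumption, since the abstract excerpt does not say how a result is selected.

```cpp
#include <cstddef>
#include <future>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of HPX): task replay with validation.
// Runs `task` up to `n` times, one attempt after another, and returns the
// first result accepted by the user-provided predicate `valid`.
template <typename Pred, typename F>
auto replay_validate(std::size_t n, Pred valid, F task)
    -> std::future<decltype(task())>
{
    using R = decltype(task());
    return std::async(std::launch::async, [n, valid, task]() mutable -> R {
        for (std::size_t attempt = 0; attempt < n; ++attempt) {
            try {
                R result = task();
                if (valid(result))
                    return result;
            } catch (...) {
                // A throwing attempt counts as a failure; try again.
            }
        }
        throw std::runtime_error("replay: no valid result after n attempts");
    });
}

// Hypothetical helper (not part of HPX): task replication with validation.
// Launches `n` copies of `task` concurrently and returns the first result
// (in launch order) accepted by `valid`. Selection rule is an assumption.
template <typename Pred, typename F>
auto replicate_validate(std::size_t n, Pred valid, F task)
    -> std::future<decltype(task())>
{
    using R = decltype(task());
    return std::async(std::launch::async, [n, valid, task]() mutable -> R {
        std::vector<std::future<R>> copies;
        for (std::size_t i = 0; i < n; ++i)
            copies.push_back(std::async(std::launch::async, task));

        for (auto& copy : copies) {
            try {
                R result = copy.get();
                if (valid(result))
                    return result;  // remaining copies are joined when `copies` is destroyed
            } catch (...) {
                // A throwing copy counts as a failure; check the next one.
            }
        }
        throw std::runtime_error("replicate: no copy produced a valid result");
    });
}
```

Both helpers return a future, so a caller would write something like replay_validate(3, predicate, task).get() and handle the exception raised when every attempt fails. The sketch also shows why the cost scales with the task itself: replay pays for extra executions only after a failure, while replication always pays for n executions.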
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed systems and fault tolerance · Radiation Effects in Electronics
