Soft Error Resilience and Failure Recovery for Continuum Dynamics Applications
Li Tan, Marc Charest, Nathan DeBardeleben, Qiang Guan, Ben Bergen

TL;DR
This paper explores application-level resilience for continuum dynamics software by leveraging domain invariants, proposing a checksum-based recovery method, and demonstrating its effectiveness through fault injection experiments.
Contribution
It introduces a novel checksum-retry approach utilizing domain invariants for lightweight failure detection and recovery in continuum dynamics applications.
Findings
Effective fault detection using invariants
Lightweight, non-intrusive recovery method
Successful fault injection experiments
Abstract
Growing resilience concerns in today's large-scale computing systems call not only for generic fault tolerance approaches but also for application-level resilience, driven by efficiency demands and domain-specific requirements. Scientific applications within a particular domain generally obey that domain's conservation laws, which can serve as an error detection criterion for studying the resilience of applications that share similar program characteristics. Achieving application resilience nevertheless poses two challenges: (a) how to identify the invariants of a given domain of applications from its conservation laws, and (b) how to use those invariants to efficiently detect and recover from failures during application runs. In this work, we target several continuum dynamics software packages, FleCSALE [1] and CODY [2] (with intrinsic invariants during…
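The idea of using a conservation law as an error detection criterion, and retrying a step when the check fails, can be illustrated with a minimal sketch. This is not the implementation used in FleCSALE or CODY. The toy update rule, the function names (total_mass, advect_step, step_with_retry, make_faulty_step), the tolerance, and the retry limit are all assumptions made for illustration only.

```python
import math


def total_mass(cells):
    """Hypothetical invariant: the sum of cell masses, conserved by advect_step."""
    return math.fsum(cells)


def advect_step(cells):
    """Toy conservative update on a periodic 1D grid (an assumption, not a real solver).

    Each cell keeps 90% of its mass and passes 10% to its right neighbor,
    so the total mass is unchanged up to rounding.
    """
    n = len(cells)
    out = [0.0] * n
    for i, m in enumerate(cells):
        out[i] += 0.9 * m
        out[(i + 1) % n] += 0.1 * m
    return out


def step_with_retry(cells, step, invariant, rel_tol=1e-9, max_retries=3):
    """Run one step, check the invariant, and recompute from the saved input on violation.

    Returns the accepted state and the number of retries that were needed.
    """
    expected = invariant(cells)
    for attempt in range(max_retries + 1):
        candidate = step(list(cells))  # copy, so the saved input stays intact
        if math.isclose(invariant(candidate), expected, rel_tol=rel_tol):
            return candidate, attempt
    raise RuntimeError("invariant still violated after retries")


def make_faulty_step(step, fail_times):
    """Wrap a step so its first fail_times calls return a corrupted result (simulated soft error)."""
    remaining = [fail_times]

    def faulty(cells):
        out = step(cells)
        if remaining[0] > 0:
            remaining[0] -= 1
            out[0] += 1.0  # injected fault: breaks mass conservation
        return out

    return faulty


if __name__ == "__main__":
    cells = [1.0, 2.0, 3.0, 4.0]
    faulty = make_faulty_step(advect_step, fail_times=1)
    state, retries = step_with_retry(cells, faulty, total_mass)
    print("retries needed:", retries)
    print("mass after step:", total_mass(state))
```

In this sketch the invariant acts as the checksum: the check costs one reduction per step, and recovery only recomputes the failed step from its saved input. A fault that happens to preserve the invariant would go undetected by this check.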
Taxonomy
Topics: Distributed systems and fault tolerance · Software System Performance and Reliability · Security and Verification in Computing
