MOARD: Modeling Application Resilience to Transient Faults on Data Objects
Luanzheng Guo, Dong Li

TL;DR
This paper introduces MOARD, a novel method and tool for modeling and quantifying application resilience to transient hardware faults on data objects, addressing limitations of traditional fault injection methods.
Contribution
The paper presents MOARD, a systematic approach to measure application resilience considering data semantics, enabling targeted fault tolerance strategies.
Findings
MOARD effectively quantifies error masking events.
MOARD's results can guide the design of efficient fault tolerance mechanisms.
Application semantics significantly influence error tolerance.
Abstract
Understanding application resilience (or error tolerance) in the presence of transient hardware faults on data objects is critical to ensure computing integrity and enable efficient application-level fault tolerance mechanisms. However, we lack a method and a tool to quantify application resilience to transient faults on data objects. The traditional method, random fault injection, cannot help, because it loses data semantics and provides insufficient information on how and where errors are tolerated. In this paper, we introduce a method and a tool (called MOARD) to model and quantify application resilience to transient faults on data objects. Our method is based on systematically quantifying error masking events caused by application-inherent semantics and program constructs. We use MOARD to study how and why errors in data objects can be tolerated by the application. We demonstrate tangible…
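To make "error masking events caused by program constructs" concrete, the following toy sketch flips each bit of a single integer and checks whether a consuming operation still produces the same result. It is an illustration only and not MOARD's implementation; the names `flip_bit`, `is_masked`, `masking_ratio`, and the operations in `ops` are hypothetical, and the exhaustive single-bit enumeration is an assumption chosen for simplicity.

```python
def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Return `value` with one bit flipped, truncated to `width` bits."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def is_masked(op, value: int, bit: int, width: int = 32) -> bool:
    """True when a single-bit error in `value` does not change op's result."""
    return op(value) == op(flip_bit(value, bit, width))


def masking_ratio(op, value: int, width: int = 32) -> float:
    """Fraction of single-bit flips in `value` that `op` masks."""
    masked = sum(is_masked(op, value, b, width) for b in range(width))
    return masked / width


# Hypothetical operations that consume one element of a data object.
ops = {
    "shift_right_8": lambda x: x >> 8,      # low 8 bits are discarded
    "compare_gt_1000": lambda x: x > 1000,  # only the boolean outcome matters
    "overwrite": lambda x: 0,               # value is overwritten before use
    "add_one": lambda x: x + 1,             # every flip propagates
}

if __name__ == "__main__":
    for name, op in ops.items():
        print(f"{name}: {masking_ratio(op, 5000):.3f}")
```

In this sketch, the shift and the overwrite mask errors through the program construct itself, while the comparison masks most flips because only the branch outcome is observed. Attributing a masked error to the operation that masked it is the kind of "how and where" information that, per the abstract, random fault injection does not provide.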
Taxonomy
Topics: Radiation Effects in Electronics · Distributed systems and fault tolerance · Security and Verification in Computing
