ZOFI: Zero-Overhead Fault Injection Tool for Fast Transient Fault Coverage Analysis
Vasileios Porpodas

TL;DR
ZOFI is a novel fault-injection tool that enables fast, zero-overhead transient fault coverage analysis by running workloads at native speed, significantly outperforming traditional simulation-based methods.
Contribution
The paper introduces ZOFI, a zero-overhead, timing-based fault-injection tool that allows for rapid transient fault coverage analysis without slowing down workload execution.
Findings
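ZOFI runs the analyzed workload at native speed, which shortens fault-coverage analysis compared with simulator- or instrumentation-based tools that are usually orders of magnitude slower than native execution.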
It is easy to use and freely available as open-source software.
ZOFI significantly outperforms existing simulation-based fault-injection tools.
Abstract
The experimental evaluation of fault-tolerance studies relies on tools that inject errors while programs are running, and then monitor the execution and the output for faulty execution. In particular, the established methodology in software-based transient-fault reliability studies involves running each workload hundreds or thousands of times, injecting a random bit-flip in the process. The majority of such studies rely on custom-built fault-injection tools that are based on either a modified processor simulator or a code instrumentation framework. Such tools are non-trivial to develop and are usually orders of magnitude slower than native execution. In this paper we present ZOFI, a novel timing-based fault-injection tool that is aimed at being used in fault-coverage studies for transient faults. ZOFI is a zero-overhead tool, meaning that the analyzed workload runs at native speed.…
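The methodology described in the abstract (run the workload many times, flip one random bit per run, compare the output against a fault-free reference) can be illustrated with a toy model. The sketch below is not ZOFI's implementation and does not inject faults into a real process; the names `flip_bit`, `workload`, and `run_campaign` are hypothetical, the workload is a simple maximum over a list, and the randomly chosen loop step stands in for a randomly chosen injection time.

```python
import random


def flip_bit(value: int, bit: int) -> int:
    """Return `value` with the bit at position `bit` inverted."""
    return value ^ (1 << bit)


def workload(data, fault=None):
    """Toy workload (an assumption for illustration): maximum of non-negative ints.

    `fault` is an optional (step, bit) pair. At loop iteration `step`,
    one bit of the element being read is flipped before it is used.
    """
    best = 0
    for step, x in enumerate(data):
        if fault is not None and fault[0] == step:
            x = flip_bit(x, fault[1])
        best = max(best, x)
    return best


def run_campaign(data, runs, bits=8, seed=0):
    """Run `runs` single-fault experiments and classify each outcome.

    "masked": the output matches the fault-free (golden) output.
    "corrupted": the output differs from the golden output.
    """
    rng = random.Random(seed)
    golden = workload(data)  # fault-free reference run
    counts = {"masked": 0, "corrupted": 0}
    for _ in range(runs):
        # Random step models a random injection time; random bit models the flip.
        fault = (rng.randrange(len(data)), rng.randrange(bits))
        outcome = "masked" if workload(data, fault) == golden else "corrupted"
        counts[outcome] += 1
    return counts


if __name__ == "__main__":
    print(run_campaign([3, 200, 7, 42], runs=1000))
```

In this toy model, the fraction of masked runs plays the role of the fault-coverage figure that a campaign reports. A real tool would also need to distinguish crashes, hangs, and detected errors, which this sketch omits.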
Taxonomy
Topics: Radiation Effects in Electronics · Parallel Computing and Optimization Techniques · Distributed systems and fault tolerance
