FINJ: A Fault Injection Tool for HPC Systems
Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

TL;DR
FINJ is a versatile fault injection tool designed for HPC systems that supports complex experiments, custom workloads, and integration with other tools to simulate diverse fault conditions across many nodes.
Contribution
Introduces FINJ, a high-level fault injection framework enabling complex, customizable, and scalable fault experiments in HPC environments with easy integration capabilities.
Findings
Supports complex fault scenarios in HPC systems
Allows integration with existing fault injection tools
Enables experiments on many interacting nodes
Abstract
We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, with a focus on the management of complex experiments. FINJ supports custom workloads and generates anomalous conditions by running fault-triggering executable programs. FINJ can also be integrated seamlessly with most other lower-level fault injection tools, allowing users to create and monitor a variety of highly complex and diverse fault conditions in HPC systems that would be difficult to recreate in practice. FINJ is suitable for experiments involving many, potentially interacting nodes, making it a very versatile design and evaluation tool.
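The idea of driving an experiment through fault-triggering executables can be illustrated with a minimal sketch. This is not FINJ's actual interface: the `Task` fields and the `run_task` and `run_workload` helpers are hypothetical names chosen for illustration, and the sketch runs tasks one at a time on a single machine rather than across many nodes.

```python
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class Task:
    """One entry in a hypothetical workload (not FINJ's real format)."""
    args: list           # command line of a fault-triggering program or benchmark
    start_offset: float  # seconds after the experiment begins
    duration: float      # seconds before the program is forcibly stopped


def run_task(task: Task) -> dict:
    """Launch one program and stop it if it outlives its allotted duration."""
    started = time.monotonic()
    proc = subprocess.Popen(task.args)
    try:
        proc.wait(timeout=task.duration)
        timed_out = False
    except subprocess.TimeoutExpired:
        proc.terminate()  # ask the program to stop
        proc.wait()       # reap it so returncode is set
        timed_out = True
    return {
        "args": task.args,
        "returncode": proc.returncode,
        "timed_out": timed_out,
        "elapsed": time.monotonic() - started,
    }


def run_workload(tasks) -> list:
    """Run tasks sequentially, ordered by start_offset, and return a log.

    Simplification: a task whose start time passes while an earlier task
    is still running starts late instead of running concurrently.
    """
    t0 = time.monotonic()
    log = []
    for task in sorted(tasks, key=lambda t: t.start_offset):
        delay = task.start_offset - (time.monotonic() - t0)
        if delay > 0:
            time.sleep(delay)
        log.append(run_task(task))
    return log
```

For example, `run_workload([Task([sys.executable, "-c", "pass"], 0.0, 5.0)])` would run a trivial program and record its exit status. A real experiment would substitute an actual fault-triggering executable and collect monitoring data alongside the log.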
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Radiation Effects in Electronics
