Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo
Valerio Formicola, Saurabh Jha, Daniel Chen, Fei Deng, Amanda Bonnie, Mike Mason, Jim Brandt, Ann Gentile, Larry Kaplan, Jason Repik, Jeremy Enos, Mike Showerman, Annette Greiner, Zbigniew Kalbarczyk, Ravishankar K. Iyer, and Bill Krammer

TL;DR
This paper reports fault injection experiments on the Cielo supercomputer to understand failure causes, propagation, and impacts, aiming to improve system resilience and create targeted failure scenarios.
Contribution
It introduces a comprehensive fault injection methodology for Cray supercomputers, linking failure data analysis with experimental fault scenarios for system and application testing.
Findings
Characterized fault-error-failure sequences in Cray systems
Identified impact of failures on applications at different scales
Developed fault injection techniques for unrecoverable failure scenarios
Abstract
We present a set of fault injection experiments performed on the ACES (LANL/SNL) Cray XE supercomputer Cielo. We use this experimental campaign to improve the understanding of failure causes and propagation that we observed in the field failure data analysis of NCSA's Blue Waters. We use the data collected from the logs and from network performance counters 1) to characterize the fault-error-failure sequence and recovery mechanisms in the Gemini network and in the Cray compute nodes, 2) to understand the impact of failures on the system and the user applications at different scales, and 3) to identify and recreate fault scenarios that induce unrecoverable failures, in order to create new tests for system and application design. The faults were injected through special input commands to bring down network links, directional connections, nodes, and blades. We present extensions that…
Taxonomy
Topics: Distributed systems and fault tolerance · Radiation Effects in Electronics · Software System Performance and Reliability
