Characterizing Impacts of Storage Faults on HPC Applications: A Methodology and Insights
Bo Fang, Daoce Wang, Sian Jin, Quincey Koziol, Zhao Zhang, Qiang Guan, Suren Byna, Sriram Krishnamoorthy, Dingwen Tao

TL;DR
This paper introduces FFIS, a fault injection framework for studying how SSD-related storage faults affect HPC applications, revealing their error resilience and impact on data integrity.
Contribution
It presents a novel FUSE-based fault injection methodology to systematically analyze storage fault impacts on HPC applications and data formats.
Findings
Different HPC applications react variably to storage faults.
HDF5 file format shows specific resilience characteristics.
FFIS effectively models SSD-related data corruptions.
Abstract
In recent years, the increasing complexity of scientific simulations and the emerging demand for training heavy artificial intelligence models have required massive and fast data access, which pushes high-performance computing (HPC) platforms to adopt more advanced storage infrastructure such as solid-state disks (SSDs). While SSDs offer high-performance I/O, the reliability challenges that HPC applications face under SSD-related failures remain unclear, in particular for failures that result in data corruption. The goal of this paper is to understand the impact of SSD-related faults on the behavior of complex HPC applications. To this end, we propose FFIS, a FUSE-based fault injection framework that systematically introduces storage faults into the application layer to model errors originating from SSDs. FFIS is able to plant different I/O related faults into the data returned…
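To make the idea of corrupting data on the read path concrete, here is a minimal Python sketch. It is not the FFIS implementation and does not use FUSE: it only shows how a layer sitting between an application and its storage could alter the bytes a read returns. The bit-flip fault model, the function names (`inject_bit_flips`, `count_bit_diff`, `faulty_read`), and the `seed` parameter are all assumptions made for illustration; the abstract does not specify which fault types FFIS injects.

```python
import random


def inject_bit_flips(data: bytes, n_flips: int, seed: int = 0) -> bytes:
    """Return a copy of `data` with `n_flips` distinct bits inverted.

    Hypothetical fault model: random bit flips in the returned buffer.
    """
    total_bits = len(data) * 8
    if n_flips > total_bits:
        raise ValueError("more flips requested than bits available")
    rng = random.Random(seed)  # seeded so an injection run is repeatable
    corrupted = bytearray(data)
    # Sample distinct bit positions so that every flip changes the output.
    for bit in rng.sample(range(total_bits), n_flips):
        corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)


def count_bit_diff(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length buffers."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def faulty_read(path: str, offset: int, size: int, n_flips: int, seed: int = 0) -> bytes:
    """Read `size` bytes at `offset`, then corrupt them before returning.

    Stands in for an interposed read handler; a real interposition layer
    would perform this step inside its own read callback.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        clean = f.read(size)
    return inject_bit_flips(clean, n_flips, seed)
```

In a study of this kind, the application would then run on the corrupted buffer and its outcome (crash, detected error, or silently wrong output) would be recorded for each injection. Seeding the generator makes each run reproducible.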
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed systems and fault tolerance · Parallel Computing and Optimization Techniques
