# Application-Level Resilience Modeling for HPC Fault Tolerance

**Authors:** Luanzheng Guo, Hanlin He, Dong Li

arXiv: 1705.00267 · 2017-05-02

## TL;DR

This paper introduces a data-oriented methodology that quantifies application resilience in HPC by modeling application-level fault masking, explaining how and why faults are tolerated rather than only measuring outcomes as random fault injection (RFI) does.

## Contribution

The paper models application resilience from application-inherent semantics and program constructs, using application execution information, and applies the model to direct fault tolerance mechanisms in HPC.

## Key findings

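- The model is used to study how and why HPC applications can (or cannot) tolerate faults
- Using the model to direct fault tolerance mechanisms yields tangible benefits
- Unlike random fault injection (RFI), whose results are often not deterministic, the model analyzes execution information directly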

## Abstract

Understanding the application resilience in the presence of faults is critical to address the HPC resilience challenge. Currently, we largely rely on random fault injection (RFI) to quantify the application resilience. However, RFI provides little information on how fault tolerance happens, and RFI results are often not deterministic due to its random nature. In this paper, we introduce a new methodology to quantify the application resilience. Our methodology is based on the observation that at the application level, the application resilience to faults is due to the application-level fault masking. The application-level fault masking happens because of application-inherent semantics and program constructs. Based on this observation, we analyze application execution information and use a data-oriented approach to model the application resilience. We use our model to study how and why HPC applications can (or cannot) tolerate faults. We demonstrate tangible benefits of using the model to direct fault tolerance mechanisms.
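To make the idea of application-level fault masking concrete, here is a minimal Python sketch. It is not the paper's model: the functions `flip_bit`, `is_masked`, and `masked_fraction`, and the three toy kernels, are hypothetical names chosen for illustration. The sketch simulates a single-bit flip in an integer input and checks whether a program construct (an overwrite, a shift, or an addition) hides the fault from the output. Enumerating every bit position gives the same answer on every run, whereas RFI would sample positions at random.

```python
def flip_bit(value: int, bit: int) -> int:
    """Simulate a single-bit soft error by flipping one bit of value."""
    return value ^ (1 << bit)


def overwritten(x: int) -> int:
    # The (possibly corrupted) input is overwritten before it is read,
    # so any fault in x is masked.
    x = 7
    return x * 2


def shifted_out(x: int) -> int:
    # The low 4 bits are discarded, so faults in bits 0..3 are masked.
    return x >> 4


def summed(x: int) -> int:
    # Every bit of x contributes to the result, so no fault is masked.
    return x + 1


def is_masked(fn, value: int, bit: int) -> bool:
    """True if flipping `bit` of the input leaves the output of fn unchanged."""
    return fn(value) == fn(flip_bit(value, bit))


def masked_fraction(fn, value: int, width: int = 16) -> float:
    """Fraction of the low `width` bit positions whose flip is masked by fn.

    Exhaustive enumeration, so the result is identical on every run.
    """
    masked = sum(is_masked(fn, value, b) for b in range(width))
    return masked / width


if __name__ == "__main__":
    for kernel in (overwritten, shifted_out, summed):
        print(kernel.__name__, masked_fraction(kernel, 1000))
```

In this toy setting the overwrite masks every fault, the shift masks only faults in the discarded bits, and the addition masks none, which illustrates how masking depends on program constructs rather than on chance.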

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1705.00267/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/1705.00267/full.md

## References

55 references — full list in the complete paper: https://tomesphere.com/paper/1705.00267/full.md

---
Source: https://tomesphere.com/paper/1705.00267