Characterization and Comparison of Application Resilience for Serial and Parallel Executions
Kai Wu, Qiang Guan, Nathan DeBardeleben, Dong Li

TL;DR
This paper investigates the fault patterns in serial and parallel HPC applications, revealing shared and unique fault sources to better predict error rates and improve resilience strategies.
Contribution
It characterizes fault patterns in serial and parallel executions, identifying common and unique fault sources to enhance understanding of application resilience.
Findings
Serial and parallel executions share some fault sources.
Parallel execution has unique fault sources.
Understanding fault patterns aids in predicting error rates.
Abstract
Soft errors in exascale applications are a challenging problem in modern HPC. To quantify an application's resilience and vulnerability, HPC users widely adopt application-level fault injection. However, this is not easy, because users must inject a large number of faults to ensure statistical significance, especially for the parallel version of a program. A parallel execution is normally more complex and requires more hardware resources than its serial counterpart. It is therefore essential to be able to predict the error rate of a parallel application from its corresponding serial version. In this poster, we characterize fault patterns in serial and parallel executions. We find, first, that serial and parallel executions share some fault sources. Second, parallel execution also has some unique fault sources compared with serial execution. Those unique fault sources are important…
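To make the cost of application-level fault injection concrete, here is a minimal Python sketch of a serial injection campaign. It is an illustration only and not the authors' tool or methodology: the dot-product kernel, the single-bit-flip fault model, the outcome labels (`masked`, `sdc`, `non_finite`), and all function names are assumptions chosen for the example. The `margin_of_error` helper uses the standard normal approximation for a proportion to show why many injections are needed before an estimated error rate becomes tight.

```python
import math
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of the IEEE 754 double encoding."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def dot(xs, ys):
    """Toy kernel standing in for the application under test (assumption)."""
    return sum(x * y for x, y in zip(xs, ys))


def run_campaign(trials: int, seed: int = 0, tol: float = 1e-9) -> dict:
    """Inject one random bit flip per trial into an input and classify the result."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(64)]
    ys = [rng.uniform(-1.0, 1.0) for _ in range(64)]
    golden = dot(xs, ys)  # fault-free reference output

    counts = {"masked": 0, "sdc": 0, "non_finite": 0}
    for _ in range(trials):
        faulty = list(xs)
        idx = rng.randrange(len(xs))
        faulty[idx] = flip_bit(xs[idx], rng.randrange(64))
        result = dot(faulty, ys)
        if not math.isfinite(result):
            counts["non_finite"] += 1  # NaN or Inf, easy to detect
        elif abs(result - golden) <= tol * max(1.0, abs(golden)):
            counts["masked"] += 1      # fault had no visible effect
        else:
            counts["sdc"] += 1         # silent data corruption
    return counts


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for a rate p."""
    return z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    trials = 10_000
    counts = run_campaign(trials)
    sdc_rate = counts["sdc"] / trials
    print(counts)
    print(f"SDC rate {sdc_rate:.3f} +/- {margin_of_error(sdc_rate, trials):.3f}")
```

Because the margin shrinks only with the square root of the number of injections, halving it takes four times as many runs, which is the cost that motivates predicting parallel error rates from serial ones.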
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Radiation Effects in Electronics
