PARIS: Predicting Application Resilience Using Machine Learning
Luanzheng Guo, Dong Li, Ignacio Laguna

TL;DR
PARIS is a machine-learning-based resilience prediction method that predicts multiple classes of fault manifestation for previously unseen applications, and does so significantly faster than traditional random fault injection.
Contribution
PARIS introduces a machine learning approach that predicts three classes of fault outcomes and generalizes to new applications, improving on existing models that predict only a single class (silent data corruption, SDC).
Findings
Achieves 82% accuracy in success prediction
Provides 77% accuracy in interruption prediction
Outperforms state-of-the-art in SDC prediction with 38% accuracy
Abstract
Extreme-scale scientific applications can be more vulnerable to soft errors (transient faults) as high-performance computing systems increase in scale. The common practice to evaluate the resilience to faults of an application is random fault injection, a method that can be highly time consuming. While resilience prediction modeling has been recently proposed to predict application resilience in a faster way than fault injection, it can only predict a single class of fault manifestation (SDC) and there is no evidence demonstrating that it can work on previously unseen programs, which greatly limits its re-usability. We present PARIS, a resilience prediction method that addresses the problems of existing prediction methods using machine learning. Using carefully-selected features and a machine learning model, our method is able to make resilience predictions of three classes of fault…
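To make the baseline concrete, the following is a minimal sketch of random fault injection, the time-consuming practice the abstract describes, not of PARIS itself. It flips one random bit in the input of a toy kernel and sorts each run into the three outcome classes named in the Findings. The kernel, the names (flip_bit, kernel, inject_once, campaign), the tolerance, and the class definitions used here (success = output matches the fault-free run, SDC = run completes with a wrong output, interruption = run raises an error) are illustrative assumptions, not taken from the paper.

```python
import math
import random
import struct

VALUES = [1.0, 4.0, 9.0, 16.0]  # toy input data (assumption, not from the paper)


def flip_bit(x: float, bit: int) -> float:
    """Flip one bit (0 = lowest mantissa bit, 63 = sign) of a 64-bit float."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def kernel(values, fault_index=None, bit=0):
    """Hypothetical stand-in for an application: a sum of square roots.

    If fault_index is given, one bit of that input is flipped before use.
    """
    total = 0.0
    for i, v in enumerate(values):
        if i == fault_index:
            v = flip_bit(v, bit)
        total += math.sqrt(v)  # raises ValueError on a negative input
    return total


def inject_once(values, fault_index, bit):
    """Run the kernel with one injected bit flip and classify the outcome."""
    golden = kernel(values)  # fault-free reference result
    try:
        result = kernel(values, fault_index, bit)
    except (ValueError, OverflowError):
        return "interruption"  # the run did not complete
    if math.isclose(result, golden, rel_tol=1e-9):
        return "success"  # the fault was masked
    return "sdc"  # the run completed with a wrong result


def campaign(values, trials=1000, seed=0):
    """Estimate the rate of each outcome class over many random injections."""
    rng = random.Random(seed)
    counts = {"success": 0, "sdc": 0, "interruption": 0}
    for _ in range(trials):
        outcome = inject_once(values, rng.randrange(len(values)), rng.randrange(64))
        counts[outcome] += 1
    return {name: n / trials for name, n in counts.items()}


if __name__ == "__main__":
    print(campaign(VALUES))
```

Each trial here costs one full run of the kernel, which is why campaigns on real applications become expensive; a prediction method aims to estimate the three rates without running the campaign.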
Taxonomy
Topics: Radiation Effects in Electronics · Distributed systems and fault tolerance · Software System Performance and Reliability
