PARIS: Predicting Application Resilience Using Machine Learning

Luanzheng Guo; Dong Li; Ignacio Laguna

arXiv:1812.02944·cs.DC·December 10, 2018·1 cites

TL;DR

PARIS is a machine-learning-based resilience prediction method that predicts three classes of fault manifestation (success, silent data corruption (SDC), and interruption) for previously unseen applications, and does so much faster than random fault injection.

Contribution

PARIS introduces a machine learning approach that predicts three classes of fault outcome (success, SDC, and interruption) and generalizes to previously unseen applications, whereas existing resilience prediction models predict only SDC.

Findings

01. Achieves 82% accuracy in success prediction

02. Achieves 77% accuracy in interruption prediction

03. Outperforms the state of the art in SDC prediction, with 38% accuracy

Abstract

Extreme-scale scientific applications can be more vulnerable to soft errors (transient faults) as high-performance computing systems increase in scale. The common practice to evaluate the resilience to faults of an application is random fault injection, a method that can be highly time-consuming. While resilience prediction modeling has been recently proposed to predict application resilience in a faster way than fault injection, it can only predict a single class of fault manifestation (SDC) and there is no evidence demonstrating that it can work on previously unseen programs, which greatly limits its reusability. We present PARIS, a resilience prediction method that addresses the problems of existing prediction methods using machine learning. Using carefully selected features and a machine learning model, our method is able to make resilience predictions of three classes of fault…
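
To make the baseline in the abstract concrete, here is a minimal Python sketch of a random fault injection campaign. It is an illustration only, not the authors' tooling or the PARIS model. It assumes a single-bit-flip fault model applied to one input of a toy kernel. The names `flip_bit`, `kernel`, `classify`, and `campaign` are hypothetical, and so is the mapping of a raised exception to "interruption" and of an out-of-tolerance result to "sdc".

```python
import math
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0..63) of the IEEE 754 double representation of value."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def kernel(data, fault=None):
    """Toy workload (assumption): sum of square roots.

    fault is None for a fault-free run, or (index, bit) to corrupt one operand.
    """
    total = 0.0
    for i, x in enumerate(data):
        if fault is not None and fault[0] == i:
            x = flip_bit(x, fault[1])
        total += math.sqrt(x)  # raises ValueError for negative operands
    return total


def classify(data, golden, fault, rel_tol=1e-9):
    """Label one injected run as 'success', 'sdc', or 'interruption'."""
    try:
        result = kernel(data, fault)
    except ValueError:
        return "interruption"  # the run aborted (assumed stand-in for a crash)
    if math.isclose(result, golden, rel_tol=rel_tol):
        return "success"  # the fault was masked
    return "sdc"  # the run finished with a wrong result


def campaign(data, trials, seed=0):
    """Inject one random bit flip per trial and return the rate of each class."""
    rng = random.Random(seed)
    golden = kernel(data)  # fault-free reference output
    counts = {"success": 0, "sdc": 0, "interruption": 0}
    for _ in range(trials):
        fault = (rng.randrange(len(data)), rng.randrange(64))
        counts[classify(data, golden, fault)] += 1
    return {label: n / trials for label, n in counts.items()}


if __name__ == "__main__":
    print(campaign([1.0, 2.0, 3.0, 4.0], trials=1000))
```

Each trial reruns the whole kernel, so the cost of a campaign grows with the number of injections. That cost is what a prediction model is meant to avoid.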

Taxonomy

Topics: Radiation Effects in Electronics · Distributed systems and fault tolerance · Software System Performance and Reliability