A Machine Learning Approach to Online Fault Classification in HPC Systems
Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

TL;DR
This paper presents a machine learning-based method that classifies faults in HPC systems online, on streamed data, so that corrective actions can be taken in real time. It is supported by a new fault injection tool, FINJ, and aims to improve system resiliency with high accuracy and low overhead.
Contribution
It introduces an online fault classification approach for HPC systems, supported by a new fault injection tool and a publicly available fault dataset.
Findings
Achieves near-perfect fault classification accuracy
Operates with low computational overhead
Supports real-time fault detection and response
Abstract
As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures becomes essential for continued operation. Central to this objective is fault injection, which is the deliberate triggering of faults in a system so as to observe their behavior in a controlled environment. In this paper, we propose a fault classification method for HPC systems based on machine learning. The novelty of our approach rests with the fact that it can be operated on streamed data in an online manner, thus opening the possibility to devise and enact control actions on the target system in real-time. We introduce a high-level, easy-to-use fault injection tool called FINJ, with a…
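To make the idea of classifying faults on streamed data concrete, the following is a minimal sketch and not the authors' implementation. It assumes each sample is a tuple of numeric system metrics, summarises the most recent samples in a sliding window, and labels each window with a simple nearest-centroid classifier. The metric values, the window size, the `memory_leak` label, and every function and class name here are hypothetical choices made for illustration; the paper's actual features and models may differ.

```python
from collections import deque
from statistics import mean, pstdev
import math


def window_features(window):
    """Summarise a window of samples as mean, std, min and max per metric."""
    feats = []
    for column in zip(*window):  # one column per metric
        feats.extend([mean(column), pstdev(column), min(column), max(column)])
    return feats


class NearestCentroid:
    """Toy classifier (an assumption): predicts the label of the closest class centroid."""

    def fit(self, X, y):
        groups = {}
        for feats, label in zip(X, y):
            groups.setdefault(label, []).append(feats)
        self.centroids = {
            label: [mean(col) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict_one(self, feats):
        return min(
            self.centroids,
            key=lambda label: math.dist(feats, self.centroids[label]),
        )


def classify_stream(samples, model, window_size=4, step=4):
    """Yield (sample_index, label) each time a full window is available."""
    window = deque(maxlen=window_size)  # keeps only the latest samples
    for i, sample in enumerate(samples, start=1):
        window.append(sample)
        if len(window) == window_size and i % step == 0:
            yield i, model.predict_one(window_features(window))


# Hypothetical (cpu_load, memory_use) samples, one labelled window per class.
healthy = [(0.20, 0.30), (0.25, 0.30), (0.20, 0.35), (0.22, 0.30)]
leak = [(0.20, 0.70), (0.20, 0.80), (0.25, 0.90), (0.20, 0.95)]

model = NearestCentroid().fit(
    [window_features(healthy), window_features(leak)],
    ["healthy", "memory_leak"],
)

# Replay the samples as if they arrived one at a time.
predictions = list(classify_stream(healthy + leak, model))
print(predictions)
```

Because each prediction depends only on the samples currently held in the window, the classifier can run alongside the monitored system and emit a label as soon as a window fills, which is what makes a timely control action possible.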
