Online Job Failure Prediction in an HPC System
Francesco Antici, Andrea Borghesi, and Zeynep Kiziltan

TL;DR
This paper presents an online machine learning approach that combines classical algorithms with NLP techniques to predict job failures at submit-time in HPC systems, with the aim of improving performance and energy efficiency.
Contribution
It introduces a novel combination of classical ML algorithms with NLP tools for real-time job failure prediction in HPC environments.
Findings
Shows promising prediction accuracy.
Works effectively in an online, real-system setting.
Utilizes data from a production HPC system at CINECA.
Abstract
Modern High Performance Computing (HPC) systems are complex machines, with major impacts on economy and society. Along with their computational capability, their energy consumption is also steadily rising, which is a critical issue given the ongoing environmental and energy crisis. Therefore, developing strategies to optimize HPC system management is of paramount importance, both to guarantee top-tier performance and to improve energy efficiency. One strategy is to act at the workload level and highlight the jobs that are most likely to fail, prior to their execution on the system. Jobs that fail during execution occupy resources unnecessarily and can delay other jobs, adversely affecting system performance and energy consumption. In this paper, we study job failure prediction at submit-time using classical machine learning algorithms. Our novelty lies in (i) the…
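To make the idea of submit-time failure prediction concrete, the following is a minimal sketch, not the authors' method. It assumes hypothetical job fields (user, name, queue), turns them into text tokens, and uses a small multinomial Naive Bayes classifier as a stand-in for the classical algorithms the abstract mentions. The class and function names are invented for illustration, and the training data is a toy example rather than CINECA data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(job):
    """Turn submit-time fields into tokens such as 'name:debug' (field names are hypothetical)."""
    tokens = []
    for field, value in job.items():
        for word in re.findall(r"[a-z0-9]+", str(value).lower()):
            tokens.append(f"{field}:{word}")
    return tokens


class OnlineNaiveBayes:
    """Multinomial Naive Bayes that is updated one finished job at a time."""

    def __init__(self):
        self.class_counts = Counter()             # finished jobs seen per label
        self.token_counts = defaultdict(Counter)  # token frequencies per label
        self.vocab = set()

    def update(self, job, failed):
        """Add a finished job and its observed outcome to the model."""
        label = bool(failed)
        self.class_counts[label] += 1
        for tok in tokenize(job):
            self.token_counts[label][tok] += 1
            self.vocab.add(tok)

    def predict_failure(self, job):
        """Return True if the job looks more like past failures than past successes."""
        total = sum(self.class_counts.values())
        if total == 0:
            return False  # no history yet: assume the job will succeed
        tokens = tokenize(job)
        scores = {}
        for label in (True, False):
            # Laplace-smoothed log prior and log likelihoods
            score = math.log((self.class_counts[label] + 1) / (total + 2))
            denom = sum(self.token_counts[label].values()) + len(self.vocab) + 1
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            scores[label] = score
        return scores[True] > scores[False]


# Toy history of (job, failed) pairs; all values are made up for illustration.
history = [
    ({"user": "alice", "name": "train_model", "queue": "gpu"}, False),
    ({"user": "bob", "name": "debug_test", "queue": "debug"}, True),
    ({"user": "bob", "name": "debug_run", "queue": "debug"}, True),
    ({"user": "alice", "name": "train_large", "queue": "gpu"}, False),
]

model = OnlineNaiveBayes()
for job, failed in history:
    model.update(job, failed)

new_job = {"user": "bob", "name": "debug_job", "queue": "debug"}
flagged = model.predict_failure(new_job)
```

In an online setting, `predict_failure` would be called when a job is submitted and `update` once its outcome is known, so the model follows changes in the workload over time.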
Taxonomy
Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Software System Performance and Reliability
