What Artificial Intelligence can do for High-Performance Computing systems?
Pierrick Pochelu, Hyacinthe Cartiaux, Julien Schleich

TL;DR
This review explores how artificial intelligence techniques, including machine learning and optimization, can enhance the efficiency and automation of high-performance computing systems, addressing power consumption and operational costs.
Contribution
It provides a comprehensive survey of recent AI applications in HPC, categorizing research areas and identifying integration opportunities and challenges for future development.
Findings
Scheduling is the most active research area.
Supervised performance estimation is foundational for both scheduling and optimization.
Graph neural networks improve anomaly detection.
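As an illustration of how the first two findings relate, here is a minimal sketch of a supervised runtime estimate driving a simple scheduling heuristic. It is not taken from the paper: the `Job` class, the `problem_size` feature, the linear model, and the synthetic history are all hypothetical, chosen only to show the general "predict, then order by a heuristic" pattern.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job record; real schedulers expose many more fields.
    name: str
    problem_size: float  # assumed feature, e.g. number of grid cells


def fit_runtime_model(sizes, runtimes):
    """Ordinary least squares fit of runtime = slope * size + intercept."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(runtimes) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, runtimes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_runtime(model, job):
    """Estimate a job's runtime from the fitted (slope, intercept) pair."""
    slope, intercept = model
    return slope * job.problem_size + intercept


def schedule_shortest_predicted_first(model, queue):
    """Heuristic step: run the jobs with the smallest predicted runtime first."""
    return sorted(queue, key=lambda job: predict_runtime(model, job))


# Synthetic history (assumption): runtimes follow 10 * size + 2 exactly.
history_sizes = [1.0, 2.0, 3.0, 4.0]
history_runtimes = [12.0, 22.0, 32.0, 42.0]

model = fit_runtime_model(history_sizes, history_runtimes)
queue = [Job("cfd", 5.0), Job("md", 1.5), Job("fft", 3.0)]
ordered = schedule_shortest_predicted_first(model, queue)
print([job.name for job in ordered])  # prints ['md', 'fft', 'cfd']
```

The estimator and the ordering rule are deliberately decoupled here, so the learned model could be replaced without touching the heuristic.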
Abstract
High-performance computing (HPC) centers consume substantial power, incurring environmental and operational costs. This review assesses how artificial intelligence (AI), including machine learning (ML) and optimization, improves the efficiency of operational HPC systems. Approximately 1,800 publications from 2019 to 2025 were manually screened using predefined inclusion/exclusion criteria; 74 "AI for HPC" papers were retained and grouped into six application areas: performance estimation, performance optimization, scheduling, surrogate modeling, fault detection, and language-model-based automation. Scheduling is the most active area, spanning research-oriented reinforcement-learning schedulers to production-friendly hybrids that combine ML with heuristics. Supervised performance estimation is foundational for both scheduling and optimization. Graph neural networks and time-series…
Taxonomy
Topics: Software System Performance and Reliability · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems
