A Practical Two-Stage Framework for GPU Resource and Power Prediction in Heterogeneous HPC Systems
Beste Oztop, Dhruva Kulkarni, Zhengji Zhao, Ayse Kivilcim Coskun, Kadidia Konate

TL;DR
This paper presents a two-stage framework for predicting GPU resource utilization and power consumption in HPC systems, leveraging workload logs and GPU profiling data for improved scheduling and power management.
Contribution
The paper introduces a novel two-stage prediction framework that combines workload logs and GPU metrics to accurately forecast GPU utilization and power in heterogeneous HPC environments.
Findings
GPU utilization prediction accuracy up to 97% using Slurm logs
Power prediction accuracy up to 92% with GPU metrics
GPU profiling metrics effectively capture application characteristics
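To illustrate how the first two findings could fit together, the Python sketch below chains two predictors. The first maps a feature known at submission time (as recorded in Slurm logs) to maximum GPU utilization; the second maps a GPU utilization metric to average GPU power. This is a minimal sketch under stated assumptions, not the authors' implementation. The summary does not name the models or features used, so the single-feature least-squares fit, the choice of features, feeding the first prediction into the second, the function names (`fit_line`, `predict_job`), and all numbers are hypothetical and synthetic.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def predict(model, x):
    """Evaluate a fitted (slope, intercept) model at x."""
    a, b = model
    return a * x + b


# Synthetic training data, invented for illustration (NOT from the paper).
# Stage 1 (assumed): requested GPU count from Slurm logs -> max GPU utilization (%).
requested_gpus = [1, 2, 4, 8]
max_gpu_util = [30.0, 40.0, 60.0, 100.0]

# Stage 2 (assumed): GPU utilization metric (%) -> average GPU power (W).
avg_gpu_power = [150.0, 180.0, 240.0, 360.0]

stage1 = fit_line(requested_gpus, max_gpu_util)
stage2 = fit_line(max_gpu_util, avg_gpu_power)


def predict_job(n_gpus):
    """Chain the two stages: submission feature -> utilization -> power."""
    util = predict(stage1, n_gpus)
    power = predict(stage2, util)
    return util, power


if __name__ == "__main__":
    util, power = predict_job(6)
    print(f"predicted max GPU utilization: {util:.1f}%")
    print(f"predicted average GPU power:   {power:.1f} W")
```

A real pipeline would likely use many more features per stage and a stronger model; the point here is only the structure, in which a scheduler could obtain a utilization estimate before the job runs and a power estimate from GPU-level metrics.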
Abstract
Efficient utilization of GPU resources and power has become critical with the growing demand for GPUs in high-performance computing (HPC). In this paper, we analyze GPU utilization and GPU memory utilization, as well as the power consumption of the Vienna ab initio Simulation Package (VASP), using historical logs from the Slurm workload manager and GPU performance metrics collected by NVIDIA's Data Center GPU Manager (DCGM). VASP is a widely used materials science application on Perlmutter, an HPE Cray EX system at NERSC based on NVIDIA A100 GPUs. Using our insights from the resource utilization analysis of VASP applications, we propose a resource prediction framework to predict the average GPU power, maximum GPU utilization, and maximum GPU memory utilization values of heterogeneous HPC system applications to enable more efficient scheduling decisions and power-aware system operation. Our…
