Generic and ML Workloads in an HPC Datacenter: Node Energy, Job Failures, and Node-Job Analysis
Xiaoyu Chu, Daniel Hofstätter, Shashikant Ilager, Sacheendra Talluri, Duncan Kampert, Damian Podareanu, Dmitry Duplyakin, Ivona Brandic, Alexandru Iosup

TL;DR
This study analyzes long-term operational data from a national HPC datacenter to compare the impacts of ML and generic workloads on energy, failures, and resource utilization, revealing key differences and challenges.
Contribution
It provides the first comprehensive statistical comparison of ML and generic HPC workloads, including open-source data and analysis tools for further research.
Findings
Power usage from ML jobs causes GPU nodes to run into temperature limitations
ML jobs have higher failure rates and higher median and mean runtimes than generic jobs
Significant energy is wasted on unsuccessful job terminations
Abstract
HPC datacenters offer a backbone to the modern digital society. Increasingly, they run Machine Learning (ML) jobs next to generic, compute-intensive workloads, supporting science, business, and other decision-making processes. However, understanding how ML jobs impact the operation of HPC datacenters, relative to generic jobs, remains desirable but understudied. In this work, we leverage long-term operational data, collected from a national-scale production HPC datacenter, and statistically compare how ML and generic jobs can impact the performance, failures, resource utilization, and energy consumption of HPC datacenters. Our study provides key insights, e.g., ML-related power usage causes GPU nodes to run into temperature limitations, median/mean runtime and failure rates are higher for ML jobs than for generic jobs, both ML and generic jobs exhibit highly variable arrival processes…
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · Advanced Data Storage Technologies
