Intelligent colocation of HPC workloads
Felippe V. Zacarias (1, 2, 3), Vinicius Petrucci (1, 5), Rajiv Nishtala (4), Paul Carpenter (3), Daniel Mossé (5) ((1) Universidade Federal da Bahia, (2) Universitat Politècnica de Catalunya, (3) Barcelona Supercomputing Center, (4) Coop

TL;DR
This paper introduces a machine learning-based resource management approach for HPC systems that predicts performance degradation due to application colocation, enabling optimized scheduling and improving overall efficiency.
Contribution
It presents a novel machine learning model for predicting performance impacts of colocation and an intelligent scheduling scheme integrated into existing resource managers.
Findings
Achieves 7% average performance improvement over standard policies.
Maximum performance improvement reaches 12%.
Effective in reducing performance degradation due to resource contention.
Abstract
Many HPC applications suffer from a bottleneck in the shared caches, instruction execution units, I/O or memory bandwidth, even though the remaining resources may be underutilized. It is hard for developers and runtime systems to ensure that all critical resources are fully exploited by a single application, so an attractive technique for increasing HPC system utilization is to colocate multiple applications on the same server. When applications share critical resources, however, contention on shared resources may lead to reduced application performance. In this paper, we show that server efficiency can be improved by first modeling the expected performance degradation of colocated applications based on measured hardware performance counters, and then exploiting the model to determine an optimized mix of colocated applications. This paper presents a new intelligent resource manager…
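The two-step idea in the abstract (predict degradation from counters, then pick the mix) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the counter names, the profile values, and the hand-written `predict_degradation` function, which stands in for a trained model and is not the model from the paper. The pairing search is a plain exhaustive search, suitable only for a handful of applications.

```python
# Hypothetical per-application counter profiles (illustrative values, not from the paper).
PROFILES = {
    "app_a": {"llc_miss_rate": 0.30, "mem_bw_util": 0.80},
    "app_b": {"llc_miss_rate": 0.05, "mem_bw_util": 0.10},
    "app_c": {"llc_miss_rate": 0.25, "mem_bw_util": 0.70},
    "app_d": {"llc_miss_rate": 0.10, "mem_bw_util": 0.20},
}


def predict_degradation(x, y):
    """Stand-in for a trained model: a contention score for x and y sharing a node.

    Assumption: contention grows when both applications stress the same resource.
    """
    return (x["llc_miss_rate"] * y["llc_miss_rate"]
            + x["mem_bw_util"] * y["mem_bw_util"])


def pairings(apps):
    """Yield every way to split an even-sized list of apps into unordered pairs."""
    if not apps:
        yield []
        return
    first, rest = apps[0], apps[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


def best_colocation(profiles):
    """Return the pairing with the lowest total predicted degradation."""
    apps = sorted(profiles)

    def cost(pairing):
        return sum(predict_degradation(profiles[a], profiles[b])
                   for a, b in pairing)

    return min(pairings(apps), key=cost)


if __name__ == "__main__":
    print(best_colocation(PROFILES))
```

With these made-up profiles, the search avoids placing the two memory-bound applications (`app_a` and `app_c`) on the same node. A production scheduler would replace the exhaustive search with a heuristic and the scoring function with a model trained on measured counters.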
