Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps

Eishi Arima; Minjoon Kang; Issa Saba; Josef Weidendorfer; Carsten Trinitis; Martin Schulz

arXiv:2405.03838 · cs.DC · May 8, 2024

TL;DR

This paper presents a systematic methodology for optimizing GPU resource partitioning, job allocation, and power capping in heterogeneous HPC systems to improve utilization and energy efficiency under power constraints.

Contribution

It introduces a novel approach combining hardware-level GPU partitioning with power capping, tailored for power-constrained HPC environments, supported by interference and scalability modeling.

Findings

01 Achieves near-optimal resource and power configurations across diverse workloads.

02 Improves GPU utilization and energy efficiency in power-capped HPC systems.

03 Demonstrates effectiveness through experimental validation.

Abstract

CPU-GPU heterogeneous systems are now commonly used in HPC (High-Performance Computing). However, improving the utilization and energy efficiency of such systems remains one of the most critical issues. Because a single program typically cannot fully utilize all resources within a node or chip, co-scheduling (or co-locating) multiple programs with complementary resource requirements is a promising solution. Meanwhile, as power consumption has become a first-class design constraint for HPC systems, such co-scheduling techniques should be well tailored to power-constrained environments. To this end, the industry has recently started supporting hardware-level resource partitioning features on modern GPUs to enable efficient co-scheduling, and these features can operate alongside existing power capping features. For example, NVIDIA's MIG (Multi-Instance GPU) partitions a single GPU into multiple instances at…
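The idea of jointly choosing a partition split and a power cap for two co-located jobs can be illustrated with a toy exhaustive search. This is a minimal sketch and not the paper's method: the performance model `toy_throughput`, the 7-slice GPU, the candidate power caps, the summed-throughput objective, and the job parameters are all hypothetical placeholders. Interference between jobs is not modeled, and real hardware restricts which partition combinations are valid.

```python
from itertools import product

# Assumption: the GPU is divisible into 7 equal compute slices and any
# split between two jobs is allowed (real hardware is more restrictive).
TOTAL_SLICES = 7
# Assumption: candidate chip-level power caps in watts.
POWER_CAPS_W = (150, 200, 250)


def toy_throughput(slices, cap_w, scalability, power_sensitivity):
    """Hypothetical relative throughput (1.0 = whole GPU at the highest cap).

    scalability: exponent describing how well the job scales with slices.
    power_sensitivity: exponent describing how strongly the cap slows it.
    """
    scale = (slices / TOTAL_SLICES) ** scalability
    power = (cap_w / max(POWER_CAPS_W)) ** power_sensitivity
    return scale * power


def best_configuration(job_a, job_b, power_budget_w):
    """Try every (split, cap) pair within the budget; return the best.

    Each job is a (scalability, power_sensitivity) tuple. The result is
    (score, slices_a, slices_b, cap_w), or None if no cap fits the budget.
    """
    best = None
    for slices_a, cap in product(range(1, TOTAL_SLICES), POWER_CAPS_W):
        if cap > power_budget_w:
            continue  # this cap would exceed the allowed power budget
        slices_b = TOTAL_SLICES - slices_a
        # Objective (assumption): sum of the two relative throughputs.
        score = (toy_throughput(slices_a, cap, *job_a)
                 + toy_throughput(slices_b, cap, *job_b))
        if best is None or score > best[0]:
            best = (score, slices_a, slices_b, cap)
    return best


if __name__ == "__main__":
    # Hypothetical jobs: one that scales well, one that scales poorly.
    print(best_configuration((1.0, 0.2), (0.3, 0.2), power_budget_w=200))
```

Exhaustive search is only practical here because the toy configuration space is tiny; a real system would replace `toy_throughput` with measured or modeled performance.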
