Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach

Urvij Saroliya; Eishi Arima; Dai Liu; Martin Schulz

arXiv:2405.08754·cs.DC·May 15, 2024

TL;DR

This paper presents a reinforcement learning-based method for hierarchical resource partitioning on modern GPUs. It co-schedules multiple jobs on one GPU using NVIDIA's MPS (Multi-Process Service) and MIG (Multi-Instance GPU) features, improving throughput by up to 1.87x over time-sharing.

Contribution

The paper introduces a reinforcement learning approach that jointly optimizes hierarchical GPU resource partitioning and job co-scheduling to improve resource utilization and throughput.

Findings

1. Maximum throughput improvement of 1.87x over time-sharing.

2. Partitioning and co-scheduling are optimized jointly rather than separately.

3. Demonstrated on the NVIDIA GPU features MPS and MIG.

Abstract

GPU-based heterogeneous architectures are now commonly used in HPC clusters. Due to their architectural simplicity specialized for data-level parallelism, GPUs can offer much higher computational throughput and memory bandwidth than CPUs in the same generation do. However, as the available resources in GPUs have increased exponentially over the past decades, it has become increasingly difficult for a single program to fully utilize them. As a consequence, the industry has started supporting several resource partitioning features in order to improve the resource utilization by co-scheduling multiple programs on the same GPU die at the same time. Driven by the technological trend, this paper focuses on hierarchical resource partitioning on modern GPUs, and as an example, we utilize a combination of two different features available on recent NVIDIA GPUs in a hierarchical manner: MPS…
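
To make the idea of a two-level partition concrete, the following is a minimal sketch and not the authors' implementation. It assumes an outer level of MIG-style instances carved from a fixed number of compute slices, and an inner level of MPS-style percentage shares assigned to the jobs co-located inside each instance. The names `MpsShare`, `MigInstance`, `is_valid`, the constant `TOTAL_SLICES = 7`, and the no-oversubscription rule are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: the GPU exposes 7 compute slices at the outer (MIG-like) level.
TOTAL_SLICES = 7


@dataclass
class MpsShare:
    """Inner level: one job and its share of the enclosing instance."""
    job: str
    thread_pct: int  # percent of the instance's compute, 1..100


@dataclass
class MigInstance:
    """Outer level: an instance that owns a number of compute slices."""
    slices: int
    shares: List[MpsShare] = field(default_factory=list)


def is_valid(partition: List[MigInstance]) -> bool:
    """Check a hierarchical partition under the sketch's assumptions.

    Outer level: instance sizes are positive and fit in TOTAL_SLICES.
    Inner level: shares are positive and sum to at most 100 percent
    (this sketch forbids oversubscription for simplicity).
    """
    if sum(inst.slices for inst in partition) > TOTAL_SLICES:
        return False
    for inst in partition:
        if inst.slices < 1:
            return False
        if any(share.thread_pct < 1 for share in inst.shares):
            return False
        if sum(share.thread_pct for share in inst.shares) > 100:
            return False
    return True


# Hypothetical example: two instances, the larger one shared by two jobs.
example = [
    MigInstance(slices=4, shares=[MpsShare("jobA", 60), MpsShare("jobB", 40)]),
    MigInstance(slices=3, shares=[MpsShare("jobC", 100)]),
]
```

A scheduler searching over such configurations, whether by reinforcement learning or any other method, would only need to consider candidates for which `is_valid` returns true.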

Taxonomy

Methods: Sparse Evolutionary Training