Managing Multi Instance GPUs for High Throughput and Energy Savings

Abhijeet Saraha; Yuanbo Li; Chris Porter; Santosh Pande

arXiv:2508.18556·cs.DC·August 27, 2025

TL;DR

This paper presents partitioning and scheduling schemes for multi-instance GPUs that improve throughput by up to 6.20x and energy efficiency by up to 5.93x on general workloads, with smaller gains on machine learning workloads, including LLMs.

Contribution

The paper introduces dynamic memory estimation, partition fusion, and partition fission, along with process restart to recover from out-of-memory errors and early restart as an optimization, to improve GPU resource utilization and performance.
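
The interplay of memory estimation and restart on out-of-memory can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names, the "smallest partition that fits" policy, and the "retry in the next larger partition" policy are assumptions made purely for illustration. The partition sizes are a subset of the MIG profiles available on a 40 GB A100.

```python
from typing import List, Optional

# A subset of A100 40 GB MIG profiles as (name, memory in GB), smallest first.
MIG_PROFILES = [("1g.5gb", 5), ("2g.10gb", 10), ("3g.20gb", 20), ("7g.40gb", 40)]


def smallest_fitting_profile(estimated_gb: float) -> Optional[int]:
    """Return the index of the smallest profile whose memory covers the estimate."""
    for i, (_, mem_gb) in enumerate(MIG_PROFILES):
        if mem_gb >= estimated_gb:
            return i
    return None


def schedule_with_restart(estimated_gb: float, actual_gb: float) -> List[str]:
    """Simulate placing one job (hypothetical policy, not the paper's).

    The job starts in the smallest partition that fits its memory estimate.
    If its actual footprint exceeds that partition (a simulated out-of-memory
    error), it is restarted in the next larger partition.
    Returns the list of partitions tried, in order.
    """
    tried: List[str] = []
    idx = smallest_fitting_profile(estimated_gb)
    while idx is not None:
        name, mem_gb = MIG_PROFILES[idx]
        tried.append(name)
        if actual_gb <= mem_gb:
            return tried  # the job fits and runs to completion
        # Simulated OOM: move to the next larger partition, if one exists.
        idx = idx + 1 if idx + 1 < len(MIG_PROFILES) else None
    raise MemoryError("job does not fit in any available partition")


# A job estimated at 4 GB that actually needs 12 GB is restarted twice.
attempts = schedule_with_restart(estimated_gb=4, actual_gb=12)
```

In this toy model an underestimate costs one restart per partition size skipped, which is why the accuracy of the memory estimate matters for throughput.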

Findings

01. Up to 6.20x throughput improvement for general workloads

02. Up to 5.93x energy improvement for general workloads

03. Smaller gains for ML and LLM workloads: up to 1.59x throughput and 1.12x energy improvement for ML workloads on an A100 GPU, with 1.43x throughput and 1.11x energy savings also reported

Abstract

Modern GPUs such as the Ampere series (A30, A100) as well as the Hopper series (H100, H200) offer performance as well as security isolation features. They also support a good amount of concurrency, but taking advantage of it can be quite challenging due to the complex constraints on partitioning the chip. In this work, we develop partitioning and scheduling schemes for a variety of workloads, ranging from scientific to modern ML workloads, including LLMs. We develop several schemes involving dynamic memory estimation, partition fusion and partition fission. We also support process restart to recover from out-of-memory errors for workloads and early restart as an optimization. This approach yields up to 6.20x throughput and 5.93x energy improvements for general workloads; and we see 1.59x and 1.12x improvement to throughput and energy, respectively, for ML workloads on an A100 GPU. We…
