Managing Multi Instance GPUs for High Throughput and Energy Savings
Abhijeet Saraha, Yuanbo Li, Chris Porter, Santosh Pande

TL;DR
This paper presents partitioning and scheduling schemes for multi-instance GPUs (MIG) that improve throughput by up to 6.20x and energy efficiency by up to 5.93x on general workloads, with smaller gains on ML and LLM workloads.
Contribution
It introduces dynamic memory estimation, partition fusion, and partition fission, along with process restart (to recover from out-of-memory errors, plus early restart as an optimization), to improve GPU resource utilization and performance.
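To make the restart-on-out-of-memory idea concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes a simplified policy: place a job on the smallest partition that fits its memory estimate, and if it runs out of memory, restart it on the next larger one. The profile list is a subset of the MIG instance profiles on an A100 40GB. The `Job` class, `OutOfMemory` exception, and all function names are hypothetical, and job execution is simulated rather than run on a GPU.

```python
from dataclasses import dataclass

# Subset of A100 40GB MIG instance profiles (name, memory in GB), smallest first.
MIG_PROFILES_GB = [("1g.5gb", 5), ("2g.10gb", 10), ("3g.20gb", 20), ("7g.40gb", 40)]


class OutOfMemory(Exception):
    """Raised by the simulated job when its partition is too small (hypothetical)."""


@dataclass
class Job:
    name: str
    estimated_gb: float  # scheduler's up-front estimate, which may be too low
    actual_gb: float     # true peak usage, unknown to the scheduler


def smallest_fitting_profile(required_gb):
    """Return the index of the smallest profile whose memory covers required_gb."""
    for i, (_, size_gb) in enumerate(MIG_PROFILES_GB):
        if size_gb >= required_gb:
            return i
    raise ValueError(f"no MIG profile can hold {required_gb} GB")


def run_on_partition(job, profile_index):
    """Simulate a run: fail with OutOfMemory if the job's peak exceeds the partition."""
    name, size_gb = MIG_PROFILES_GB[profile_index]
    if job.actual_gb > size_gb:
        raise OutOfMemory(f"{job.name} exceeded {name}")
    return name


def schedule_with_restart(job):
    """Place the job by its estimate; on OOM, restart it on the next larger profile.

    Returns the list of profile names tried, in order; the last one succeeded.
    """
    index = smallest_fitting_profile(job.estimated_gb)
    attempts = []
    while True:
        attempts.append(MIG_PROFILES_GB[index][0])
        try:
            run_on_partition(job, index)
            return attempts
        except OutOfMemory:
            if index + 1 >= len(MIG_PROFILES_GB):
                raise  # already on the largest profile; nothing left to try
            index += 1
```

In this sketch, a job whose estimate is too low costs one restart per undersized partition. That is why an accurate memory estimate, or an early restart before the failure actually occurs, could reduce wasted work.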
Findings
Up to 6.20x throughput improvement for general workloads
Up to 5.93x energy savings for general workloads
Smaller gains in ML and LLM workloads, including 1.43x throughput and 1.11x energy savings; the abstract reports 1.59x and 1.12x for ML workloads on an A100
Abstract
Modern GPUs such as the Ampere series (A30, A100) as well as the Hopper series (H100, H200) offer performance as well as security isolation features. They also support a good amount of concurrency, but taking advantage of it can be quite challenging due to the complex constraints on partitioning the chip. In this work, we develop partitioning and scheduling schemes for a variety of workloads, ranging from scientific to modern ML workloads, including LLMs. We develop several schemes involving dynamic memory estimation, partition fusion and partition fission. We also support process restart to recover from out-of-memory errors for workloads and early restart as an optimization. This approach yields up to 6.20x throughput and 5.93x energy improvements for general workloads; and we see 1.59x and 1.12x improvement to throughput and energy, respectively, for ML workloads on an A100 GPU. We…
