On the Partitioning of GPU Power among Multi-Instances
Tirth Vamja, Kaustabha Ray, Felix George, UmaMaheswari C Devi

TL;DR
This paper develops software-based methods, including machine learning models, to accurately estimate power consumption of individual GPU partitions in multi-instance GPUs, addressing a key challenge for energy-efficient cloud computing.
Contribution
It introduces online, workload-specific power models that improve the accuracy of power attribution among GPU partitions, overcoming the limitations of generic models.
Findings
ML-based models outperform generic models in accuracy
Online models adapt to diverse workloads and concurrent usage
Demonstrated on NVIDIA A100 GPUs with ML and LLM workloads
Abstract
Efficient power management in cloud data centers is essential for reducing costs, enhancing performance, and minimizing environmental impact. GPUs, critical for tasks like machine learning (ML) and GenAI, are major contributors to power consumption. NVIDIA's Multi-Instance GPU (MIG) technology improves GPU utilization by enabling isolated partitions with per-partition resource tracking, facilitating GPU sharing by multiple tenants. However, accurately apportioning GPU power consumption among MIG instances remains challenging due to a lack of hardware support. This paper addresses this challenge by developing software methods to estimate power usage per MIG partition. We analyze NVIDIA GPU utilization metrics and find that light-weight methods with good accuracy can be difficult to construct. We hence explore the use of ML-based power models to enable accurate, partition-level power…
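To make the apportioning problem concrete, here is a minimal sketch of one way per-partition estimates could be reconciled with a measured whole-GPU power reading. It is not the paper's method. The function name `apportion_power`, the equal split of idle power, and the proportional scaling of dynamic power are all assumptions made for illustration. The per-partition estimates are taken as given, for example from some power model.

```python
def apportion_power(total_watts, idle_watts, estimates):
    """Split a measured whole-GPU power reading among partitions.

    total_watts: measured power for the whole GPU (W).
    idle_watts:  assumed idle power of the GPU (W).
    estimates:   dict of partition name -> model-estimated dynamic power (W).

    Assumption (illustrative only): idle power is shared equally, and the
    remaining dynamic power is divided in proportion to the estimates, so
    the shares always sum to the measured total.
    """
    if not estimates:
        return {}
    n = len(estimates)
    # Never attribute more idle power than was actually measured.
    idle_share = min(idle_watts, total_watts) / n
    dynamic = max(total_watts - idle_watts, 0.0)
    est_sum = sum(estimates.values())
    shares = {}
    for name, est in estimates.items():
        # Fall back to an equal split when every estimate is zero.
        frac = est / est_sum if est_sum > 0 else 1.0 / n
        shares[name] = idle_share + dynamic * frac
    return shares


if __name__ == "__main__":
    # Hypothetical partition names and readings.
    print(apportion_power(250.0, 50.0, {"mig0": 120.0, "mig1": 40.0}))
```

Scaling the estimates to the measured total keeps the attributed shares consistent with the one quantity the hardware does report, even when the individual estimates are off.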
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Interconnection Networks and Systems
