Improving GPU Multi-Tenancy Through Dynamic Multi-Instance GPU Reconfiguration

Tianyu Wang; Sheng Li; Bingyao Li; Yue Dai; Ao Li; Geng Yuan; Yufei Ding; Youtao Zhang; Xulong Tang

arXiv:2407.13126·cs.DC·July 19, 2024

TL;DR

This paper introduces MIGRator, a dynamic GPU reconfiguration system built on NVIDIA's Multi-Instance GPU (MIG) technology. It targets multi-tenant continuous learning workloads and balances inference SLO attainment against model accuracy.

Contribution

MIGRator is a novel runtime that formulates GPU reconfiguration as an ILP problem, effectively managing resource contention and workload dynamics for continuous learning.
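To make the idea of an integer allocation problem concrete, here is a toy sketch. It is not the paper's formulation: it replaces the ILP solver with exhaustive search, and the function name `best_plan`, its parameters, and the objective (requests served, subject to a minimum retraining budget) are all assumptions made for illustration. It also ignores the fact that real MIG hardware only permits certain instance sizes.

```python
from itertools import product

# Assumption: 7 compute slices, as on an A100-class GPU with MIG enabled.
TOTAL_SLICES = 7


def best_plan(demand, retrain_need, capacity_per_slice):
    """Pick an integer number of inference slices for each time window.

    demand: hypothetical inference requests arriving in each window
    retrain_need: hypothetical total slice-windows retraining must receive
    capacity_per_slice: hypothetical requests one slice serves per window
    Returns (served_requests, plan), or None if no plan is feasible.
    """
    best = None
    # Enumerate every integer assignment (a stand-in for an ILP solver).
    for plan in product(range(TOTAL_SLICES + 1), repeat=len(demand)):
        # Slices not given to inference in a window go to retraining.
        retrain_slices = sum(TOTAL_SLICES - s for s in plan)
        if retrain_slices < retrain_need:
            continue  # constraint: retraining must get enough resources
        # Objective: requests served, capped by demand in each window.
        served = sum(min(d, s * capacity_per_slice)
                     for d, s in zip(demand, plan))
        if best is None or served > best[0]:
            best = (served, plan)
    return best


if __name__ == "__main__":
    print(best_plan([10, 40, 20], retrain_need=9, capacity_per_slice=10))
```

The search space grows exponentially with the number of windows, which is one reason such problems are usually handed to an ILP solver rather than enumerated.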

Findings

01

MIGRator outperforms state-of-the-art GPU sharing techniques by 17-21%.

02

It effectively balances inference SLOs and model accuracy.

03

Dynamic reconfiguration improves GPU utilization and workload performance.

Abstract

Continuous learning (CL) has emerged as one of the most popular deep learning paradigms deployed on modern cloud GPUs. Specifically, CL continuously updates the model parameters (through model retraining) and uses the updated model (if available) to serve inference requests that arrive over time. It is generally beneficial to co-locate retraining and inference to enable timely model updates and avoid model transfer overheads. This creates the need for GPU sharing between retraining and inference. Meanwhile, multiple CL workloads can share modern GPUs in the cloud, leading to multi-tenant execution. In this paper, we observe that prior GPU-sharing techniques are not optimized for multi-tenant CL workloads. Specifically, they do not coherently consider the accuracy of the retrained model and the inference service level objective (SLO) attainment.…

Taxonomy

Topics: Embedded Systems Design Techniques · Parallel Computing and Optimization Techniques