Improving GPU Multi-Tenancy Through Dynamic Multi-Instance GPU Reconfiguration
Tianyu Wang, Sheng Li, Bingyao Li, Yue Dai, Ao Li, Geng Yuan, Yufei Ding, Youtao Zhang, Xulong Tang

TL;DR
This paper introduces MIGRator, a dynamic GPU reconfiguration system that uses NVIDIA's Multi-Instance GPU (MIG) technology to optimize multi-tenancy for continuous learning workloads by balancing inference SLO attainment against model accuracy.
Contribution
MIGRator is a novel runtime that formulates GPU reconfiguration as an integer linear programming (ILP) problem, which lets it account for resource contention and workload dynamics in continuous learning.
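To make the idea of treating reconfiguration as an integer program more concrete, here is a minimal toy sketch. It is not the paper's formulation. It assumes a single GPU with 7 compute slices (as on an A100-class device), one inference instance and one retraining instance per time window, and a made-up objective that rewards served inference demand and retraining capacity while charging a fixed penalty for each reconfiguration. The names `best_plan`, `plan_score`, `slo_weight`, and `reconfig_cost` are hypothetical. Real MIG placement rules and memory slices are ignored, and brute-force enumeration stands in for an ILP solver.

```python
from itertools import product

TOTAL_SLICES = 7               # compute slices on an A100-class GPU
INSTANCE_SIZES = (1, 2, 3, 4)  # simplified: both tasks always get an instance


def candidate_configs():
    """All (inference_slices, retrain_slices) pairs that fit on one GPU."""
    return [
        (i, r)
        for i in INSTANCE_SIZES
        for r in INSTANCE_SIZES
        if i + r <= TOTAL_SLICES
    ]


def plan_score(plan, demand, slo_weight, reconfig_cost):
    """Toy objective (an assumption, not taken from the paper)."""
    # Inference benefit: slices that actually cover demand in each window.
    served = sum(slo_weight * min(i, d) for (i, _), d in zip(plan, demand))
    # Retraining benefit: more slices stand in for faster accuracy gains.
    retrain = sum(r for _, r in plan)
    # Each change of partition between consecutive windows costs a penalty.
    changes = sum(1 for a, b in zip(plan, plan[1:]) if a != b)
    return served + retrain - reconfig_cost * changes


def best_plan(demand, slo_weight=3, reconfig_cost=1):
    """Enumerate every integer assignment and keep the best-scoring plan."""
    configs = candidate_configs()
    best = max(
        product(configs, repeat=len(demand)),
        key=lambda p: plan_score(p, demand, slo_weight, reconfig_cost),
    )
    return list(best), plan_score(best, demand, slo_weight, reconfig_cost)


if __name__ == "__main__":
    # Hypothetical inference demand (in slices) for three time windows.
    plan, score = best_plan([1, 4, 1], slo_weight=3, reconfig_cost=1)
    print(plan, score)
```

With a low reconfiguration penalty the plan tends to follow the inference demand window by window. With a high penalty a single static partition wins, which is the trade-off a dynamic reconfiguration runtime has to weigh.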
Findings
MIGRator outperforms state-of-the-art GPU sharing techniques by 17-21%.
It effectively balances inference SLOs and model accuracy.
Dynamic reconfiguration improves GPU utilization and workload performance.
Abstract
Continuous learning (CL) has emerged as one of the most popular deep learning paradigms deployed on modern cloud GPUs. Specifically, CL continuously updates the model parameters (through model retraining) and uses the updated model (if available) to serve inference requests that arrive over time. It is generally beneficial to co-locate retraining and inference to enable timely model updates and avoid model transfer overheads. This creates a need for GPU sharing between retraining and inference. Meanwhile, multiple CL workloads can share modern GPUs in the cloud, leading to multi-tenant execution. In this paper, we observe that prior GPU-sharing techniques are not optimized for multi-tenant CL workloads. Specifically, they do not coherently consider the accuracy of the retrained model and the inference service level objective (SLO) attainment.…
Taxonomy
Topics: Embedded Systems Design Techniques · Parallel Computing and Optimization Techniques
