A Survey of Multi-Tenant Deep Learning Inference on GPU
Fuxun Yu, Di Wang, Longfei Shangguan, Minjia Zhang, Chenchen Liu, Xiang Chen

TL;DR
This survey reviews the challenges and recent advances in multi-tenant deep learning inference on GPUs, highlighting optimization strategies to improve resource utilization and system performance.
Contribution
It categorizes emerging challenges and summarizes recent technological innovations in multi-tenant DL inference on GPU systems.
Findings
Identifies key challenges in multi-tenant DL inference.
Summarizes recent optimization techniques and innovations.
Provides a comprehensive overview of the entire optimization stack.
Abstract
Deep Learning (DL) models have achieved superior performance. Meanwhile, computing hardware such as NVIDIA GPUs has also demonstrated strong computing scaling trends, with 2x throughput and memory bandwidth for each generation. With such strong computing scaling of GPUs, multi-tenant deep learning inference, which co-locates multiple DL models on the same GPU, has become widely deployed to improve resource utilization, enhance serving throughput, reduce energy cost, etc. However, achieving efficient multi-tenant DL inference is challenging and requires thorough full-stack system optimization. This survey aims to summarize and categorize the emerging challenges and optimization opportunities for multi-tenant DL inference on GPU. By overviewing the entire optimization stack, summarizing the multi-tenant computing innovations, and elaborating the recent technological advances, we hope that this…
