Serving Compound Inference Systems on Datacenter GPUs
Sriram Devata, Rahul Singh, Sarita Adve

TL;DR
JigsawServe is a GPU serving framework for compound inference systems that jointly optimizes latency, accuracy, and GPU cost by adaptively choosing model variants and spatially partitioning GPUs among the tasks of each request.
Contribution
The paper presents JigsawServe as the first serving framework to jointly optimize latency, accuracy, and GPU resource cost for compound inference systems in datacenters.
Findings
Increases maximum serviceable demand by 11.3x over prior work.
Uses only 43.3% of GPU resources while maintaining SLOs.
Keeps latency SLO violations below 0.6% across scenarios.
Abstract
Applications in emerging domains such as XR are being built as compound inference systems, where multiple ML models are composed in the form of a task graph to service each request. Serving these compound systems efficiently raises two questions: how to apportion end-to-end latency and accuracy budgets between different tasks in a compound inference system, and how to allocate resources effectively for different models with varying resource requirements. We present JigsawServe, the first serving framework that jointly optimizes for latency, accuracy, and cost in terms of GPU resources by adaptively choosing model variants and performing fine-grained resource allocation by spatially partitioning the GPUs for each task of a compound inference system. Analytical evaluation of a system with a large number of GPUs shows that JigsawServe can increase the maximum serviceable demand (in…
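The variant-selection question raised in the abstract can be illustrated with a toy sketch. This is not the authors' method: it uses brute-force enumeration over a hypothetical two-task linear pipeline, and every name and number in it (the tasks, variants, latencies, accuracies, and GPU fractions) is invented for illustration. It also assumes that end-to-end latency is the sum of per-task latencies, that end-to-end accuracy is the product of per-task accuracies, and that cost is the total fraction of a GPU reserved, as a stand-in for spatial partitioning.

```python
from itertools import product

# Hypothetical variants per task: (name, latency_ms, accuracy, gpu_fraction).
# All values are made up for illustration; none come from the paper.
VARIANTS = {
    "detect": [("det-small", 10.0, 0.80, 0.25), ("det-large", 25.0, 0.92, 0.50)],
    "classify": [("cls-small", 5.0, 0.85, 0.125), ("cls-large", 15.0, 0.95, 0.25)],
}

# Assumed task graph: a simple linear chain, detect -> classify.
PIPELINE = ["detect", "classify"]


def choose_variants(latency_slo_ms, min_accuracy):
    """Return (gpu_cost, [variant names]) for the cheapest feasible choice,
    or None if no combination meets both end-to-end budgets."""
    best = None
    # Enumerate one variant per task (fine for a toy; does not scale).
    for combo in product(*(VARIANTS[task] for task in PIPELINE)):
        latency = sum(v[1] for v in combo)  # assumed: latencies add along the chain
        accuracy = 1.0
        for v in combo:
            accuracy *= v[2]  # assumed: accuracies multiply along the chain
        cost = sum(v[3] for v in combo)  # total GPU fraction reserved
        if latency <= latency_slo_ms and accuracy >= min_accuracy:
            if best is None or cost < best[0]:
                best = (cost, [v[0] for v in combo])
    return best


if __name__ == "__main__":
    print(choose_variants(40.0, 0.75))
    print(choose_variants(40.0, 0.85))
    print(choose_variants(20.0, 0.75))
```

Under these assumptions, loosening the accuracy budget lets the search pick smaller variants and reserve less GPU, while a tight latency budget can leave no feasible combination at all. A real system would also have to account for demand, batching, and general task graphs, which this sketch ignores.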
Taxonomy
Topics: Cloud Computing and Resource Management · Advanced Neural Network Applications · Big Data and Digital Economy
