Strait: Perceiving Priority and Interference in ML Inference Serving
Haidong Zhao, Nikolaos Georgantas

TL;DR
Strait is a GPU inference serving system that improves deadline satisfaction for high-priority ML tasks by modeling contention and interference, enabling priority-aware scheduling under high utilization.
Contribution
It introduces a novel latency estimation and scheduling approach that accounts for data transfer contention and kernel interference in ML inference serving.
Findings
Reduces deadline violations for high-priority tasks by 1.02 to 11.18 percentage points under intense workloads.
Achieves more equitable performance than preemption-based approaches.
Effectively handles intense workloads with high GPU utilization.
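The reductions above are stated in percentage points, which are absolute differences between two violation rates and not relative changes. The short sketch below illustrates the distinction. The function names and the 20% and 10% rates are hypothetical values chosen for the arithmetic, not results from the paper.

```python
def percentage_point_reduction(before_pct, after_pct):
    """Absolute difference between two rates that are already given in percent."""
    return before_pct - after_pct


def relative_reduction_percent(before_pct, after_pct):
    """Reduction expressed as a fraction of the starting rate, in percent."""
    return (before_pct - after_pct) / before_pct * 100


# Hypothetical violation rates, for illustration only (not from the paper).
baseline_violation_pct = 20.0
improved_violation_pct = 10.0

pp = percentage_point_reduction(baseline_violation_pct, improved_violation_pct)
rel = relative_reduction_percent(baseline_violation_pct, improved_violation_pct)
print(f"{pp} percentage points, {rel}% relative reduction")
```

With these made-up rates, the same improvement reads as a 10 percentage point reduction or as a 50% relative reduction, so the two figures should not be compared directly.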
Abstract
Machine learning (ML) inference serving systems host deep neural network (DNN) models and schedule incoming inference requests across deployed GPUs. However, limited support for task prioritization and insufficient latency estimation under concurrent execution may restrict their applicability in on-premises scenarios. We present Strait, a serving system designed to enhance deadline satisfaction for dual-priority inference traffic under high GPU utilization. To improve latency estimation, Strait models potential contention during data transfer and accounts for kernel execution interference through an adaptive prediction model. By drawing on these predictions, it performs priority-aware scheduling to deliver differentiated handling. Evaluation results under intense workloads suggest that Strait reduces deadline violations for high-priority tasks by 1.02 to 11.18 percentage points…
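The flow the abstract describes (estimate latency under transfer contention and kernel interference, then schedule with priority in mind) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Strait's actual method: the `Task` fields, the equal-bandwidth-sharing transfer model, the fixed `interference_factor` standing in for the adaptive prediction model, and the admit-or-reject policy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    high_priority: bool
    deadline_ms: float   # time remaining until the deadline
    transfer_ms: float   # host-to-device copy time on an idle link (assumed known)
    kernel_ms: float     # kernel time on an otherwise idle GPU (assumed known)


def estimate_latency_ms(task, concurrent_transfers, interference_factor):
    """Hypothetical latency estimate.

    Assumes concurrent transfers share link bandwidth equally and that
    co-running kernels slow execution by a fixed fraction. The paper uses an
    adaptive prediction model instead; this constant is only a stand-in.
    """
    transfer = task.transfer_ms * (1 + concurrent_transfers)
    kernel = task.kernel_ms * (1.0 + interference_factor)
    return transfer + kernel


def schedule(tasks, concurrent_transfers, interference_factor):
    """Consider high-priority tasks first, then earlier deadlines.

    A task is admitted only if its estimated latency fits its deadline.
    """
    ordered = sorted(tasks, key=lambda t: (not t.high_priority, t.deadline_ms))
    admitted, rejected = [], []
    for t in ordered:
        est = estimate_latency_ms(t, concurrent_transfers, interference_factor)
        if est <= t.deadline_ms:
            admitted.append(t.name)
        else:
            rejected.append(t.name)
    return admitted, rejected


# Made-up tasks for illustration only.
tasks = [
    Task("low", False, deadline_ms=50, transfer_ms=2, kernel_ms=10),
    Task("high", True, deadline_ms=30, transfer_ms=2, kernel_ms=10),
    Task("tight", True, deadline_ms=10, transfer_ms=2, kernel_ms=10),
]
admitted, rejected = schedule(tasks, concurrent_transfers=1, interference_factor=0.5)
print(admitted, rejected)
```

In this toy setup every task is estimated at 19 ms (4 ms of contended transfer plus 15 ms of slowed kernel time), so the task with a 10 ms deadline is rejected while the other two are admitted, with the high-priority one considered first.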
