ML Inference Scheduling with Predictable Latency
Haidong Zhao, Nikolaos Georgantas

TL;DR
This paper examines the challenges of scheduling ML inference requests on GPUs, focusing on how interference between concurrent tasks affects latency, and evaluates the limitations of existing interference prediction methods.
Contribution
The paper evaluates the limitations of current interference prediction approaches, highlighting their coarse granularity and their reliance on static models under dynamic workloads.
Findings
Coarse-grained prediction methods lose significant accuracy because they ignore runtime co-location dynamics.
Static models perform poorly under workload changes.
Interference effects critically impact latency and scheduling effectiveness.
Abstract
Machine learning (ML) inference serving systems can schedule requests to improve GPU utilization and to meet service level objectives (SLOs) or deadlines. However, improving GPU utilization may compromise latency-sensitive scheduling, as concurrent tasks contend for GPU resources and thereby introduce interference. Given that interference effects introduce unpredictability in scheduling, neglecting them may compromise SLO or deadline satisfaction. Nevertheless, existing interference prediction approaches remain limited in several respects, which may restrict their usefulness for scheduling. First, they are often coarse-grained, which ignores runtime co-location dynamics and thus restricts their accuracy in interference prediction. Second, they tend to use a static prediction model, which may not effectively cope with different workload characteristics. In this paper, we evaluate the…
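To illustrate why granularity matters, the following is a minimal sketch, not the paper's method. It assumes a single co-runner that occupies the GPU during a fixed time window and a constant slowdown factor while both tasks are active. The names `Task`, `coarse_latency`, and `overlap_aware_latency` are hypothetical and exist only for this example. The coarse estimate applies the slowdown to the whole run, while the overlap-aware estimate applies it only to the part of the run that actually overlaps with the co-runner.

```python
from dataclasses import dataclass


@dataclass
class Task:
    start: float    # time the task begins executing on the GPU (ms)
    solo_ms: float  # latency when the task runs alone (ms)


def coarse_latency(task: Task, slowdown: float) -> float:
    """Coarse-grained estimate: the slowdown is applied to the entire run."""
    return task.solo_ms * slowdown


def overlap_aware_latency(task: Task, other_start: float, other_end: float,
                          slowdown: float) -> float:
    """Apply the slowdown only while the co-runner is active.

    Simplifying assumption: the co-runner's window [other_start, other_end)
    is fixed and is not itself stretched by contention.
    """
    t = task.start
    remaining = task.solo_ms  # remaining work, measured in solo milliseconds

    # Phase 1: run alone until the co-runner starts.
    if t < other_start:
        run = min(remaining, other_start - t)
        t += run
        remaining -= run

    # Phase 2: contended execution; progress is slowed by `slowdown`.
    if remaining > 0 and t < other_end:
        window = other_end - t
        work = min(remaining, window / slowdown)
        t += work * slowdown
        remaining -= work

    # Phase 3: the co-runner has finished; run the rest alone.
    t += remaining
    return t - task.start


if __name__ == "__main__":
    task = Task(start=0.0, solo_ms=10.0)
    print(coarse_latency(task, slowdown=1.5))
    print(overlap_aware_latency(task, other_start=8.0, other_end=100.0,
                                slowdown=1.5))
```

In this toy setting the two estimates diverge whenever the co-runner overlaps with only part of the task, which is the kind of runtime co-location dynamic that a coarse-grained predictor cannot see.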
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy
