ML Inference Scheduling with Predictable Latency

Haidong Zhao; Nikolaos Georgantas

arXiv:2512.18725 · cs.LG · December 25, 2025

TL;DR

This paper investigates the scheduling of ML inference requests on GPUs, focusing on how interference between concurrent tasks affects latency, and offers insights into the limitations of existing interference prediction methods.

Contribution

The paper evaluates the limitations of current interference prediction approaches, highlighting their coarse granularity and their reliance on static models under dynamic workloads.

Findings

01. Coarse-grained prediction methods show significant accuracy deviations.
02. Static models perform poorly under workload changes.
03. Interference effects critically impact latency and scheduling effectiveness.

Abstract

Machine learning (ML) inference serving systems can schedule requests to improve GPU utilization and to meet service level objectives (SLOs) or deadlines. However, improving GPU utilization may compromise latency-sensitive scheduling, as concurrent tasks contend for GPU resources and thereby introduce interference. Given that interference effects introduce unpredictability in scheduling, neglecting them may compromise SLO or deadline satisfaction. Nevertheless, existing interference prediction approaches remain limited in several respects, which may restrict their usefulness for scheduling. First, they are often coarse-grained, which ignores runtime co-location dynamics and thus restricts their accuracy in interference prediction. Second, they tend to use a static prediction model, which may not effectively cope with different workload characteristics. In this paper, we evaluate the…
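The tension the abstract describes, where co-locating tasks raises utilization but can push a request past its deadline, can be illustrated with a minimal sketch. This is not the paper's method: the linear slowdown model, the `slowdown_per_task` value of 0.25, and the names `Request`, `predicted_latency_ms`, and `meets_deadline` are all hypothetical and chosen only for illustration. The fixed slowdown factor is an example of the kind of static, coarse-grained predictor the abstract criticizes.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    solo_latency_ms: float  # assumed latency when running alone on the GPU
    deadline_ms: float      # latency budget (SLO or deadline) for the request


def predicted_latency_ms(req: Request, num_colocated: int,
                         slowdown_per_task: float = 0.25) -> float:
    """Hypothetical static model: each co-located task adds a fixed
    fraction of the solo latency, regardless of what that task is doing."""
    return req.solo_latency_ms * (1.0 + slowdown_per_task * num_colocated)


def meets_deadline(req: Request, num_colocated: int,
                   slowdown_per_task: float = 0.25) -> bool:
    """Check whether the predicted latency fits within the deadline."""
    return predicted_latency_ms(req, num_colocated, slowdown_per_task) <= req.deadline_ms


req = Request("example-model", solo_latency_ms=10.0, deadline_ms=14.0)

# Ignoring interference, the request appears safe to schedule.
print(meets_deadline(req, num_colocated=0))

# With two co-located tasks, the same request is predicted to miss its deadline.
print(meets_deadline(req, num_colocated=2))
```

A scheduler that neglected interference would accept the request in both cases. Because the slowdown factor here never changes, the sketch also cannot adapt to runtime co-location dynamics or to different workload characteristics, which are the two limitations the abstract names.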

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy