EdgeServing: Deadline-Aware Multi-DNN Serving at the Edge
Jiahe Cao, Xiaomeng Li, Qiang Liu, Tao Han, Ning Zhang, Weisong Shi

TL;DR
EdgeServing is a deadline-aware system for multi-DNN serving at the edge that optimizes GPU sharing and scheduling to improve latency predictability and reduce SLO violations.
Contribution
EdgeServing combines time-division GPU sharing with early-exit inference and introduces a stability score that quantifies how each candidate scheduling decision affects future queue status.
Findings
Achieves a lower SLO violation ratio and lower P95 latency than baseline schedulers.
Uses early-exit inference to expand scheduling options under latency constraints.
Achieves consistent improvements across multiple hardware platforms.
Abstract
As edge computing expands, serving multiple deep neural network (DNN) models on a single shared GPU has become a common yet challenging scenario, where each scheduling decision affects the tail latency of all concurrent queues. Existing schedulers rely on local heuristics and fail to capture this global impact, while GPU spatial-sharing approaches sacrifice latency predictability. In this paper, we propose EdgeServing, a deadline-aware multi-DNN serving system for edge devices. EdgeServing adopts time-division GPU sharing with early-exit inference for high inference predictability, and introduces a stability score to quantify how each candidate scheduling decision impacts the future queue status. At runtime, it cohesively selects the model, exit point, and batch size to minimize predicted system-wide SLO impact. Experimental results on multiple hardware platforms show that EdgeServing…
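The runtime decision described in the abstract (jointly choosing a model, an exit point, and a batch size) can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not define the stability score, so the code below swaps in a simple stand-in that counts the requests predicted to miss their deadlines if a candidate runs next. The names `impact_score`, `choose_next`, `toy_latency`, and `PER_EXIT_MS`, and all latency numbers, are hypothetical. The sketch assumes that time-division sharing makes per-(model, exit, batch) latency predictable enough to look up from a profiled table.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical profiled latency lookup: (model, exit_point, batch_size) -> ms.
LatencyFn = Callable[[str, int, int], float]


@dataclass(frozen=True)
class Decision:
    model: str
    exit_point: int
    batch_size: int
    score: int  # predicted number of deadline misses (lower is better)


def impact_score(queues: Dict[str, List[float]], latency_ms: LatencyFn,
                 exits: Dict[str, List[int]], model: str, exit_point: int,
                 batch_size: int, now_ms: float) -> int:
    """Stand-in for a system-wide impact score (an assumption, not the paper's).

    Counts served requests that finish late, plus waiting requests in every
    queue that would miss even in the best case (earliest exit, batch of 1)
    once the GPU is free again.
    """
    finish = now_ms + latency_ms(model, exit_point, batch_size)
    missed = 0
    for name, deadlines in queues.items():
        served = batch_size if name == model else 0
        missed += sum(1 for d in deadlines[:served] if d < finish)
        best_case = finish + latency_ms(name, min(exits[name]), 1)
        missed += sum(1 for d in deadlines[served:] if d < best_case)
    return missed


def choose_next(queues: Dict[str, List[float]], latency_ms: LatencyFn,
                exits: Dict[str, List[int]], max_batch: int,
                now_ms: float = 0.0) -> Optional[Decision]:
    """Enumerate (model, exit, batch) candidates and keep the lowest score.

    Ties prefer a deeper exit, then a larger batch.
    """
    best, best_key = None, None
    for model, deadlines in queues.items():
        for exit_point in exits[model]:
            for batch_size in range(1, min(max_batch, len(deadlines)) + 1):
                score = impact_score(queues, latency_ms, exits, model,
                                     exit_point, batch_size, now_ms)
                key = (score, -exit_point, -batch_size)
                if best_key is None or key < best_key:
                    best = Decision(model, exit_point, batch_size, score)
                    best_key = key
    return best


# Toy latency table (made-up numbers): base cost per exit plus 2 ms per
# extra request in the batch.
PER_EXIT_MS = {"det": {1: 8.0, 2: 15.0}, "cls": {1: 5.0, 2: 10.0}}


def toy_latency(model: str, exit_point: int, batch_size: int) -> float:
    return PER_EXIT_MS[model][exit_point] + 2.0 * (batch_size - 1)


if __name__ == "__main__":
    exits = {"det": [1, 2], "cls": [1, 2]}
    queues = {"det": [20.0, 22.0], "cls": [50.0]}  # absolute deadlines in ms
    print(choose_next(queues, toy_latency, exits, max_batch=4))
```

With loose deadlines the sketch picks the full-depth exit, and when a deadline is tight it falls back to the early exit. That is one way to read the finding that early-exit inference expands the scheduling options under latency constraints.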
