PROSERVE: Unified Multi-Priority Request Scheduling for LLM Serving
Weizhe Huang, Tao Peng, Tongxuan Liu, Donghe Jin, Xianzhe Dong, Ke Zhang

TL;DR
PROSERVE is a unified scheduling framework for LLM serving that jointly optimizes request prioritization and SLO adherence, improving system gain across diverse datasets and a real-world industrial trace.
Contribution
The paper formalizes multi-priority request scheduling as a service gain maximization problem and proposes PROSERVE, a two-tier framework with dynamic batching and gain-aware dispatching.
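To make the "service gain" objective concrete, here is a minimal sketch of how such a metric could be computed. It is an illustration under stated assumptions, not the paper's formulation: the all-or-nothing rule (a request earns its full weight only if it meets its latency target) and the names `Request`, `priority_weight`, `slo_ms`, `latency_ms`, `service_gain`, and `slo_attainment` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # All fields are hypothetical names chosen for this illustration.
    priority_weight: float  # gain earned if the request meets its SLO
    slo_ms: float           # latency target for the request
    latency_ms: float       # latency actually observed


def service_gain(requests):
    """Sum the weights of requests that met their latency target.

    Assumes an all-or-nothing gain per request; the paper's actual
    gain function may differ.
    """
    return sum(r.priority_weight for r in requests if r.latency_ms <= r.slo_ms)


def slo_attainment(requests):
    """Fraction of requests that met their latency target, ignoring priority."""
    if not requests:
        return 0.0
    return sum(r.latency_ms <= r.slo_ms for r in requests) / len(requests)


# Two high-priority requests (weight 4) and one low-priority request (weight 1).
reqs = [
    Request(priority_weight=4.0, slo_ms=200.0, latency_ms=150.0),  # met
    Request(priority_weight=4.0, slo_ms=200.0, latency_ms=260.0),  # missed
    Request(priority_weight=1.0, slo_ms=500.0, latency_ms=300.0),  # met
]

gain = service_gain(reqs)
attainment = slo_attainment(reqs)
```

Under these assumptions the two metrics can diverge: missing a single high-priority request costs as much gain as missing four low-priority ones, while SLO attainment counts every miss equally. That gap is why a scheduler might optimize the two jointly rather than either one alone.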
Findings
Outperforms baselines with up to 35% higher system gain.
Increases SLO attainment by up to 52%.
Remains effective across diverse datasets and a real-world industrial trace.
Abstract
The widespread deployment of large language models (LLMs) for interactive applications necessitates serving systems that can handle thousands of concurrent requests with diverse Service Level Objective (SLO) requirements. A critical yet often overlooked dimension in this context is the inherent priority difference among clients; for instance, business-critical functions demand higher performance guarantees, as fulfilling such requests yields significantly greater business value. However, existing LLM serving schedulers fail to jointly optimize for both SLO attainment and client-level priorities. To bridge this gap, we first formalize multi-priority request scheduling as a service gain maximization problem, where satisfying latency requirements for requests of different priorities contributes varying levels of gain. We then propose PROSERVE, a unified two-tier scheduling…
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques
