AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding

Zikun Li; Zhuofu Chen; Remi Delacourt; Gabriele Oliaro; Zeyu Wang; Qinghan Chen; Shuhuai Lin; April Yang; Zhihao Zhang; Zhuoming Chen; Sean Lai; Xinhao Cheng; Xupeng Miao; Zhihao Jia

arXiv:2501.12162·cs.CL·May 20, 2025

Open Access

TL;DR

AdaServe is an LLM serving system that uses SLO-customized speculative decoding to meet diverse latency requirements concurrently, reducing SLO violations by up to 4.3× and improving goodput by up to 1.9×.

Contribution

AdaServe formulates multi-SLO serving as a constrained optimization problem and introduces a hardware-aware algorithm that constructs a speculation tree tailored to each request's latency target.

Findings

01. Reduces SLO violations by up to 4.3×

02. Improves goodput by up to 1.9×

03. Adapts dynamically to workload variations

Abstract

Modern large language model (LLM) applications exhibit diverse service-level objectives (SLOs), from low-latency requirements in interactive coding assistants to more relaxed constraints in data wrangling tasks. Existing LLM serving systems, which rely on uniform batching and scheduling strategies, often fail to meet these heterogeneous SLOs concurrently. We present AdaServe, the first LLM serving system designed to support efficient multi-SLO serving through SLO-customized speculative decoding. AdaServe formulates multi-SLO serving as a constrained optimization problem and introduces a hardware-aware algorithm that constructs a speculation tree tailored to each request's latency target. It features a speculate-select-verify pipeline that enables fine-grained control over decoding speed while maximizing system throughput. AdaServe further adapts to workload variation by dynamically…
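To make the "select" step of the speculate-select-verify idea more concrete, here is a minimal sketch in Python. It is not the paper's algorithm. It assumes each request's speculation candidates form a simple chain (not a full tree) with non-increasing draft path probabilities, that the expected tokens produced per iteration are one verified token plus the sum of the selected path probabilities, and that the hardware can verify a fixed token budget per iteration. The names `Request`, `required_tokens`, `path_probs`, and `select_tokens` are hypothetical.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    name: str
    # Hypothetical: expected tokens per iteration needed to meet this request's SLO.
    required_tokens: float
    # Hypothetical: draft-chain path probabilities, assumed non-increasing.
    path_probs: List[float]
    chosen: int = 0  # number of draft tokens selected for verification


def expected_tokens(req: Request) -> float:
    # One token always comes out of verification, plus expected accepted drafts.
    return 1.0 + sum(req.path_probs[:req.chosen])


def select_tokens(requests: List[Request], budget: int) -> Dict[str, int]:
    # Phase 1 (assumed): serve the tightest latency targets first, adding draft
    # tokens until each request's expected progress meets its requirement.
    for req in sorted(requests, key=lambda r: -r.required_tokens):
        while (budget > 0 and req.chosen < len(req.path_probs)
               and expected_tokens(req) < req.required_tokens):
            req.chosen += 1
            budget -= 1

    # Phase 2 (assumed): spend the leftover budget on whichever next draft
    # token has the highest acceptance probability, to raise throughput.
    heap = [(-r.path_probs[r.chosen], i)
            for i, r in enumerate(requests) if r.chosen < len(r.path_probs)]
    heapq.heapify(heap)
    while budget > 0 and heap:
        _, i = heapq.heappop(heap)
        req = requests[i]
        req.chosen += 1
        budget -= 1
        if req.chosen < len(req.path_probs):
            heapq.heappush(heap, (-req.path_probs[req.chosen], i))

    return {r.name: r.chosen for r in requests}


if __name__ == "__main__":
    reqs = [
        Request("chat", required_tokens=2.0, path_probs=[0.9, 0.7, 0.4, 0.2]),
        Request("batch", required_tokens=1.0, path_probs=[0.8, 0.5, 0.3]),
    ]
    print(select_tokens(reqs, budget=4))
```

In this toy setup the latency-sensitive request is given enough draft tokens to reach its target before any leftover budget goes to the relaxed request. A real system would select over trees and derive the budget from measured hardware behavior.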
