An Interpretable Latency Model for Speculative Decoding in LLM Serving
Linghao Kong, Megan Flynn, Michael Peng, Nir Shavit, Mark Kurtz, Alexandre Marques

TL;DR
This paper presents an interpretable latency model for speculative decoding in large language model serving, explaining how draft length, acceptance rate, and verifier and drafter model sizes influence latency and speedup as server load varies.
Contribution
The authors develop a simple, validated latency model for speculative decoding that accounts for varying request load and extends to mixture-of-experts models, aiding deployment decisions.
Findings
The model accurately predicts latency across diverse conditions.
Speedups diminish as server load increases, because part of the per-request demand is load-dependent.
Draft length, acceptance rate, and verifier and drafter model sizes significantly affect latency.
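To make the interaction between draft length and acceptance rate concrete, here is a minimal sketch of the standard expected-tokens-per-round estimate for speculative decoding. It assumes each drafted token is accepted independently with the same probability, which is a common simplification and not necessarily the form used in the paper. The function name and parameters (`alpha`, `k`) are illustrative.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft-and-verify round.

    Assumption (not taken from the paper): each of the k drafted tokens is
    accepted independently with probability alpha, and the verifier always
    contributes one token of its own. Under that assumption the expectation
    is the geometric sum 1 + alpha + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus the verifier's token.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

Under this assumption the gain from a longer draft saturates: the expectation can never exceed 1 / (1 - alpha), while the drafting cost keeps growing with `k`.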
Abstract
Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller draft model to propose multiple tokens that are verified by a larger target model in parallel. While prior work demonstrates substantial speedups in isolated or fixed-batch settings, the behavior of SD in production serving systems remains poorly understood: request load varies over time, and effective batch size emerges from the serving system rather than being directly controlled or observed. In this work, we develop a simple and interpretable latency model for SD in LLM serving. We infer effective batch size from request rate using Little's Law and decompose per-request demand into load-independent and load-dependent components for prefill, drafting, and verification. We validate our model using extensive measurements from vLLM across verifier and drafter model sizes, prefill and decode…
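The Little's Law step in the abstract can be illustrated with a small sketch. Little's Law states that the mean number of requests in the system equals the arrival rate times the mean time each request spends there. The sketch below assumes, purely for illustration, that per-request latency is linear in effective batch size, with a load-independent term and a load-dependent term; the paper's actual decomposition across prefill, drafting, and verification is not reproduced here, and all names are hypothetical.

```python
def effective_batch_size(request_rate: float,
                         fixed_latency: float,
                         per_request_latency: float) -> float:
    """Solve B = request_rate * W(B) for the effective batch size B.

    Illustrative assumption: W(B) = fixed_latency + per_request_latency * B,
    i.e. a load-independent term plus a term that grows with batch size.
    Substituting into Little's Law gives a closed-form solution.
    """
    utilization = request_rate * per_request_latency
    if utilization >= 1.0:
        # Requests arrive faster than the assumed system can drain them.
        raise ValueError("no steady state: request rate is too high")
    return request_rate * fixed_latency / (1.0 - utilization)


def request_latency(request_rate: float,
                    fixed_latency: float,
                    per_request_latency: float) -> float:
    """Mean per-request latency at the steady-state batch size."""
    batch = effective_batch_size(request_rate, fixed_latency, per_request_latency)
    return fixed_latency + per_request_latency * batch


# Example with made-up numbers: 2 requests/s, 1 s fixed cost,
# 0.25 s of added latency per concurrently served request.
batch = effective_batch_size(2.0, 1.0, 0.25)
latency = request_latency(2.0, 1.0, 0.25)
```

In this toy form, latency rises sharply as `request_rate * per_request_latency` approaches 1, which is one way to see why speedups measured at a fixed batch size need not carry over to a loaded server.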
