TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput
Xiaoxuan Liu, Jongseok Park, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Chen Zhang, Kuntai Du, Xiangxi Mo, Kaichao You, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang

TL;DR
TurboSpec is a dynamic control system that optimizes intra-request parallelism in LLM serving to maximize goodput, adapting to environment conditions and workload variations for consistent performance gains.
Contribution
The paper introduces TurboSpec, an automatic, feedback-based system that tunes intra-request parallelism (speculative decoding) in LLM serving, reducing the need for expert manual tuning and making speculation more robust.
Findings
TurboSpec improves goodput across diverse workloads.
It adapts to different hardware configurations effectively.
Consistent performance gains are demonstrated in real-world tests.
Abstract
Large Language Model (LLM) serving systems batch concurrent user requests to achieve efficient serving. However, in real-world deployments, such inter-request parallelism from batching is often limited by external factors such as low request rates or memory constraints. Recent works focus on intra-request parallelism from speculative decoding as a solution to this problem. Unfortunately, benefits from intra-request parallelism are often fragile, as speculative decoding causes overhead, and speculated tokens may miss. We observe that speculative decoding may degrade LLM serving performance if added naively without tuning to the incoming requests and the speculation method. To alleviate the need for expert tuning and make speculative decoding more robust, we present TurboSpec, a speculation control system that automatically profiles the execution environment and utilizes a feedback-based…
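To make the trade-off in the abstract concrete, here is a minimal sketch of a feedback loop that picks a speculation length to maximize estimated goodput. It is not TurboSpec's actual algorithm. The class `SpecController`, the linear step-time model, and all cost parameters are hypothetical names and assumptions introduced for illustration. The expected-token formula is the standard one for speculative decoding, under the assumption that each drafted token is accepted independently with probability `alpha`.

```python
from dataclasses import dataclass


def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step when k tokens are drafted.

    Assumes each drafted token is accepted independently with probability
    alpha; the target model always contributes one token, hence the k + 1.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


@dataclass
class SpecController:
    """Hypothetical closed-loop controller (illustrative, not the paper's)."""

    draft_cost: float        # assumed seconds to draft one token
    verify_base: float       # assumed fixed seconds per target-model step
    verify_per_token: float  # assumed extra seconds per verified token per request
    max_k: int = 8           # largest speculation length considered
    alpha: float = 0.5       # running estimate of the acceptance rate
    smoothing: float = 0.2   # weight given to the newest observation

    def observe(self, proposed: int, accepted: int) -> None:
        """Feedback step: fold the latest acceptance ratio into the estimate.

        accepted / proposed is only a crude proxy for the per-token
        acceptance probability, since tokens after a rejection are discarded.
        """
        if proposed > 0:
            rate = accepted / proposed
            self.alpha += self.smoothing * (rate - self.alpha)

    def step_time(self, k: int, batch_size: int) -> float:
        """Assumed linear cost model: drafting plus batched verification."""
        verify = self.verify_base + self.verify_per_token * batch_size * (k + 1)
        return k * self.draft_cost + verify

    def choose_k(self, batch_size: int) -> int:
        """Pick the speculation length with the highest estimated goodput."""

        def goodput(k: int) -> float:
            tokens = batch_size * expected_tokens(self.alpha, k)
            return tokens / self.step_time(k, batch_size)

        return max(range(self.max_k + 1), key=goodput)


if __name__ == "__main__":
    ctrl = SpecController(draft_cost=0.001, verify_base=0.020, verify_per_token=0.0005)
    ctrl.alpha = 0.8
    print("small batch, high acceptance:", ctrl.choose_k(batch_size=1))
    ctrl.alpha = 0.2
    print("large batch, low acceptance:", ctrl.choose_k(batch_size=64))
```

Under these assumed costs, the sketch speculates aggressively when the batch is small and drafts are usually accepted, and it turns speculation off (`k = 0`) when the batch is large and most drafts miss. This mirrors the abstract's point that speculation helps only when its overhead is outweighed by accepted tokens.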
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
