FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
Wenyan Chen, Chengzhi Lu, Yanying Lin, Dmitrii Ustiugov

TL;DR
FASER introduces fine-grained phase management for speculative decoding in dynamic LLM serving, improving throughput and reducing latency by adaptively managing speculative tokens and overlapping phases.
Contribution
FASER is a system that dynamically adjusts speculative lengths and overlaps verification with drafting, improving efficiency under volatile online inference workloads.
Findings
FASER improves throughput by up to 53%.
FASER reduces latency by up to 1.92×.
FASER adapts to both low and high load by adjusting speculative lengths at a fine granularity and overlapping the draft and verification phases.
Abstract
Speculative decoding (SD) is a widely used approach for accelerating decode-heavy LLM inference workloads. While online inference workloads are highly dynamic, existing SD systems are rigid and take a coarse-grained approach to SD management. They typically set the speculative token length for an entire batch and serialize the execution of the draft and verification phases. Consequently, these systems fall short at adapting to volatile online inference traffic. Under low load, they exhibit prolonged latency because the draft phase blocks the verification phase for the entire batch, leaving GPU computing resources underutilized. Conversely, under high load, they waste computation on rejected tokens during the verification phase, overloading GPU resources. We introduce FASER, a novel system that features fine-grained SD phase management. First, FASER minimizes computational waste by…
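To make the "wasted computation on rejected tokens" point concrete, here is a minimal Python sketch that compares one batch-wide speculative length against per-request lengths. It is an illustration only, not FASER's actual policy. It assumes each draft token is accepted independently with a per-request probability `alpha` and that verification stops at the first rejection. The function names and the `min_marginal` threshold heuristic are hypothetical.

```python
def expected_accepted(alpha: float, k: int) -> float:
    """Expected accepted draft tokens out of k drafted.

    Assumption: tokens are accepted independently with probability alpha,
    and everything after the first rejection is discarded, so the i-th
    draft token survives with probability alpha**i.
    """
    return sum(alpha ** i for i in range(1, k + 1))


def expected_wasted(alpha: float, k: int) -> float:
    """Expected draft tokens that are verified but then rejected."""
    return k - expected_accepted(alpha, k)


def pick_length(alpha: float, max_k: int, min_marginal: float = 0.5) -> int:
    """Hypothetical heuristic: keep drafting while the next token's
    survival probability alpha**(k+1) is at least min_marginal."""
    k = 0
    while k < max_k and alpha ** (k + 1) >= min_marginal:
        k += 1
    return k


# Three requests in one batch with different (assumed) acceptance rates.
alphas = [0.9, 0.5, 0.3]

# Coarse-grained: one speculative length for the whole batch.
batch_k = 4
batch_waste = sum(expected_wasted(a, batch_k) for a in alphas)

# Fine-grained: one speculative length per request.
per_request_k = [pick_length(a, max_k=8) for a in alphas]
fine_waste = sum(expected_wasted(a, k) for a, k in zip(alphas, per_request_k))

print("per-request lengths:", per_request_k)
print("expected wasted tokens, batch-wide k=4:", round(batch_waste, 3))
print("expected wasted tokens, per-request k:", round(fine_waste, 3))
```

Under these assumptions, the request with a high acceptance rate gets a longer draft while the low-acceptance requests get short or empty drafts, so fewer verified tokens are thrown away than with a single batch-wide length.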
