Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving
Rui Li, Zhaoning Zhang, Libo Zhang, Huaimin Wang, Xiang Fu, Zhiquan Lai

TL;DR
Nightjar is a resource-aware adaptive speculative decoding framework that dynamically optimizes LLM inference throughput and latency by adjusting speculation strategies based on workload conditions.
Contribution
Nightjar introduces an adaptive speculative decoding method that adjusts the speculation length dynamically and disables speculation when it is not beneficial, improving both throughput and latency.
Findings
Achieves 27.29% higher throughput on average.
Reduces latency by up to 20.18%.
Effectively adapts to dynamic request loads.
Abstract
Speculative decoding (SD) accelerates LLM inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound systems but degrades performance in high-load, compute-bound environments due to verification overhead. Existing speculative decoding methods use fixed lengths and cannot adapt to workload changes or decide when to stop speculation. The cost of restarting speculative inference also remains unquantified. Under high load, the benefit of speculation diminishes, while retaining the draft model reduces KV-cache capacity, limiting batch size and degrading throughput. To overcome this, we propose Nightjar, a resource-aware adaptive speculative framework. It first adjusts to the request load by dynamically selecting the optimal speculative length for different batch sizes. Crucially, Nightjar proactively…
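The trade-off described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm. It is a toy cost model under stated assumptions: each draft token is accepted independently with probability `alpha`, so a step with `k` draft tokens emits (1 - alpha^(k+1)) / (1 - alpha) tokens in expectation (the standard speculative decoding result), and the verification cost is flat while the system is memory-bound but grows linearly with the number of verified tokens once it is compute-bound. All function names and cost constants (`draft_cost`, `verify_base`, `verify_per_token`) are hypothetical.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step with k draft tokens,
    assuming each draft token is accepted independently with probability alpha."""
    if k == 0:
        return 1.0  # plain autoregressive decoding: one token per step
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def step_cost(batch_size: int, k: int, draft_cost: float,
              verify_base: float, verify_per_token: float) -> float:
    """Hypothetical cost of one step: k draft passes plus one target pass.
    The target pass costs verify_base while memory-bound and scales with
    the number of verified tokens once compute-bound."""
    verified_tokens = batch_size * (k + 1)
    target_cost = max(verify_base, verify_per_token * verified_tokens)
    return k * draft_cost + target_cost


def best_speculative_length(batch_size: int, alpha: float, max_k: int = 8,
                            draft_cost: float = 0.1, verify_base: float = 1.0,
                            verify_per_token: float = 0.01) -> int:
    """Pick the k in [0, max_k] that maximizes expected tokens per unit cost.
    Returning 0 means speculation is disabled for this batch size."""
    best_k, best_rate = 0, 0.0
    for k in range(max_k + 1):
        cost = step_cost(batch_size, k, draft_cost, verify_base, verify_per_token)
        rate = expected_tokens_per_step(alpha, k) / cost
        if rate > best_rate:
            best_k, best_rate = k, rate
    return best_k


if __name__ == "__main__":
    for batch_size in (1, 8, 64, 256):
        print(batch_size, best_speculative_length(batch_size, alpha=0.7))
```

Under these made-up constants, a small batch favors a nonzero speculation length, while a large batch selects length 0, which corresponds to turning speculation off. A real system would have to measure costs and acceptance rates online rather than assume them.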
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms
