Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving

Rui Li; Zhaoning Zhang; Libo Zhang; Huaimin Wang; Xiang Fu; Zhiquan Lai

arXiv:2512.22420 · cs.DC · March 4, 2026

TL;DR

Nightjar is a resource-aware adaptive speculative decoding framework that dynamically optimizes LLM inference throughput and latency by adjusting speculation strategies based on workload conditions.

Contribution

Nightjar introduces an adaptive speculative decoding method that adjusts the speculation length at run time and disables speculation when it is not beneficial, improving both throughput and latency.

Findings

01. Achieves 27.29% higher throughput on average.
02. Reduces latency by up to 20.18%.
03. Effectively adapts to dynamic request loads.

Abstract

Speculative decoding (SD) accelerates LLM inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound systems but degrades performance in high-load, compute-bound environments due to verification overhead. Existing speculative decoding methods use fixed lengths and cannot adapt to workload changes or decide when to stop speculation. The cost of restarting speculative inference also remains unquantified. Under high load, the benefit of speculation diminishes, while retaining the draft model reduces KV-cache capacity, limiting batch size and degrading throughput. To overcome this, we propose Nightjar, a resource-aware adaptive speculative framework. It first adjusts to the request load by dynamically selecting the optimal speculative length for different batch sizes. Crucially, Nightjar proactively…
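The trade-off described in the abstract can be illustrated with a toy cost model. The sketch below is not Nightjar's algorithm: it assumes each draft token is accepted independently with probability `alpha`, models the target model's step time as the larger of a memory-bound floor and a compute term that grows with batch size times the number of verified tokens, and picks the speculative length that maximizes modeled throughput. All function names and constants (`t_mem`, `t_tok`, `t_draft`) are hypothetical, and the KV-cache cost of keeping the draft model resident is ignored.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step with k draft tokens,
    assuming independent per-token acceptance probability alpha."""
    if k == 0:
        return 1.0  # no speculation: one token per target step
    if alpha >= 1.0:
        return float(k + 1)
    # Geometric sum 1 + alpha + ... + alpha**k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def step_time(batch_size: int, k: int,
              t_mem: float = 10.0, t_tok: float = 0.05,
              t_draft: float = 1.0) -> float:
    """Toy roofline (hypothetical constants, arbitrary time units).

    The target pass costs the larger of a memory-bound floor (t_mem) and a
    compute term proportional to the tokens verified across the batch.
    Each draft token adds a fixed draft-model cost.
    """
    verify = max(t_mem, t_tok * batch_size * (k + 1))
    return verify + k * t_draft


def best_spec_length(batch_size: int, alpha: float = 0.7, max_k: int = 5) -> int:
    """Return the speculative length in [0, max_k] with the highest modeled
    throughput; 0 means speculation is disabled."""
    def throughput(k: int) -> float:
        return batch_size * expected_tokens(alpha, k) / step_time(batch_size, k)
    return max(range(max_k + 1), key=throughput)


if __name__ == "__main__":
    for b in (1, 8, 64, 512):
        print(f"batch={b:4d} -> speculative length {best_spec_length(b)}")
```

Under these assumed constants, small batches stay on the memory-bound floor, so extra verified tokens are nearly free and a positive speculative length wins. At large batch sizes the compute term dominates, verification cost scales with the speculative length, and the model selects length 0, which corresponds to switching speculation off.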

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms