MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen, Vashisth Tiwari, Ruihang Lai, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Tianqi Chen, Beidi Chen

TL;DR
MagicDec demonstrates that speculative decoding can significantly improve latency and throughput for long-context language model inference, even at high batch sizes, by using an intelligent drafting strategy and addressing key bottlenecks.
Contribution
The paper introduces MagicDec, an approach that extends the effectiveness of speculative decoding to high-throughput regimes for long sequences, supported by a theoretical model for selecting the optimal drafting strategy.
Findings
Achieves up to 2.51x speedup on Llama3.1-8B models.
Reduces latency losslessly, that is, without changing the quality of the generated output.
Uses a sparse KV cache to address the bottlenecks that emerge as batch size and sequence length grow.
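To make the KV cache bottleneck concrete, the following is a rough back-of-the-envelope sketch of how KV cache size scales with batch size and sequence length. The architecture values (32 layers, 8 KV heads, head dimension 128, roughly 8e9 parameters, 2 bytes per element) are assumptions about a Llama3.1-8B-style model in 16-bit precision and are not taken from this page; the function names are hypothetical.

```python
# Back-of-the-envelope KV cache sizing. All default values are assumptions
# for a Llama3.1-8B-style model stored in 16-bit precision.

def kv_cache_bytes(batch_size, seq_len, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every token in the batch."""
    # Factor of 2 covers one key tensor and one value tensor per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


def weight_bytes(n_params=8_000_000_000, bytes_per_elem=2):
    """Approximate bytes needed to hold the model weights."""
    return n_params * bytes_per_elem


if __name__ == "__main__":
    gib = 1024 ** 3
    for batch_size, seq_len in [(1, 4096), (32, 4096), (32, 32768)]:
        kv = kv_cache_bytes(batch_size, seq_len)
        ratio = kv / weight_bytes()
        print(f"batch={batch_size:3d} seq_len={seq_len:6d} "
              f"kv_cache={kv / gib:8.2f} GiB  kv/weights={ratio:6.2f}")
```

Under these assumptions, the cache grows linearly in both batch size and sequence length while the weights stay fixed, so at large batch sizes and long contexts, reading the KV cache can dominate the memory traffic of each decoding step.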
Abstract
Large Language Models (LLMs) have become more prevalent in long-context applications such as interactive chatbots, document analysis, and agent workflows, but it is challenging to serve long-context requests with low latency and high throughput. Speculative decoding (SD) is a widely used technique to reduce latency losslessly, but the conventional wisdom suggests that its efficacy is limited to small batch sizes. In MagicDec, we show that surprisingly SD can achieve speedup even for a high throughput inference regime for moderate to long sequences. More interestingly, an intelligent drafting strategy can achieve better speedup with increasing batch size based on our rigorous analysis. MagicDec first identifies the bottleneck shifts with increasing batch size and sequence length, and uses these insights to deploy SD more effectively for high throughput inference. We leverage draft model…
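As a rough illustration of why speculative decoding can yield a speedup, the sketch below uses the standard simplified analysis of speculative decoding, not the paper's own theoretical model. It assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that costs are expressed relative to one ordinary decoding step of the target model. The names `draft_cost` and `verify_cost` are hypothetical parameters introduced here for illustration.

```python
# Simplified speculative decoding speedup estimate. This is a textbook-style
# approximation under an i.i.d. acceptance assumption, not MagicDec's model.

def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per verification step.

    alpha: probability that a single drafted token is accepted (0 <= alpha <= 1)
    gamma: number of tokens drafted before each verification
    """
    if alpha == 1.0:
        # Every drafted token is accepted, plus one token from the target model.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, gamma, draft_cost, verify_cost=1.0):
    """Estimated speedup over plain autoregressive decoding.

    draft_cost:  cost of one draft step relative to one target decoding step
    verify_cost: cost of verifying gamma tokens relative to one target step
                 (assumed close to 1 when decoding is memory-bound)
    """
    step_cost = gamma * draft_cost + verify_cost
    return expected_tokens_per_step(alpha, gamma) / step_cost


if __name__ == "__main__":
    for alpha in (0.6, 0.8, 0.9):
        for gamma in (1, 2, 4):
            s = estimated_speedup(alpha, gamma, draft_cost=0.1)
            print(f"alpha={alpha} gamma={gamma} estimated speedup={s:.2f}")
```

In this simplified view, a speedup requires verification to stay cheap relative to the tokens it confirms and the draft model to stay cheap relative to the target. Both conditions depend on where the bottleneck lies for a given batch size and sequence length.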
Taxonomy
Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
