MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

Ranajoy Sadhukhan; Jian Chen; Zhuoming Chen; Vashisth Tiwari; Ruihang Lai; Jinyuan Shi; Ian En-Hsu Yen; Avner May; Tianqi Chen; Beidi Chen

arXiv:2408.11049 · cs.CL · April 3, 2025

TL;DR

MagicDec demonstrates that speculative decoding can significantly improve latency and throughput for long-context language model inference, even at high batch sizes, by using an intelligent drafting strategy and addressing key bottlenecks.

Contribution

The paper introduces MagicDec, a novel approach that extends speculative decoding effectiveness to high throughput regimes for long sequences, with a theoretical model for optimal strategy selection.

Findings

01

Achieves up to 2.51x speedup on Llama3.1-8B models.

02

Effectively reduces latency without sacrificing accuracy.

03

Utilizes sparse KV cache to address bottlenecks.

Abstract

Large Language Models (LLMs) have become more prevalent in long-context applications such as interactive chatbots, document analysis, and agent workflows, but it is challenging to serve long-context requests with low latency and high throughput. Speculative decoding (SD) is a widely used technique to reduce latency losslessly, but the conventional wisdom suggests that its efficacy is limited to small batch sizes. In MagicDec, we show that surprisingly SD can achieve speedup even for a high throughput inference regime for moderate to long sequences. More interestingly, an intelligent drafting strategy can achieve better speedup with increasing batch size based on our rigorous analysis. MagicDec first identifies the bottleneck shifts with increasing batch size and sequence length, and uses these insights to deploy SD more effectively for high throughput inference. We leverage draft model…
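
The abstract describes speculative decoding as lossless. A minimal sketch of why that holds in the greedy case is given below. It is a generic illustration, not MagicDec's implementation: the names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical, the "models" are toy integer functions, and the sparse-KV drafting and batching discussed in the paper are not modeled.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function (assumed interface)


def greedy_decode(target: NextToken, prompt: List[Token], num_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[Token],
                       num_new: int, gamma: int = 4) -> List[Token]:
    """Greedy speculative decoding with a draft length of `gamma` tokens."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    while len(tokens) < goal:
        # 1. The cheap draft model proposes gamma tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(gamma):
            proposal.append(draft(tokens + proposal))

        # 2. The target verifies the proposal. A real system scores all
        #    positions in one forward pass; this sketch loops for clarity.
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)       # draft token matches the target
            else:
                accepted.append(expected)  # replace first mismatch, stop
                break
        else:
            # Every drafted token matched, so the target adds one more token.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:goal]


# Hypothetical toy models: the draft agrees with the target except when
# the context length is a multiple of 3.
def toy_target(ctx: List[Token]) -> Token:
    return (31 * sum(ctx) + len(ctx)) % 50


def toy_draft(ctx: List[Token]) -> Token:
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 50
```

Every token that enters the output is either a drafted token the target agrees with or the target's own choice, so the result matches plain greedy decoding regardless of how good the draft is. Draft quality only affects how many tokens are accepted per verification step, which is where any speedup comes from.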

Code & Models

Repositories

infini-ai-lab/magicdec
PyTorch · Official

Videos

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding · SlidesLive

Taxonomy

Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications