Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions

Bangsheng Tang; Carl Chengyan Fu; Fei Kou; Grigory Sizov; Haoci Zhang; Jason Park; Jiawen Liu; Jie You; Qirui Yang; Sachin Mehta; Shengyong Cai; Xiaodong Wang; Xingyu Liu; Yunlu Li; Yanjun Zhou; Wei Wei; Zhiwei Zhao; Zixi Qi; Adolfo Victoria; Aya Ibrahim; Bram Wasti; Changkyu Kim; Daniel Haziza; Fei Sun; Giancarlo Delfin; Emily Guo; Jialin Ouyang; Jaewon Lee; Jianyu Huang; Jeremy Reizenstein; Lu Fang; Quinn Zhu; Ria Verma; Vlad Mihailescu; Xingwen Guo; Yan Cui; Ye Hu; Yejin Lee

arXiv:2508.08192 · cs.CL · August 12, 2025

TL;DR

This paper presents training and inference optimizations for EAGLE-based speculative decoding of Llama models. They address the engineering challenges of production deployment on GPUs and yield state-of-the-art inference latency, along with speed-ups at large batch sizes.

Contribution

The paper introduces training and inference optimizations that enable EAGLE-based speculative decoding at production scale for Llama models, including efficient GPU implementations of operations such as tree attention and multi-round speculative decoding.

Findings

01 Llama4 Maverick decodes at about 4 ms per token (batch size one) on 8 NVIDIA H100 GPUs

02 This is 10% faster than the previously best known method

03 For EAGLE-based speculative decoding, the optimizations enable a 1.4x to 2.0x speed-up at large batch sizes

Abstract

Speculative decoding is a standard method for accelerating the inference speed of large language models. However, scaling it for production environments poses several engineering challenges, including efficiently implementing different operations (e.g., tree attention and multi-round speculative decoding) on GPU. In this paper, we detail the training and inference optimization techniques that we have implemented to enable EAGLE-based speculative decoding at a production scale for Llama models. With these changes, we achieve a new state-of-the-art inference latency for Llama models. For example, Llama4 Maverick decodes at a speed of about 4 ms per token (with a batch size of one) on 8 NVIDIA H100 GPUs, which is 10% faster than the previously best known method. Furthermore, for EAGLE-based speculative decoding, our optimizations enable us to achieve a speed-up for large batch sizes…
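For readers unfamiliar with the technique, the draft-then-verify loop behind speculative decoding can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses greedy acceptance on a single draft chain rather than EAGLE's draft head or tree attention, and the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical stand-ins for real models.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: a model is a function returning its greedy next token.
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(target: NextTokenFn, draft: NextTokenFn,
                     prefix: List[Token], k: int) -> List[Token]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1. Draft: the cheap model proposes k tokens autoregressively.
    proposed: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. Verify: compare each proposal with the target's greedy choice.
    #    A real system scores all k positions in one batched forward pass;
    #    this sketch loops for clarity.
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and stop this round.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target contributes one bonus token.
    accepted.append(target(ctx))
    return accepted


def generate(target: NextTokenFn, draft: NextTokenFn, prompt: List[Token],
             max_new_tokens: int, k: int = 4) -> List[Token]:
    """Repeat speculative rounds until max_new_tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new_tokens]


def toy_target(ctx: List[Token]) -> Token:
    """Toy 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Toy 'small' model: agrees with the target except right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

With greedy acceptance, the output is identical to decoding with the target model alone; the draft only changes how many target passes are needed. Each round emits between 1 and k + 1 tokens, which is where the latency reduction comes from when the draft is accurate.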

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis