Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

Zhuoming Chen; Avner May; Ruslan Svirschevski; Yuhsun Huang; Max Ryabinin; Zhihao Jia; Beidi Chen

arXiv:2402.12374 · cs.CL · July 8, 2025 · 2 cites

TL;DR

Sequoia is a speculative decoding algorithm that speeds up large language model inference by combining three pieces: a dynamic programming search for the token tree structure, a sampling and verification method that stays robust across decoding temperatures, and a hardware-aware tree optimizer.

Contribution

Sequoia introduces a dynamic programming approach for optimal token tree structure, a robust sampling and verification method, and a hardware-aware tree optimizer for scalable speculative decoding.
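
The dynamic programming idea can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on a simplifying assumption introduced here for illustration: the k-th ranked child of any node is accepted with a fixed probability `accept_probs[k]`, given that its parent was accepted. Under that assumption, the expected number of generated tokens for a tree is the sum, over all nodes, of the product of acceptance probabilities along the path from the root. The function name `best_tree_value` and its parameters are hypothetical.

```python
from functools import lru_cache


def best_tree_value(budget, accept_probs):
    """Best expected token count for a speculation tree with `budget` nodes.

    Assumption (illustrative, not taken from the paper's code): the k-th
    child of any node is accepted with probability accept_probs[k], and the
    root always contributes one token.
    """
    if budget < 1:
        raise ValueError("budget must include at least the root node")
    probs = tuple(accept_probs)

    @lru_cache(maxsize=None)
    def tree(n):
        # Best value of a subtree with n nodes: the root counts as one
        # token, and the remaining n - 1 nodes are split among its children.
        return 1.0 + children(n - 1, 0)

    @lru_cache(maxsize=None)
    def children(m, k):
        # Best value of spending exactly m nodes on children ranked k, k+1, ...
        if m == 0:
            return 0.0
        if k == len(probs):
            return float("-inf")  # nodes left over but no child slots remain
        # Give s nodes to the k-th child's subtree, the rest to later children.
        return max(
            probs[k] * tree(s) + children(m - s, k + 1)
            for s in range(1, m + 1)
        )

    return tree(budget)


if __name__ == "__main__":
    # With a single child slot the best tree is a chain: 1 + 0.5 + 0.25.
    print(best_tree_value(3, [0.5]))
    # With two child slots the recursion compares a chain against a wider tree.
    print(best_tree_value(3, [0.6, 0.3]))
```

The recursion is a knapsack-style split of the node budget among ranked children, so the same table can be reused for every budget up to the largest one of interest.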

Findings

01. Up to 4.04x faster decoding on Llama2-7B and Llama2-13B models.
02. Achieves 0.56 s/token inference latency on Llama2-70B with offloading.
03. Outperforms offloading baselines such as DeepSpeed-Zero-Inference and Hugging Face Accelerate.

Abstract

As the usage of large language models (LLMs) grows, performing efficient inference with these models becomes increasingly important. While speculative decoding has recently emerged as a promising direction for speeding up inference, existing methods are limited in their ability to scale to larger speculation budgets, and adapt to different hyperparameters and hardware. This paper introduces Sequoia, a scalable, robust, and hardware-aware algorithm for speculative decoding. To attain better scalability, Sequoia introduces a dynamic programming algorithm to find the optimal tree structure for the speculated tokens. To achieve robust speculative performance, Sequoia uses a novel sampling and verification method that outperforms prior work across different decoding temperatures. Finally, Sequoia introduces a hardware-aware tree optimizer that maximizes speculative performance by…
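
For readers new to the verification step, the following minimal sketch shows the standard single-token acceptance rule from earlier speculative decoding work. It is not Sequoia's tree-based sampling and verification method, which this page does not spell out. The function name `verify_token` and the list-based probability inputs are assumptions made for illustration.

```python
import random


def verify_token(draft_token, target_probs, draft_probs, rng=random):
    """Standard speculative-sampling check for one drafted token.

    Assumptions (illustrative): target_probs and draft_probs are lists of
    probabilities over the same vocabulary, and draft_token was sampled from
    draft_probs, so draft_probs[draft_token] > 0.
    Returns (token, accepted).
    """
    p = target_probs[draft_token]
    q = draft_probs[draft_token]
    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    # On rejection, resample from the residual distribution max(0, p - q).
    # rng.choices normalizes the weights, so no explicit division is needed.
    residual = [max(0.0, t - d) for t, d in zip(target_probs, draft_probs)]
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False


if __name__ == "__main__":
    # Identical distributions: the drafted token is always accepted.
    print(verify_token(0, [0.5, 0.5], [0.5, 0.5]))
    # Target puts no mass on the drafted token: it is always rejected.
    print(verify_token(0, [0.0, 1.0], [1.0, 0.0]))
```

This rule keeps the output distribution equal to the target model's distribution, which is why a rejected draft token is replaced by a sample from the residual rather than from the target distribution directly.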

Code & Models

Repositories

infini-ai-lab/sequoia (PyTorch, official)

Models

InfiniAILab/CodeDrafter-500M (Hugging Face model, 45 downloads, 1 like)

Taxonomy

Topics: Neural Networks and Applications

Methods: SPEED (Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings)