Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen

TL;DR
Sequoia is a speculative decoding algorithm that speeds up large language model inference by combining three components: a dynamic programming search for the optimal tree of speculated tokens, a sampling and verification method that stays robust across decoding temperatures, and a hardware-aware tree optimizer.
Contribution
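Sequoia introduces a dynamic programming algorithm that finds the optimal tree structure for the speculated tokens, a sampling and verification method that remains robust across decoding temperatures, and a hardware-aware tree optimizer, which together make speculative decoding scale to larger speculation budgets.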
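The dynamic programming idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes, for this sketch only, that the k-th ranked child of any node is accepted with a fixed probability accept_probs[k] (sorted in decreasing order), and that a node's score is the product of these probabilities along its path from the root, with the root scoring 1. The function name best_tree_value and the probabilities used below are hypothetical.

```python
from functools import lru_cache


def best_tree_value(n_nodes, accept_probs):
    """Best total score of a token tree with exactly n_nodes nodes (root included).

    Assumption (illustrative only): accept_probs[k] is the acceptance
    probability of the k-th ranked child of any node, sorted in decreasing
    order, so children are always filled in rank order.
    """
    probs = tuple(accept_probs)
    max_children = len(probs)

    @lru_cache(maxsize=None)
    def tree(n):
        # The root always contributes 1; the other n - 1 nodes are split
        # among its children.
        return 1.0 + forest(n - 1, 0)

    @lru_cache(maxsize=None)
    def forest(m, k):
        # Best score when m nodes are distributed over children k, k+1, ...
        if m == 0:
            return 0.0
        if k == max_children:
            return float("-inf")  # nodes left over but no child slot remains
        best = float("-inf")
        for size in range(1, m + 1):
            # Give `size` nodes to the subtree under child k, the rest to later children.
            candidate = probs[k] * tree(size) + forest(m - size, k + 1)
            best = max(best, candidate)
        return best

    return tree(n_nodes)
```

With accept_probs = [0.5, 0.3], for example, a three-node budget is better spent on two children of the root (score 1.8) than on a chain of depth two (score 1.75), and the recursion picks the former.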
Findings
Up to 4.04x faster decoding on Llama2-7B and Llama2-13B models.
Achieves 0.56 s/token inference latency on Llama2-70B with offloading.
In the offloading setting, outperforms inference systems such as DeepSpeed-Zero-Inference and Huggingface Accelerate.
Abstract
As the usage of large language models (LLMs) grows, performing efficient inference with these models becomes increasingly important. While speculative decoding has recently emerged as a promising direction for speeding up inference, existing methods are limited in their ability to scale to larger speculation budgets and to adapt to different hyperparameters and hardware. This paper introduces Sequoia, a scalable, robust, and hardware-aware algorithm for speculative decoding. To attain better scalability, Sequoia introduces a dynamic programming algorithm to find the optimal tree structure for the speculated tokens. To achieve robust speculative performance, Sequoia uses a novel sampling and verification method that outperforms prior work across different decoding temperatures. Finally, Sequoia introduces a hardware-aware tree optimizer that maximizes speculative performance by…
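For context, the verification step that speculative decoding builds on can be sketched as follows. This is the standard single-token speculative sampling rule from earlier work, not Sequoia's own sampling and verification method, whose details are not given in this summary. The function name verify_token and the toy distributions are hypothetical.

```python
import random


def verify_token(token, target_probs, draft_probs, rng=random):
    """Standard speculative-sampling check for one drafted token.

    target_probs and draft_probs are probability lists over the same
    vocabulary; `token` is assumed to have been sampled from draft_probs,
    so draft_probs[token] > 0.
    Returns (emitted_token, accepted).
    """
    p = target_probs[token]
    q = draft_probs[token]
    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return token, True
    # On rejection, resample from the residual distribution max(0, p - q);
    # rng.choices normalizes the weights. This keeps the emitted token
    # distributed exactly according to target_probs.
    residual = [max(0.0, pt - qd) for pt, qd in zip(target_probs, draft_probs)]
    new_token = rng.choices(range(len(residual)), weights=residual)[0]
    return new_token, False
```

When the draft and target distributions agree, every drafted token is accepted; the further they diverge, the more often the residual resampling path is taken.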
Taxonomy
Topics: Neural Networks and Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
