SAM Decoding: Speculative Decoding via Suffix Automaton
Yuxuan Hu, Ke Wang, Xiaokang Zhang, Fanjin Zhang, Cuiping Li, Hong Chen, Jing Zhang

TL;DR
This paper introduces SAM Decoding, a speculative decoding (SD) method built on a suffix automaton. It accelerates large language model inference by efficiently finding exact longest suffix matches and outperforms existing retrieval-based SD techniques.
Contribution
The paper proposes SAM Decoding, a suffix automaton approach to speculative decoding that makes draft generation faster and more accurate. It can also be integrated with existing methods to extend it to broader domains.
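As a rough illustration of the integration idea, the Python sketch below uses the retrieval draft when the exact suffix match is long enough. Otherwise it falls back to an auxiliary drafter, for example a model-based method such as EAGLE-2. The function name `choose_draft`, the 5-token threshold, and the callable interface are hypothetical assumptions for illustration. They are not the authors' selection rule.

```python
from typing import Callable, List, Sequence, Tuple


def choose_draft(
    match_len: int,
    retrieval_draft: Sequence[int],
    auxiliary_draft: Callable[[], Sequence[int]],
    threshold: int = 5,  # hypothetical cut-off, not taken from the paper
) -> Tuple[str, List[int]]:
    """Pick a draft source from the length of the current exact suffix match.

    A long match suggests the upcoming text is likely to repeat the retrieved
    continuation, so the cheap retrieval draft is used. Otherwise the
    (hypothetical) auxiliary drafter is called instead.
    """
    if match_len >= threshold and len(retrieval_draft) > 0:
        return "retrieval", list(retrieval_draft)
    # The auxiliary drafter is a callable so it only runs when it is needed.
    return "auxiliary", list(auxiliary_draft())


if __name__ == "__main__":
    fallback = lambda: [42]
    print(choose_draft(7, [11, 12, 13], fallback))  # ('retrieval', [11, 12, 13])
    print(choose_draft(2, [11, 12, 13], fallback))  # ('auxiliary', [42])
```

Passing the fallback as a callable means the more expensive drafter runs only when retrieval has not already produced a confident draft.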
Findings
SAM Decoding is 18% faster than other retrieval-based SD methods.
Combining SAM Decoding with EAGLE-2 yields an additional 3.28%-11.13% speedup.
The method finds exact longest suffix matches with O(1) average time complexity per generation step.
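The O(1) figure rests on two textbook properties of suffix automata, treating dictionary operations as constant time. First, online construction costs amortized O(1) per appended token. Second, tracking the longest suffix of a growing stream that occurs in the indexed text also costs amortized O(1) per token. This holds because the match length grows by at most one per step, and each suffix-link jump shrinks it.

The Python sketch below is a minimal, generic suffix automaton over hashable tokens, not the authors' implementation. The class names, the `end_pos` bookkeeping used to read off draft tokens, and the draft length `k` are assumptions for illustration. It also indexes only a single prebuilt text, whereas the paper additionally uses a dynamic text sequence.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, List


@dataclass
class State:
    length: int = 0  # length of the longest string in this state
    link: int = -1   # suffix link (-1 only for the root)
    trans: Dict[Hashable, int] = field(default_factory=dict)  # token -> state
    end_pos: int = -1  # end index of the first occurrence of this state's strings


class SuffixAutomaton:
    """Online suffix automaton; extend() is amortized O(1) per token."""

    def __init__(self) -> None:
        self.states: List[State] = [State()]  # state 0 is the root
        self.last = 0
        self.text: List[Hashable] = []

    def extend(self, token: Hashable) -> None:
        self.text.append(token)
        cur = len(self.states)
        self.states.append(State(length=self.states[self.last].length + 1,
                                 end_pos=len(self.text) - 1))
        p = self.last
        while p != -1 and token not in self.states[p].trans:
            self.states[p].trans[token] = cur
            p = self.states[p].link
        if p == -1:
            self.states[cur].link = 0
        else:
            q = self.states[p].trans[token]
            if self.states[p].length + 1 == self.states[q].length:
                self.states[cur].link = q
            else:
                # Split q: the clone keeps q's transitions and first end position.
                clone = len(self.states)
                self.states.append(State(length=self.states[p].length + 1,
                                         link=self.states[q].link,
                                         trans=dict(self.states[q].trans),
                                         end_pos=self.states[q].end_pos))
                while p != -1 and self.states[p].trans.get(token) == q:
                    self.states[p].trans[token] = clone
                    p = self.states[p].link
                self.states[q].link = clone
                self.states[cur].link = clone
        self.last = cur


class SuffixMatcher:
    """Tracks the longest suffix of a token stream that occurs in the indexed text."""

    def __init__(self, sam: SuffixAutomaton) -> None:
        self.sam = sam
        self.state = 0
        self.length = 0

    def feed(self, token: Hashable) -> int:
        states = self.sam.states
        # Shorten the match via suffix links until the token can extend it.
        while self.state != 0 and token not in states[self.state].trans:
            self.state = states[self.state].link
            self.length = states[self.state].length
        if token in states[self.state].trans:
            self.state = states[self.state].trans[token]
            self.length += 1
        else:  # token never occurs in the indexed text
            self.state, self.length = 0, 0
        return self.length

    def draft(self, k: int) -> List[Hashable]:
        """Return up to k tokens that followed the match in the indexed text."""
        if self.length == 0:
            return []
        end = self.sam.states[self.state].end_pos
        return self.sam.text[end + 1:end + 1 + k]


sam = SuffixAutomaton()
for tok in "the cat sat on the mat".split():
    sam.extend(tok)
matcher = SuffixMatcher(sam)
for tok in ["on", "the"]:
    matcher.feed(tok)
print(matcher.length, matcher.draft(2))  # 2 ['mat']
```

Storing the first end position in each state is enough to draft. Every string in a state occurs ending at that index, so the tokens after it are a valid continuation of the current match.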
Abstract
Speculative decoding (SD) has been demonstrated as an effective technique for lossless LLM inference acceleration. Retrieval-based SD methods, one kind of model-free method, have yielded promising speedups, but they often rely on incomplete retrieval resources and inefficient retrieval methods, and are constrained to certain domains. This paper presents a novel retrieval-based speculative decoding method that adapts the suffix automaton (SAM) for efficient and accurate draft generation by utilizing a common text corpus and a dynamic text sequence. Unlike existing n-gram matching methods, SAM-Decoding finds the exact longest suffix match, achieving an average time complexity of O(1) per generation step for SAM updates and suffix retrieval. It can also be integrated with existing methods, adaptively selecting a draft generation strategy based on match length to generalize to broader domains. Extensive…
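Retrieval-based methods change only how drafts are produced. The lossless property mentioned above comes from the verification step that speculative decoding methods share. The Python sketch below shows that step for greedy decoding only; sampling needs a rejection-sampling rule instead. The names `verify_greedy` and `target_argmax` are hypothetical, with `target_argmax` standing in for one forward pass of the target LLM. The toy model in the usage lines is likewise an assumption.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    prefix: List[int],
    draft: Sequence[int],
    target_argmax: Callable[[List[int]], List[int]],
) -> List[int]:
    """Return the tokens accepted in one greedy speculative decoding step.

    target_argmax(seq)[j] is the target model's greedy prediction for the token
    after seq[:j + 1]; a real system gets all of these from one forward pass.
    Assumes prefix is non-empty.
    """
    seq = list(prefix) + list(draft)
    preds = target_argmax(seq)
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        expected = preds[len(prefix) + i - 1]
        if tok != expected:
            accepted.append(expected)  # replace the first wrong draft token
            return accepted
        accepted.append(tok)
    accepted.append(preds[len(seq) - 1])  # bonus token when every draft token matches
    return accepted


def toy_model(seq: List[int]) -> List[int]:
    """Toy stand-in for a target model: predicts (last token + 1) mod 10."""
    return [(t + 1) % 10 for t in seq]


print(verify_greedy([1, 2], [3, 4, 7], toy_model))  # [3, 4, 5]
print(verify_greedy([1, 2], [3, 4, 5], toy_model))  # [3, 4, 5, 6]
```

Every emitted token equals the target model's own greedy choice, so the output matches plain greedy decoding. A longer correct draft only increases how many tokens each forward pass yields.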
Taxonomy
Topics: Algorithms and Data Compression · Coding theory and cryptography · Semigroups and automata theory
