SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications
Gabriele Oliaro, Zhihao Jia, Daniel Campos, Aurick Qiao

TL;DR
SuffixDecoding introduces an adaptive speculative decoding method that leverages suffix trees to efficiently handle predictable, repetitive sequences in emerging AI workloads, significantly accelerating large language model inference.
Contribution
The paper presents a novel suffix-tree-based speculative decoding approach tailored to the repetitive, predictable workloads of emerging AI applications, and reports that it outperforms existing methods.
Findings
Achieves up to 5.3× speedup on agentic benchmarks
Outperforms state-of-the-art speculative decoding methods
Effectively exploits workload predictability for efficiency
Abstract
Speculative decoding is widely adopted to reduce latency in large language model (LLM) inference by leveraging smaller draft models capable of handling diverse user tasks. However, emerging AI applications, such as LLM-based agents, present unique workload characteristics: instead of diverse independent requests, agentic frameworks typically submit repetitive inference requests, such as multi-agent pipelines performing similar subtasks or self-refinement loops iteratively enhancing outputs. These workloads result in long and highly predictable sequences, which current speculative decoding methods do not effectively exploit. To address this gap, we introduce SuffixDecoding, a novel method that utilizes efficient suffix trees to cache long token sequences from prompts and previous outputs. By adaptively speculating more tokens when acceptance likelihood is high and fewer when it is…
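The idea of caching token sequences and drafting from them can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a depth-bounded suffix trie rather than a true suffix tree, and the class names (`Node`, `SuffixTrie`), the parameters (`max_depth`, `max_match`, `factor`), and the rule that the draft budget grows with the matched suffix length are all assumptions made for illustration. The abstract above is truncated, so the paper's actual adaptive criterion is not shown here.

```python
class Node:
    """One trie node: children keyed by token ID, plus a visit count."""

    def __init__(self):
        self.children = {}
        self.count = 0


class SuffixTrie:
    """Depth-bounded suffix trie over token IDs (hypothetical simplification)."""

    def __init__(self, max_depth=32, max_match=8):
        self.root = Node()
        self.max_depth = max_depth  # longest path stored per suffix
        self.max_match = max_match  # longest context suffix we try to match

    def add(self, tokens):
        # Cache a prompt or a previous output: insert every suffix,
        # truncated to max_depth tokens, and count how often each path occurs.
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + self.max_depth]:
                node = node.children.setdefault(tok, Node())
                node.count += 1

    def _walk(self, pattern):
        # Follow `pattern` from the root; return None if it is not cached.
        node = self.root
        for tok in pattern:
            node = node.children.get(tok)
            if node is None:
                return None
        return node

    def speculate(self, context, factor=2):
        # Find the longest suffix of `context` that has a cached continuation.
        for match_len in range(min(len(context), self.max_match), 0, -1):
            node = self._walk(context[-match_len:])
            if node is not None and node.children:
                break
        else:
            return []  # nothing matched: draft no tokens

        # Assumed heuristic: a longer match suggests a more predictable
        # continuation, so allow a proportionally longer draft.
        budget = factor * match_len
        draft = []
        while len(draft) < budget and node.children:
            # Greedily follow the most frequently seen next token.
            tok, node = max(node.children.items(), key=lambda kv: kv[1].count)
            draft.append(tok)
        return draft


trie = SuffixTrie()
trie.add([1, 2, 3, 4, 5, 6])       # a previously generated sequence
print(trie.speculate([9, 2, 3]))   # two-token match, budget of 4 draft tokens
print(trie.speculate([3]))         # one-token match, budget of 2 draft tokens
```

In a full speculative decoding loop, the drafted tokens would then be checked by the target LLM in a single forward pass, with only the accepted prefix kept; that verification step is omitted from this sketch.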
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
