SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications

Gabriele Oliaro; Zhihao Jia; Daniel Campos; Aurick Qiao

arXiv:2411.04975 · cs.CL · October 9, 2025

TL;DR

SuffixDecoding introduces an adaptive speculative decoding method that leverages suffix trees to efficiently handle predictable, repetitive sequences in emerging AI workloads, significantly accelerating large language model inference.

Contribution

The paper presents a suffix tree-based speculative decoding approach tailored to the repetitive, predictable workloads of emerging AI applications such as LLM-based agents, and reports that it outperforms existing speculative decoding methods.

Findings

01 Achieves up to 5.3× speedup on agentic benchmarks

02 Outperforms state-of-the-art speculative decoding methods

03 Effectively exploits workload predictability for efficiency

Abstract

Speculative decoding is widely adopted to reduce latency in large language model (LLM) inference by leveraging smaller draft models capable of handling diverse user tasks. However, emerging AI applications, such as LLM-based agents, present unique workload characteristics: instead of diverse independent requests, agentic frameworks typically submit repetitive inference requests, such as multi-agent pipelines performing similar subtasks or self-refinement loops iteratively enhancing outputs. These workloads result in long and highly predictable sequences, which current speculative decoding methods do not effectively exploit. To address this gap, we introduce SuffixDecoding, a novel method that utilizes efficient suffix trees to cache long token sequences from prompts and previous outputs. By adaptively speculating more tokens when acceptance likelihood is high and fewer when it is…
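The idea of caching earlier token sequences and speculating adaptively can be illustrated with a small sketch. This is not the authors' implementation: it uses a plain depth-limited suffix trie instead of an efficient suffix tree, and the names `SuffixTrie`, `max_depth`, `max_tokens`, and `min_prob`, as well as the frequency-based acceptance estimate, are assumptions made for illustration only.

```python
class Node:
    """One trie node: children keyed by token id, plus a visit count."""

    def __init__(self):
        self.children = {}
        self.count = 0


class SuffixTrie:
    """Depth-limited suffix trie over token sequences (illustrative only)."""

    def __init__(self, max_depth=8):
        self.root = Node()
        self.max_depth = max_depth

    def add(self, tokens):
        # Insert every suffix of `tokens`, truncated to max_depth tokens.
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + self.max_depth]:
                node = node.children.setdefault(tok, Node())
                node.count += 1

    def _match(self, seq):
        # Return the node for the longest suffix of `seq` that is in the
        # trie and still has at least one continuation, or None.
        for length in range(min(len(seq), self.max_depth - 1), 0, -1):
            node = self.root
            for tok in seq[-length:]:
                node = node.children.get(tok)
                if node is None:
                    break
            else:
                if node.children:
                    return node
        return None

    def speculate(self, context, max_tokens=16, min_prob=0.5):
        # Greedily extend the draft with the most frequent continuation.
        # The running product of empirical frequencies is an assumed
        # stand-in for acceptance likelihood: drafting stops once it
        # drops below min_prob, so predictable contexts get longer drafts.
        seq = list(context)
        draft, prob = [], 1.0
        while len(draft) < max_tokens:
            node = self._match(seq)
            if node is None:
                break
            tok, child = max(node.children.items(), key=lambda kv: kv[1].count)
            total = sum(c.count for c in node.children.values())
            prob *= child.count / total
            if prob < min_prob:
                break
            draft.append(tok)
            seq.append(tok)
        return draft


# Usage: cache one earlier output, then draft a continuation for a new context.
trie = SuffixTrie(max_depth=4)
trie.add([1, 2, 3, 4, 5, 6])
draft = trie.speculate([9, 1, 2])
```

In a full speculative decoding loop the drafted tokens would then be verified by the target LLM in a single forward pass; that step is omitted here. When the cached history is ambiguous (for example, `[1, 2]` was followed by `3` once and by `4` once), the estimated likelihood falls quickly and the sketch returns a short or empty draft.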

Code & Models

Repositories

snowflakedb/arcticinference (PyTorch, official)

Models

nielsr/test-model-v3 (Hugging Face model)

Videos

SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis