STree: Speculative Tree Decoding for Hybrid State-Space Models

Yangchao Wu; Zongyue Qin; Alex Wong; Stefano Soatto

arXiv:2505.14969·cs.LG·October 29, 2025

TL;DR

This paper introduces STree, a scalable tree-based speculative decoding algorithm for state-space models (SSMs) and hybrid SSM-Transformer architectures. It outperforms vanilla speculative decoding on three benchmarks.

Contribution

The paper presents the first scalable tree-based speculative decoding algorithm tailored to SSMs and hybrid models, exploiting matrix structure to keep overhead minimal.

Findings

1. Outperforms vanilla speculative decoding on three benchmarks.
2. Enables efficient tree-based decoding in hybrid SSM-Transformer models.
3. Provides a hardware-aware implementation for practical deployment.

Abstract

Speculative decoding leverages hardware concurrency to perform multiple steps of token generation in a single forward pass, improving the efficiency of large-scale autoregressive (AR) Transformer models. State-space models (SSMs) are already more efficient than AR Transformers, since their state summarizes all past data with no need to cache or re-process tokens in the sliding window context. However, their state can also comprise thousands of tokens, so speculative decoding has recently been extended to SSMs. Existing approaches do not use tree-based verification methods, since current SSMs lack the means to compute a token tree efficiently. We propose the first scalable algorithm to perform tree-based speculative decoding in SSMs and in hybrid architectures of SSM and Transformer layers. We exploit the structure of…
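To make the idea of tree-based verification concrete, here is a minimal sketch of greedy acceptance over a draft token tree. It is not the paper's algorithm and says nothing about how an SSM computes states for tree nodes; the function `verify_token_tree`, the `ROOT` marker, and the toy target model `toy_target` are all hypothetical names introduced for illustration. A real system would score every tree node in one batched forward pass; the sketch stands in for that pass by querying the toy model once per node.

```python
from typing import Callable, Dict, List, Sequence

ROOT = -1  # hypothetical marker: the parent of nodes that directly follow the prefix


def verify_token_tree(
    prefix: List[int],
    tokens: Sequence[int],
    parents: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Greedy verification of a draft token tree (illustrative sketch only).

    tokens[i] is the draft token at node i; parents[i] is the index of its
    parent node, or ROOT. Returns the tokens the target model agrees with
    along one root-to-node path, plus the target's own token at the point
    where the draft stops matching.
    """
    children: Dict[int, List[int]] = {}
    for node, parent in enumerate(parents):
        children.setdefault(parent, []).append(node)

    def path_to(node: int) -> List[int]:
        # Collect draft tokens from the root down to `node`.
        path: List[int] = []
        while node != ROOT:
            path.append(tokens[node])
            node = parents[node]
        return path[::-1]

    # Stand-in for the single verification pass: the target's greedy choice
    # after the prefix and after every node of the tree.
    pred = {ROOT: target_next(prefix)}
    for node in range(len(tokens)):
        pred[node] = target_next(prefix + path_to(node))

    accepted: List[int] = []
    node = ROOT
    while True:
        want = pred[node]
        accepted.append(want)  # the target's token is always kept
        match = next((c for c in children.get(node, []) if tokens[c] == want), None)
        if match is None:
            return accepted  # no draft child agrees; stop here
        node = match


def toy_target(context: List[int]) -> int:
    # Assumed toy "target model": next token is the context sum modulo 10.
    return sum(context) % 10


prefix = [1, 2]
# Draft tree: two candidates after the prefix, two after node 0, one after node 2.
tree_result = verify_token_tree(prefix, [3, 4, 6, 7, 2], [ROOT, ROOT, 0, 0, 2], toy_target)
# Draft chain (vanilla speculative decoding is the single-branch special case).
chain_result = verify_token_tree(prefix, [4, 6], [ROOT, 0], toy_target)
print(tree_result)   # [3, 6, 2, 4]
print(chain_result)  # [3]
```

In this toy run the chain's first guess is wrong, so only the target's own token is produced, while the tree contains a matching branch and yields four tokens from the same verification step. That gap is the motivation for tree-based drafts.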

Code & Models

Models

Hugging Face: ycwu97/mamba2-distilled-small (model · 7 downloads)

Taxonomy

Topics: Advanced Database Systems and Queries

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Multi-Head Attention · Dense Connections · Layer Normalization · Softmax