Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding

Yilong Zhao; Jiaming Tang; Kan Zhu; Zihao Ye; Chi-Chih Chang; Chaofan Lin; Jongseok Park; Guangxuan Xiao; Mohamed S. Abdelfattah; Mingyu Gao; Baris Kasikci; Song Han; Ion Stoica

arXiv:2512.01278·cs.LG·December 2, 2025

TL;DR

SparseSpec is a self-speculative decoding framework with a sparse attention mechanism that accelerates inference for large reasoning models by easing the memory-bandwidth bottleneck, achieving up to 2.13x higher throughput than state-of-the-art methods.

Contribution

The paper presents SparseSpec, a self-speculative decoding method that pairs a new sparse attention mechanism (PillarAttn) with system co-design to enable faster inference for reasoning models.

Findings

01 Achieves up to 2.13x throughput speedup over state-of-the-art methods.

02 Reduces memory bandwidth pressure during long chain-of-thought generations.

03 Demonstrates effectiveness across various models and datasets.

Abstract

Reasoning language models have demonstrated remarkable capabilities on challenging tasks by generating elaborate chain-of-thought (CoT) solutions. However, such lengthy generation shifts inference from compute-bound to memory-bound. To generate each token, the model applies full attention to all previously generated tokens, requiring memory access to an increasingly large KV-Cache. Consequently, longer generations demand more memory access at every step, putting substantial pressure on memory bandwidth. To address this, we introduce SparseSpec, a speculative decoding framework that reuses the same model as both the draft and target model (i.e., self-speculation). SparseSpec features a novel sparse attention mechanism, PillarAttn, as the draft model, which accurately selects critical tokens by reusing information from the verification stage. Furthermore,…
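
The memory-bandwidth argument in the abstract can be made concrete with a rough back-of-the-envelope estimate. The sketch below is illustrative only and is not taken from the paper: the model shape, the FP16 element size, the single-sequence setting, and the 2,048-token sparse budget are all hypothetical values chosen for the arithmetic, and the function name is made up for this example. It shows that the KV-Cache bytes read per decoded token grow linearly with context length under full attention, while a draft step that attends to a fixed token budget reads a bounded amount.

```python
def kv_bytes_read_per_step(attended_tokens, num_layers, num_kv_heads,
                           head_dim, bytes_per_elem=2):
    """Estimate KV-Cache bytes read to decode one token for one sequence.

    The leading factor of 2 accounts for reading both keys and values.
    bytes_per_elem=2 assumes FP16/BF16 storage (an assumption, not a
    detail from the paper).
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * attended_tokens


# Hypothetical model shape, chosen only to make the numbers concrete.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

# Full attention: every previously generated token is read at each step.
full_short = kv_bytes_read_per_step(1_024, **cfg)
full_long = kv_bytes_read_per_step(32_768, **cfg)

# Sparse draft step: only a fixed budget of selected tokens is read.
# The 2,048-token budget is a made-up value for illustration.
sparse_budget = 2_048
sparse_long = kv_bytes_read_per_step(min(32_768, sparse_budget), **cfg)

MIB = 1024 ** 2
print(f"full attention, 1K context:  {full_short / MIB:.0f} MiB per token")
print(f"full attention, 32K context: {full_long / MIB:.0f} MiB per token")
print(f"sparse draft, 32K context:   {sparse_long / MIB:.0f} MiB per token")
```

Under these assumptions, a 32x longer context costs 32x more KV-Cache traffic per full-attention step, whereas the sparse draft step stays at the cost of its budget regardless of how long the generation has become. In a self-speculative scheme, the full-attention cost is still paid during verification, but it is amortized over the several draft tokens checked in one pass.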

Taxonomy

Topics: Big Data and Digital Economy · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)