SSV: Sparse Speculative Verification for Efficient LLM Inference

Zhibin Wang; Ziyu Zhong; Nuo Shen; Yuhang Zhou; Rong Gu; Sheng Zhong

arXiv:2605.19893 · cs.OS · May 21, 2026

TL;DR

SSV is a framework that speeds up long-context large language model inference by combining speculative verification with dynamic sparse attention, reporting up to 3.49x end-to-end throughput improvement over NSA decoding and up to 6.86x kernel speedups for sparse speculative verification.

Contribution

The paper presents SSV, a framework that reshapes dynamic sparse attention into a verification-oriented workload. It improves KV-block reuse across verifier queries, reduces NSA's branch-wise overheads, and selects the verification strategy per prompt.

Findings

01  Up to 3.49x end-to-end throughput improvement over NSA decoding.

02  Up to 6.86x kernel speedups for sparse speculative verification.

03  Effective prompt-adaptive strategy selection under user-specified precision classes.

Abstract

Speculative decoding and dynamic sparse attention are two complementary approaches for accelerating long-context LLM inference: the former amortizes target-model execution across multiple verifier queries, while the latter reduces each query's KV-cache working set. Directly combining them, however, exposes a structural mismatch: speculative verification relies on cross-query commonality, whereas dynamic sparse attention assigns query-specific sparse layouts. This mismatch limits KV-block reuse, amplifies NSA's branch-wise overheads, and makes verification strategy selection input- and regime-dependent. We present SSV, a sparse speculative-verification framework that turns dynamic sparse attention into a verification-oriented workload. SSV combines overlap-aware grouped-query execution, refresh/reuse-based NSA kernel fusion, and profile-guided prompt-adaptive orchestration to improve…
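The reuse mismatch described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's method or kernel; it only assumes that each verifier query selects a set of KV-block indices, and it compares the number of block loads when queries run independently against the number when a group of queries shares one pass over the union of their blocks. The function names and the block selections are hypothetical.

```python
from typing import List, Set, Tuple


def block_loads(selections: List[Set[int]]) -> Tuple[int, int]:
    """Return (independent_loads, grouped_loads) for a group of verifier queries.

    independent_loads: every query fetches its own selected KV blocks.
    grouped_loads: the group fetches each distinct block once (the union).
    """
    independent = sum(len(s) for s in selections)
    grouped = len(set().union(*selections))
    return independent, grouped


def reuse_factor(selections: List[Set[int]]) -> float:
    """Ratio of independent to grouped loads; 1.0 means no cross-query reuse."""
    independent, grouped = block_loads(selections)
    return independent / grouped if grouped else 1.0


# Hypothetical sparse layouts for three verifier queries (block indices).
high_overlap = [{0, 1, 2, 7}, {0, 1, 2, 8}, {0, 1, 3, 8}]
low_overlap = [{0, 1, 2, 3}, {4, 5, 6, 7}, {8, 9, 10, 11}]

print(block_loads(high_overlap), reuse_factor(high_overlap))  # (12, 6) 2.0
print(block_loads(low_overlap), reuse_factor(low_overlap))    # (12, 12) 1.0
```

When the per-query layouts diverge, as in `low_overlap`, grouping the queries saves no block loads, which is the situation the abstract attributes to query-specific sparse layouts.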
