SSV: Sparse Speculative Verification for Efficient LLM Inference
Zhibin Wang, Ziyu Zhong, Nuo Shen, Yuhang Zhou, Rong Gu, Sheng Zhong

TL;DR
SSV is a framework that speeds up long-context LLM inference by combining speculative verification with dynamic sparse attention, reporting up to 3.49x end-to-end throughput over NSA decoding and up to 6.86x kernel speedups.
Contribution
The paper presents SSV, a framework that adapts dynamic sparse attention to speculative verification, improving cross-query KV-block reuse and reducing NSA's branch-wise overheads in LLM inference.
Findings
Up to 3.49x end-to-end throughput improvement over NSA decoding.
Up to 6.86x kernel speedups for sparse speculative verification.
Effective prompt-adaptive strategy selection under user-specified precision classes.
Abstract
Speculative decoding and dynamic sparse attention are two complementary approaches for accelerating long-context LLM inference: the former amortizes target-model execution across multiple verifier queries, while the latter reduces each query's KV-cache working set. Directly combining them, however, exposes a structural mismatch: speculative verification relies on cross-query commonality, whereas dynamic sparse attention assigns query-specific sparse layouts. This mismatch limits KV-block reuse, amplifies NSA's branch-wise overheads, and makes verification strategy selection input- and regime-dependent. We present SSV, a sparse speculative-verification framework that turns dynamic sparse attention into a verification-oriented workload. SSV combines overlap-aware grouped-query execution, refresh/reuse-based NSA kernel fusion, and profile-guided prompt-adaptive orchestration to improve…
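The structural mismatch described in the abstract can be made concrete with a toy count of KV-block loads. The sketch below is an illustration only, not SSV's kernel or scheduling logic: the function name `kv_block_loads`, the representation of a sparse layout as a set of block indices, and the example layouts are all assumptions made here. It compares loading each verifier query's selected blocks independently against loading the union of blocks once for the whole group.

```python
from typing import Dict, List, Set


def kv_block_loads(selected: List[Set[int]]) -> Dict[str, float]:
    """Count KV-block loads for a group of verifier queries.

    `selected[i]` is the (hypothetical) set of KV-block indices that the
    sparse attention layout assigns to verifier query i.
    """
    # Each query fetches its own blocks with no sharing.
    per_query = sum(len(blocks) for blocks in selected)
    # Grouped execution fetches every distinct block once.
    grouped = len(set().union(*selected))
    # Reuse factor: how many per-query loads one grouped load replaces.
    reuse = per_query / grouped if grouped else 1.0
    return {
        "per_query_loads": per_query,
        "grouped_loads": grouped,
        "reuse_factor": reuse,
    }


# Four verifier queries with identical layouts: full cross-query reuse.
aligned = [{0, 1, 2, 3}] * 4
# Four verifier queries whose layouts share only block 0: little reuse.
divergent = [{0, 1, 2, 3}, {0, 4, 5, 6}, {0, 7, 8, 9}, {0, 10, 11, 12}]

aligned_stats = kv_block_loads(aligned)
divergent_stats = kv_block_loads(divergent)
print(aligned_stats)
print(divergent_stats)
```

With identical layouts the group needs 4 block loads instead of 16, while the divergent layouts still need 13. Under these assumptions, the less the query-specific layouts overlap, the less speculative verification can amortize KV-cache traffic across queries.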
