Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation

Long Cheng; Ritchie Zhao; Timmy Liu; Mindy Li; Xianjie Qiao; Kefeng Duan; Yu-Jung Chen; Xiaoming Chen; Bita Darvish Rouhani; June Yang

arXiv:2604.22312·cs.DC·April 27, 2026

TL;DR

The paper introduces GVR, a data-aware Top-K algorithm that leverages temporal correlation to significantly speed up sparse-attention decoding in large language models on Blackwell hardware.

Contribution

GVR exploits temporal correlation across decode steps to achieve faster, bit-exact Top-K selection in sparse-attention decoders, improving latency and throughput.

Findings

01. GVR achieves a 1.88x speedup over the radix-select kernel.

02. GVR improves end-to-end latency by up to 7.52% at 100K context.

03. GVR maintains bit-exact Top-K outputs while accelerating decoding.

Abstract

Sparse-attention decoders rely on exact Top-K selection to choose the most important key-value entries for each query token. In long-context LLM serving, this Top-K stage runs once per decode query and becomes a meaningful latency bottleneck even when the indexer and attention kernels are already highly optimized. We present Guess-Verify-Refine (GVR), a data-aware exact Top-K algorithm for sparse-attention decoding on NVIDIA Blackwell. GVR exploits temporal correlation across consecutive decode steps: it uses the previous step's Top-K as a prediction signal, computes pre-indexed statistics, narrows to a valid threshold by secant-style counting in 1-2 global passes, verifies candidates with a ballot-free collector, and finishes exact selection in shared memory. We connect this behavior to the Toeplitz / RoPE structure of DeepSeek Sparse Attention (DSA) indexer scores and…
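The guess, verify, refine flow described in the abstract can be illustrated with a minimal CPU sketch in pure Python. This is not the paper's Blackwell kernel: the function names `gvr_topk` and `reference_topk` are hypothetical, the threshold here is simply the smallest current score among the previous step's Top-K indices (a simplification that stands in for the secant-style counting), and ties are assumed to be broken by lower index so the result can be compared exactly against a full sort.

```python
import random

def reference_topk(scores, k):
    """Baseline: full sort by descending score, ties broken by lower index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]

def gvr_topk(scores, k, prev_topk):
    """Exact Top-K that uses the previous step's Top-K indices as a guess.

    Returns (top-k indices, number of candidates that survived the threshold).
    Illustrative simplification, not the paper's algorithm.
    """
    n = len(scores)
    # Guess: keep previous indices that are still valid, without duplicates.
    guess = [i for i in dict.fromkeys(prev_topk) if 0 <= i < n][:k]
    if len(guess) < k:
        # No usable prediction: fall back to the baseline.
        return reference_topk(scores, k), n
    # The minimum over any k elements is a valid threshold: at least k
    # scores are >= it, so every true Top-K entry passes the filter.
    threshold = min(scores[i] for i in guess)
    # Verify: one pass over all scores collects the surviving candidates.
    candidates = [i for i in range(n) if scores[i] >= threshold]
    # Refine: exact selection on the (hopefully small) candidate set.
    candidates.sort(key=lambda i: (-scores[i], i))
    return candidates[:k], len(candidates)

# Usage: simulate two correlated decode steps with small score drift.
rng = random.Random(0)
prev_scores = [rng.random() for _ in range(2000)]
cur_scores = [s + rng.gauss(0, 0.01) for s in prev_scores]
k = 64
topk, n_candidates = gvr_topk(cur_scores, k, reference_topk(prev_scores, k))
print(topk == reference_topk(cur_scores, k), n_candidates)
```

When consecutive steps are strongly correlated, the candidate set stays close to k entries, so the final exact selection touches far less data than a full sort; when the guess is poor, the result is still exact but the candidate set grows toward the full input.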
