Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference

Hoang Anh Duy Le (1); Sahil Joshi (1); Zeyu Yang (1); Zhaozhuo Xu (2); Anshumali Shrivastava (1) ((1) Rice University; (2) Stevens Institute of Technology)

arXiv:2602.07397 · cs.LG · February 10, 2026

TL;DR

Sketch&Walk Attention is a training-free sparse attention method that uses lightweight sketches and deterministic walks to efficiently approximate attention, reducing computation and memory costs in long-context LLM inference while maintaining high accuracy.
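The abstract below identifies the sketch as Hadamard sketching. The snippet is a minimal sketch that assumes a standard subsampled randomized Hadamard transform (SRHT): random sign flips, an orthonormal Walsh-Hadamard mix, and sampling of m of the d coordinates, rescaled so that sketched dot products are unbiased estimates of the true query-key logits. The sketch dimension, whether it is applied per token or per block, and the helper names `fwht`, `make_srht`, and `sketched_logits` are hypothetical. They illustrate the general idea and are not the paper's construction.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    v = list(vec)
    n = len(v)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly step: combine pairs of entries that are h apart.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal (and self-inverse)
    return [x * scale for x in v]


def make_srht(d, m, seed=0):
    """Hypothetical SRHT sketch R^d -> R^m: random signs, Hadamard mix, row sampling."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
    rows = rng.sample(range(d), m)  # m coordinates kept after mixing
    factor = math.sqrt(d / m)       # rescale so sketched dot products are unbiased

    def sketch(x):
        mixed = fwht([s * xi for s, xi in zip(signs, x)])
        return [factor * mixed[r] for r in rows]

    return sketch


def sketched_logits(query, keys, sketch):
    """Estimate q.k for every key from m-dimensional sketches instead of d-dimensional vectors."""
    sq = sketch(query)
    # A real kernel would sketch each key once and cache it; this toy loop re-sketches.
    return [sum(a * b for a, b in zip(sq, sketch(k))) for k in keys]
```

With m equal to d the sketch reproduces the exact logits up to floating-point error; shrinking m trades estimate accuracy for lower cost, which is the kind of cheap approximation the method relies on when scoring candidate blocks.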

Contribution

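We propose Sketch&Walk Attention, a novel training-free sparse attention technique that dynamically selects attention blocks by combining lightweight Hadamard sketches of attention scores with a walk that aggregates those estimates across layers. The same algorithm applies to both the prefill and decode phases.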
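Dynamic block selection can be illustrated with a small sketch. It assumes that token-level scores are max-pooled into square blocks and that each query block keeps its top-k key blocks, with k set by the target attention density. For example, 20% of 10 key blocks keeps 2. The pooling rule, the block size, and the helpers `pool_to_blocks` and `select_blocks` are hypothetical. Causal masking and the custom sparse kernels that would consume the selected indices are omitted.

```python
import heapq
import math


def pool_to_blocks(scores, block):
    """Max-pool an (n_q x n_k) token-level score matrix into block-level scores."""
    n_q, n_k = len(scores), len(scores[0])
    nbq, nbk = math.ceil(n_q / block), math.ceil(n_k / block)
    out = [[-math.inf] * nbk for _ in range(nbq)]
    for i in range(n_q):
        for j in range(n_k):
            bi, bj = i // block, j // block
            out[bi][bj] = max(out[bi][bj], scores[i][j])
    return out


def select_blocks(block_scores, density):
    """For each query block, return sorted indices of its top-k key blocks."""
    n_kb = len(block_scores[0])
    # The small epsilon stops float round-up (0.7 * 10 == 7.000000000000001)
    # from keeping one extra block.
    k = max(1, min(n_kb, math.ceil(density * n_kb - 1e-9)))
    return [sorted(heapq.nlargest(k, range(n_kb), key=row.__getitem__))
            for row in block_scores]
```

Only the selected (query block, key block) pairs would then be computed exactly, so the fraction of pairs kept is the attention density the findings refer to.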

Findings

01. Maintains near-lossless accuracy at 20% attention density
02. Achieves up to 6x inference speedup
03. Outperforms dense attention in some settings

Abstract

Self-attention dominates the computational and memory cost of long-context LLM inference in both the prefill and decode phases. To address this challenge, we introduce Sketch&Walk Attention, a training-free sparse attention method that determines sparsity with lightweight sketches and a deterministic walk. Sketch&Walk applies Hadamard sketching to obtain inexpensive approximations of attention scores, then aggregates these estimates across layers via a walk mechanism that captures attention influence beyond direct interactions between tokens. The accumulated walk scores are used to select the top-k attention blocks, enabling dynamic sparsity through a single training-free algorithm that applies uniformly to both the prefill and decode phases and is paired with custom sparse attention kernels. Across a wide range of models and tasks, Sketch&Walk maintains near-lossless accuracy at 20% attention density…
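The exact walk update is not given in this summary. As a stand-in, the sketch below uses an attention-rollout-style product. Rollout is a known technique from interpretability work, swapped in here, and may not match the paper's walk. Each layer's approximate logits are softmax-normalized, mixed with the identity to account for the residual path, and multiplied into a running matrix, so each entry reflects multi-hop influence through intermediate tokens. The residual weight of 0.5 and the names `softmax_rows`, `matmul`, and `walk_scores` are assumptions.

```python
import math


def softmax_rows(logits):
    """Row-wise softmax; -inf entries (e.g. a causal mask) become exact zeros."""
    out = []
    for row in logits:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def matmul(a, b):
    """Plain list-of-lists matrix product a @ b."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def walk_scores(layer_logits, residual=0.5):
    """Rollout-style walk: W = M_L @ ... @ M_1, M_l = residual*I + (1-residual)*softmax(logits_l)."""
    n = len(layer_logits[0])
    walk = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for logits in layer_logits:
        attn = softmax_rows(logits)
        step = [[residual * (1.0 if i == j else 0.0) + (1.0 - residual) * attn[i][j]
                 for j in range(n)] for i in range(n)]
        walk = matmul(step, walk)  # compose this layer's hop with all earlier hops
    return walk
```

Because every step is row-stochastic, each row of the result still sums to 1. For instance, with two layers of causal logits [[0, -inf], [1, 0]], token 1's walk weight on token 0 rises from about 0.37 after one layer to about 0.60 after two. This illustrates indirect influence accumulating across layers. Pooling such scores into blocks would then feed top-k block selection.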
