Stochastic Sparse Attention for Memory-Bound Inference

Kyle Lee, Corentin Delacour, Kevin Callahan-Coray, Kyle Jiang, Can Yaras, Samet Oymak, Tathagata Srimani, and Kerem Y. Camsari

arXiv:2605.01910·cs.LG·May 5, 2026

TL;DR

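The paper introduces SANTA, a stochastic sparse attention method for long-context autoregressive decoding. It samples a small number of value rows from the post-softmax distribution instead of reading the full value cache, achieving a 1.5x decode-step attention kernel speedup while matching baseline accuracy at 32k-token contexts.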

Contribution

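The paper proposes a stochastic sparse attention technique that gives an unbiased estimate of post-softmax value aggregation, adds stratified sampling for variance reduction, and provides GPU-friendly variants, enabling faster, energy-efficient inference for long-context models.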

Findings

01

Achieves a 1.5x decode-step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada.

02

Maintains baseline accuracy at 32k-token contexts.

03

Proposes Bernoulli $q K^{T}$ sampling as a complementary technique that sparsifies the score stage, reducing key-feature access.

Abstract

Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_{k}$ key and value vectors from the KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access by sampling $S \ll n_{k}$ indices from the post-softmax distribution and aggregates only those value rows. This yields an unbiased estimator of the post-softmax value aggregation while replacing value-stage multiply-accumulates with gather-and-add. We introduce stratified sampling to design variance-reduced, GPU-friendly variants, demonstrating $1.5 \times$ decode-step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada while matching baseline accuracy at 32k-token contexts. Finally, we propose Bernoulli $q K^{T}$ sampling as a complementary technique to sparsify the score stage, reducing…
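
The value-stage idea in the abstract can be illustrated with a minimal NumPy sketch for a single query. This is not the authors' kernel: the function names, the scaled dot-product scoring, and the particular stratification scheme (one uniform draw per equal-width stratum of the softmax CDF) are assumptions made for illustration, and the scores are still computed exactly here. The sketch only shows why gathering and adding $S$ sampled value rows gives an unbiased estimate of the full softmax-weighted sum.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_probs(q, K):
    # Post-softmax attention distribution for one query (scaled dot product).
    return softmax(K @ q / np.sqrt(q.shape[0]))

def exact_attention(q, K, V):
    # Baseline: weighted sum over all n_k value rows (multiply-accumulate).
    return attention_probs(q, K) @ V

def sampled_attention(q, K, V, S, rng):
    # Draw S indices i.i.d. from the post-softmax distribution, then
    # gather and add the selected value rows. Only one final scale by 1/S
    # is needed; no per-row multiplication by attention weights.
    p = attention_probs(q, K)
    idx = rng.choice(len(p), size=S, replace=True, p=p)
    return V[idx].sum(axis=0) / S

def stratified_attention(q, K, V, S, rng):
    # Assumed stratification scheme: split [0, 1) into S equal strata,
    # draw one uniform per stratum, and invert the CDF. The estimator
    # stays unbiased, and the samples are spread more evenly.
    p = attention_probs(q, K)
    cdf = np.cumsum(p)
    u = (np.arange(S) + rng.random(S)) / S
    idx = np.searchsorted(cdf, u, side="right")
    idx = np.minimum(idx, len(p) - 1)  # guard against cdf[-1] < 1 from rounding
    return V[idx].sum(axis=0) / S
```

Averaging either estimator over many independent draws should approach `exact_attention(q, K, V)`, since each index `i` is selected with expected frequency equal to its softmax weight. Each individual call reads only `S` rows of `V` rather than all $n_{k}$.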

Code & Models

Repositories

OPUSLab/SANTA.git (GitHub)