StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

Jaber Jaber; Osama Jaber

arXiv:2605.02568 · cs.LG · May 5, 2026

TL;DR

StreamIndex makes sparse attention over long sequences memory-efficient by streaming the top-k selection, extending the sequence lengths that fit within single-GPU memory with minimal loss in top-k recall.

Contribution

We introduce StreamIndex, a chunked streaming top-k driver that avoids full intermediate materialization, enabling sparse attention on sequences over a million tokens.
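The chunked partition-merge idea can be illustrated with a minimal pure-Python sketch. This is not the paper's Triton kernel: the names `streaming_topk` and `chunked`, the chunk size, and the random scores are all hypothetical, chosen only to show how a top-k can be maintained while holding one chunk of scores at a time.

```python
import heapq
import random

def streaming_topk(score_chunks, k):
    """Return the k largest (score, global_index) pairs from an iterable of chunks.

    Only the running candidates and the current chunk are held at once,
    so the full score vector is never stored by this function.
    """
    best = []    # running top-k candidates as (score, global_index)
    offset = 0   # global index of the first element in the current chunk
    for chunk in score_chunks:
        # Partition step: top-k within this chunk only.
        local = heapq.nlargest(k, ((s, offset + i) for i, s in enumerate(chunk)))
        # Merge step: combine with the running candidates, keep the best k.
        best = heapq.nlargest(k, best + local)
        offset += len(chunk)
    return best

def chunked(scores, chunk_size):
    """Yield consecutive slices of `scores` (hypothetical helper for the demo)."""
    for start in range(0, len(scores), chunk_size):
        yield scores[start:start + chunk_size]

# Demo on synthetic scores: compare against a top-k over the full list.
random.seed(0)
scores = [random.random() for _ in range(10_000)]
k = 16
streamed = streaming_topk(chunked(scores, 512), k)
reference = heapq.nlargest(k, ((s, i) for i, s in enumerate(scores)))
recall = len({i for _, i in streamed} & {i for _, i in reference}) / k
```

Because every global top-k element is also in the top-k of its own chunk, this exact variant matches the full top-k by construction. The recall figures reported in the findings refer to the paper's own design configurations, not to this sketch.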

Findings

01  Runs at sequence length 1,048,576 with 6.21 GB peak HBM, 32x beyond the previous sequence-length limit.

02  Achieves near-perfect recall (≥0.9980) across multiple design configurations.

03  Performs attention computations on 262,144-length sequences in under 2 seconds with 18.56 GB peak memory.

Abstract

DeepSeek-V3.2 and V4 introduce Compressed Sparse Attention (CSA): a lightning indexer (a learned scoring projection over compressed keys) scores them, the top-k are selected per query, and a sparse attention kernel reads only those. Public CSA implementations materialize a [B, S, H_I, T] FP32 score tensor before the top-k reduction. With H_I=64 indexer heads and the V4-Flash compression ratio m=4, that intermediate is 256 GB at sequence length S=65,536, exceeding any single-GPU high-bandwidth-memory (HBM) budget. We present StreamIndex, a Triton implementation of the CSA pipeline whose central component is a chunked partition-merge top-k driver that never materializes the full intermediate. On synthetic-but-realistic V4-shaped inputs at the indexer-step (layer) level on a single NVIDIA H200, the materialize path runs out of memory (OOMs) at S=65,536 with V4-Flash dimensions; StreamIndex…
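The 256 GB figure can be checked with a short calculation. The sketch below assumes batch size B=1 and that the compressed key length is T = S / m; neither is stated explicitly in the excerpt above, and the function name `score_tensor_bytes` is hypothetical.

```python
def score_tensor_bytes(batch, seq_len, indexer_heads, compression_ratio, bytes_per_elem=4):
    """Size in bytes of a dense [B, S, H_I, T] score tensor (FP32 by default)."""
    compressed_len = seq_len // compression_ratio  # assumption: T = S / m
    return batch * seq_len * indexer_heads * compressed_len * bytes_per_elem

# Assumed B=1; S, H_I, and m are the values quoted in the abstract.
size = score_tensor_bytes(batch=1, seq_len=65_536, indexer_heads=64, compression_ratio=4)
size_gib = size / 2**30  # 2^16 * 2^6 * 2^14 * 2^2 = 2^38 bytes
```

Under these assumptions the tensor is 2^38 bytes, which is 256 when expressed in binary gigabytes and matches the abstract's figure. Since both S and T grow with sequence length, the dense intermediate grows quadratically in S.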

Code & Models

Repositories

RightNow-AI/StreamIndex (GitHub)
