TL;DR
StreamIndex makes sparse attention memory-efficient on long sequences by streaming the top-k selection, which extends the sequence length that fits in GPU memory with minimal loss in top-k recall.
Contribution
We introduce StreamIndex, a chunked streaming top-k driver that avoids full intermediate materialization, enabling sparse attention on sequences over a million tokens.
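The chunked partition-merge idea can be sketched in plain Python. This is an illustration, not the StreamIndex Triton kernel: the names `streaming_topk` and `materialized_topk`, the `chunk` parameter, and the use of `heapq` are all assumptions made for the example. Scores for one query are produced one chunk at a time, each chunk is reduced to its local top-k, and the local winners are merged into a running top-k, so roughly `chunk + k` scores are live at any moment instead of all of them.

```python
import heapq
import random


def streaming_topk(score_fn, num_keys, k, chunk):
    """Return the indices of the k highest-scoring keys, best first.

    score_fn(start, stop) returns the scores for keys [start, stop). It is a
    hypothetical stand-in for the indexer, called one chunk at a time.
    """
    best = []  # running top-k as (score, index) pairs
    for start in range(0, num_keys, chunk):
        stop = min(start + chunk, num_keys)
        scores = score_fn(start, stop)  # only one chunk of scores is live here
        # Partition step: reduce this chunk to its local top-k.
        local = heapq.nlargest(k, zip(scores, range(start, stop)))
        # Merge step: combine the local winners with the running top-k.
        best = heapq.nlargest(k, best + local)
    return [index for _, index in best]


def materialized_topk(scores, k):
    """Reference path: hold every score at once, then select the top-k."""
    return [i for _, i in heapq.nlargest(k, zip(scores, range(len(scores))))]


random.seed(0)
all_scores = [random.random() for _ in range(10_000)]
streamed = streaming_topk(lambda a, b: all_scores[a:b], len(all_scores), k=64, chunk=512)
reference = materialized_topk(all_scores, k=64)
recall = len(set(streamed) & set(reference)) / len(reference)
```

In this toy setting the merge is exact, so the streamed selection matches the materialized one. That says nothing about the recall figures reported for StreamIndex, which depend on its actual design configurations.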
Findings
Runs at sequence length 1,048,576 with 6.21 GB peak HBM, a sequence 32x longer than the previous limit.
Achieves near-perfect recall (≥0.9980) across multiple design configurations.
Performs attention computations on 262,144-length sequences in under 2 seconds with 18.56 GB peak memory.
Abstract
DeepSeek-V3.2 and V4 introduce Compressed Sparse Attention (CSA): a lightning indexer (a learned scoring projection) scores the compressed keys, the top-k are selected per query, and a sparse attention kernel reads only those. Public CSA implementations materialize a [B, S, H_I, T] FP32 score tensor before the top-k reduction. With H_I=64 indexer heads and the V4-Flash compression ratio m=4, that intermediate is 256 GB at sequence length S=65,536, exceeding any single-GPU high-bandwidth-memory (HBM) budget. We present StreamIndex, a Triton implementation of the CSA pipeline whose central component is a chunked partition-merge top-k driver that never materializes the full intermediate. On synthetic-but-realistic V4-shaped inputs at the indexer-step (layer) level on a single NVIDIA H200, the materialize path runs out of memory (OOMs) at S=65,536 with V4-Flash dimensions; StreamIndex…
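The 256 GB figure can be checked from the tensor shape. The sketch below reproduces the arithmetic under two assumptions the abstract does not state: a batch size of B=1 and T = S/m compressed keys per query. The helper name `score_tensor_bytes` is hypothetical, and sizes are in binary gigabytes (2^30 bytes).

```python
def score_tensor_bytes(seq_len, indexer_heads=64, compression=4, batch=1, bytes_per_elem=4):
    """Size in bytes of a [B, S, H_I, T] FP32 score tensor.

    Assumes T = seq_len / compression and batch = 1; neither is stated
    in the abstract.
    """
    compressed_keys = seq_len // compression  # T
    return batch * seq_len * indexer_heads * compressed_keys * bytes_per_elem


GIB = 2**30  # binary gigabyte

# S = 65,536: 2^16 * 2^6 * 2^14 * 2^2 = 2^38 bytes
size_64k = score_tensor_bytes(65_536) / GIB

# S = 1,048,576: the tensor grows quadratically with S, so 16x the
# sequence length costs 256x the memory.
size_1m = score_tensor_bytes(1_048_576) / GIB
```

Under these assumptions `size_64k` is 256 and `size_1m` is 65,536 (about 64 TB), which is why a path that materializes the full score tensor cannot scale to million-token sequences on a single GPU.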
