LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models
Haocheng Xi, Harman Singh, Yuezhou Hu, Coleman Hooper, Rishabh Tiwari, Aditya Tomar, Minjae Lee, Wonjun Kang, Michael Mahoney, Chenfeng Xu, Kurt Keutzer, Amir Gholami

TL;DR
LoSA introduces a locality-aware sparse attention mechanism that reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens, improving both efficiency and accuracy in block-wise diffusion language models.
Contribution
Proposes a sparse attention method that mitigates KV Inflation by reusing cached prefix-attention results for stable tokens, improving speed and accuracy in long-context DLMs.
Findings
Achieves up to 9 points higher accuracy at high sparsity levels
Provides up to 4.14x attention speedup on GPUs
Maintains near-dense accuracy with reduced attention density
Abstract
Block-wise diffusion language models (DLMs) generate multiple tokens in any order, offering a promising alternative to the autoregressive decoding pipeline. However, they remain bottlenecked by memory-bound attention in long-context scenarios. Naive sparse attention fails on DLMs due to a KV Inflation problem: different queries select different prefix positions, which makes the union of accessed KV pages large. To address this, we observe that between consecutive denoising steps, only a small fraction of active tokens exhibit significant hidden-state changes, while the majority of stable tokens remain nearly constant. Based on this insight, we propose LoSA (Locality-aware Sparse Attention), which reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens. This substantially shrinks the number of KV indices that must be loaded,…
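The idea described in the abstract can be illustrated with a toy sketch in plain Python. This is not the authors' implementation; it only mirrors the abstract's description under several assumptions that do not come from the paper: tokens are classified as active by an L2 threshold on the hidden-state change (THRESHOLD is a hypothetical parameter), each query picks its top-K prefix positions by dot-product score (select_top_k is a hypothetical selector), the hidden state is used directly as the query, and KV entries are counted per position rather than per page.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_top_k(query, keys, k):
    """Hypothetical selector: the k prefix positions with the largest q.k score."""
    scores = [dot(query, key) for key in keys]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

def sparse_attention(query, keys, values, idx):
    """Softmax attention for one query, restricted to the prefix positions in idx."""
    logits = [dot(query, keys[i]) for i in idx]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [
        sum(w * values[i][d] for w, i in zip(weights, idx)) / total
        for d in range(len(values[0]))
    ]

random.seed(0)
# Toy sizes and threshold; all of these values are assumptions for illustration.
DIM, PREFIX, BLOCK, K, THRESHOLD = 4, 64, 8, 8, 0.5

def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]

keys = [rand_vec() for _ in range(PREFIX)]
values = [rand_vec() for _ in range(PREFIX)]

# Denoising step t-1: compute and cache prefix attention for every block token.
prev_hidden = [rand_vec() for _ in range(BLOCK)]
cache = [
    sparse_attention(h, keys, values, select_top_k(h, keys, K))
    for h in prev_hidden
]

# Denoising step t: in this toy setup only tokens 2 and 5 change.
curr_hidden = [list(h) for h in prev_hidden]
for i in (2, 5):
    curr_hidden[i] = [x + 1.0 for x in curr_hidden[i]]

# Assumed criterion: a token is active if its hidden state moved more than THRESHOLD.
active = [
    i for i in range(BLOCK)
    if l2_distance(prev_hidden[i], curr_hidden[i]) > THRESHOLD
]

# Naive sparse attention: every query selects its own positions, so the union grows.
loaded_naive = set()
for h in curr_hidden:
    loaded_naive |= set(select_top_k(h, keys, K))

# Locality-aware variant: only active queries select and load prefix KV entries.
loaded_active_only = set()
outputs = list(cache)  # stable tokens reuse their cached prefix-attention result
for i in active:
    idx = select_top_k(curr_hidden[i], keys, K)
    loaded_active_only |= set(idx)
    outputs[i] = sparse_attention(curr_hidden[i], keys, values, idx)

print(f"active tokens: {active}")
print(f"KV positions loaded, naive: {len(loaded_naive)}, "
      f"active only: {len(loaded_active_only)}")
```

In this sketch the naive union can approach BLOCK * K positions, while the active-only union is bounded by K times the number of active tokens. That bound is the sense in which restricting sparse selection to active tokens shrinks the set of KV indices to load.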
