LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models

Haocheng Xi; Harman Singh; Yuezhou Hu; Coleman Hooper; Rishabh Tiwari; Aditya Tomar; Minjae Lee; Wonjun Kang; Michael Mahoney; Chenfeng Xu; Kurt Keutzer; Amir Gholami

arXiv:2604.12056 · cs.CL · April 15, 2026

TL;DR

LoSA is a locality-aware sparse attention mechanism for block-wise diffusion language models. It reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens, which improves both attention speed and accuracy at high sparsity.

Contribution

The paper proposes a sparse attention method that addresses the KV Inflation problem by reusing cached prefix-attention results for stable tokens, improving speed and accuracy in long-context DLMs.

Findings

01 Improves accuracy by up to 9 points at high sparsity levels

02 Provides up to 4.14x attention speedup on GPUs

03 Maintains near-dense accuracy at reduced attention density

Abstract

Block-wise diffusion language models (DLMs) generate multiple tokens in any order, offering a promising alternative to the autoregressive decoding pipeline. However, they remain bottlenecked by memory-bound attention in long-context scenarios. Naive sparse attention fails on DLMs due to a KV Inflation problem, where different queries select different prefix positions, making the union of accessed KV pages large. To address this, we observe that between consecutive denoising steps, only a small fraction of active tokens exhibit significant hidden-state changes, while the majority of stable tokens remain nearly constant. Based on this insight, we propose LoSA (Locality-aware Sparse Attention), which reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens. This substantially shrinks the number of KV indices that must be loaded,…
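
The KV Inflation effect and the active/stable split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the L2-norm threshold used to decide which tokens are "active", the helper names, and the hand-written page selections are all assumptions made for illustration, and the reuse of cached prefix-attention results for stable tokens is not modeled here. The sketch only shows how restricting page selection to active tokens shrinks the union of KV pages that must be loaded.

```python
from math import sqrt

def l2_change(prev, curr):
    """L2 norm of one token's hidden-state change between two denoising steps."""
    return sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))

def split_active_stable(prev_states, curr_states, threshold):
    """Partition token indices by hidden-state change (threshold is an assumed criterion)."""
    active, stable = [], []
    for i, (p, c) in enumerate(zip(prev_states, curr_states)):
        (active if l2_change(p, c) > threshold else stable).append(i)
    return active, stable

def kv_union(selected_pages, token_ids):
    """Union of the prefix KV pages selected by the given query tokens."""
    pages = set()
    for i in token_ids:
        pages |= selected_pages[i]
    return pages

# Toy block of 4 tokens with 2-dimensional hidden states (hypothetical values).
prev_states = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
curr_states = [[0.0, 0.01], [1.0, 1.0], [2.9, 2.0], [3.0, 3.02]]

# Each query token picks 2 prefix pages, and the picks do not overlap,
# so sparse attention over all queries touches every page (KV Inflation).
selected_pages = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]

active, stable = split_active_stable(prev_states, curr_states, threshold=0.1)
naive_pages = kv_union(selected_pages, range(len(selected_pages)))
active_pages = kv_union(selected_pages, active)

print("active tokens:", active, "stable tokens:", stable)
print("pages loaded (all queries):", len(naive_pages))
print("pages loaded (active only):", len(active_pages))
```

In this toy case only one token moves by more than the threshold, so the pages to load drop from 8 to 2 even though each query's own budget is unchanged.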
