Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention
Wenhu Zhang, Yiming Wu, Huanyu Wang, Yaoyang Liu, Huanzhang Dou, Senqiao Yang, Sitong Wu, Hanbin Zhao, Jiaya Jia

TL;DR
This paper introduces BA-Att, a novel block sparse attention method for diffusion language models that improves efficiency and stability in long-sequence modeling by identifying informative regions without relying on fixed positional priors.
Contribution
The paper proposes BA-Att, a new block-wise sparse attention framework with theoretical analysis and practical modules that significantly accelerates attention computation while maintaining performance.
Findings
Achieves up to 6.95x faster attention computation than FlashAttention.
Maintains near full-attention performance at 50% sparsity across various models.
Demonstrates strong efficiency and generalization in language, multimodal, and video models.
Abstract
Diffusion Language Models (DLMs) enable globally coherent, bidirectional, and controllable text generation, offering advantages over traditional autoregressive LLMs, but scaling to ultra-long sequences remains costly. Many existing block-sparse attention methods select blocks by fixed sampling patterns over the high-resolution attention space, such as tail regions or anti-diagonal stripes. Such prior-driven sampling can miss salient tokens and introduce instability under distribution shifts. In this paper, we propose the Block Approximate Sparse Attention framework (BA-Att) with a block-wise pre-downsampling operation, which identifies informative regions within a compact downsampled space, avoiding reliance on brittle positional priors. To analyze its theoretical behavior, we define an oracle post-downsample attention map and formalize the approximation error between pre- and…
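The abstract describes selecting informative blocks in a downsampled space rather than by a fixed positional pattern. The sketch below illustrates that general idea only; it is not the authors' BA-Att implementation. It assumes mean pooling as the downsampling operator and a simple top-k rule per query block, and the names `mean_pool_blocks`, `select_blocks`, `block_size`, and `keep_ratio` are hypothetical.

```python
import math


def mean_pool_blocks(x, block_size):
    """Average consecutive token vectors in groups of block_size.

    Assumption: mean pooling stands in for the paper's downsampling operator.
    """
    pooled = []
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        dim = len(block[0])
        pooled.append([sum(row[d] for row in block) / len(block) for d in range(dim)])
    return pooled


def select_blocks(q, k, block_size, keep_ratio):
    """For each query block, return sorted indices of the key blocks to keep.

    Scores are computed between pooled queries and pooled keys, so the cost
    scales with the number of blocks rather than the number of tokens.
    Assumption: top-k selection by scaled dot product; the paper may differ.
    """
    q_pooled = mean_pool_blocks(q, block_size)
    k_pooled = mean_pool_blocks(k, block_size)
    dim = len(q[0])
    n_keep = max(1, math.ceil(keep_ratio * len(k_pooled)))

    selected = []
    for qb in q_pooled:
        scores = [
            sum(a * b for a, b in zip(qb, kb)) / math.sqrt(dim)
            for kb in k_pooled
        ]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        selected.append(sorted(ranked[:n_keep]))
    return selected


# Toy input: 4 tokens, 2 per block, so there are 2 query blocks and 2 key blocks.
q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
k = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mask = select_blocks(q, k, block_size=2, keep_ratio=0.5)
```

With `keep_ratio=0.5`, each query block keeps half of the key blocks, which corresponds to the 50% sparsity setting mentioned in the findings. A full sparse attention kernel would then compute token-level attention only inside the kept blocks.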
