Trainable Dynamic Mask Sparse Attention

Jingze Shi; Yifan Wu; Yiran Peng; Bingheng Wu; Liangdong Wang; Guang Liu; Yuyu Luo

arXiv:2508.02124 · cs.AI · November 18, 2025

Open Access

TL;DR

This paper introduces a trainable dynamic mask sparse attention mechanism that adaptively combines position-aware and content-aware strategies, significantly improving efficiency and performance in long-context modeling for large language models.

Contribution

It proposes a novel dynamic mask sparse attention method that is fully differentiable, adaptable, and hardware-efficient, advancing sparse attention techniques for large language models.

Findings

01. Consistently outperforms state-of-the-art sparse attention baselines.
02. Achieves up to 10x speedup in computational efficiency.
03. Effectively balances model efficiency with long-context modeling capabilities.

Abstract

The increasing demand for long-context modeling in large language models (LLMs) is bottlenecked by the quadratic complexity of the standard self-attention mechanism. The community has proposed sparse attention to mitigate this issue. However, position-aware sparse attention methods rely on static sparse structures that lack adaptability to diverse query contexts, while content-aware sparse attention methods depend on heuristic key-value selection, hindering full differentiability. We introduce a trainable dynamic mask sparse attention mechanism, a method that merges the advantages of both position-aware and content-aware approaches. Dynamic Mask Attention (DMA) achieves this through three key innovations: First, it leverages value vector representations to generate content-aware dynamic masks, enabling the model to adaptively identify and attend to critical information. Second, it…
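The first innovation named in the abstract, a content-aware mask derived from value vectors, can be illustrated with a small toy example. This is only a sketch under stated assumptions, not the paper's implementation: the gate vector `w_gate`, the hard top-k selection, the `keep_k` budget, and every function name below are hypothetical choices made for illustration. It also does not attempt to show how the paper keeps the mask trainable, since a hard top-k by itself passes no gradient through the selection.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def value_based_mask(values, w_gate, keep_k):
    # Assumption: one importance score per key position, computed from
    # that position's value vector with a (hypothetical) gate vector.
    importance = [dot(v, w_gate) for v in values]
    ranked = sorted(range(len(values)), key=lambda i: importance[i], reverse=True)
    kept = set(ranked[:keep_k])
    return [i in kept for i in range(len(values))]

def masked_softmax(scores, keep):
    # Softmax restricted to kept positions; masked positions get weight 0.
    peak = max(s for s, k in zip(scores, keep) if k)
    exps = [math.exp(s - peak) if k else 0.0 for s, k in zip(scores, keep)]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_attention(query, keys, values, w_gate, keep_k):
    # Single-query scaled dot-product attention over the kept positions only.
    keep = value_based_mask(values, w_gate, keep_k)
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    weights = masked_softmax(scores, keep)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights, keep

# Toy data: four key/value positions, two kept.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 0.1]]
out, weights, keep = sparse_attention([1.0, 0.0], keys, values, [1.0, 1.0], keep_k=2)
print(keep)  # [False, True, True, False]
```

In this toy version every score is still computed before masking, so it shows only the selection logic. Any efficiency gain would come from skipping the masked positions entirely, which this sketch does not do.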

Taxonomy

Topics: Industrial Vision Systems and Defect Detection · Visual Attention and Saliency Detection