Trainable Dynamic Mask Sparse Attention
Jingze Shi, Yifan Wu, Yiran Peng, Bingheng Wu, Liangdong Wang, Guang Liu, Yuyu Luo

TL;DR
This paper introduces a trainable dynamic mask sparse attention mechanism that adaptively combines position-aware and content-aware strategies, significantly improving efficiency and performance in long-context modeling for large language models.
Contribution
It proposes a novel dynamic mask sparse attention method that is fully differentiable, adaptable, and hardware-efficient, advancing sparse attention techniques for large language models.
Findings
Consistently outperforms state-of-the-art sparse attention baselines.
Achieves up to a 10x computational speedup.
Effectively balances model efficiency with long-context modeling capabilities.
Abstract
The increasing demand for long-context modeling in large language models (LLMs) is bottlenecked by the quadratic complexity of the standard self-attention mechanism. The community has proposed sparse attention to mitigate this issue. However, position-aware sparse attention methods rely on static sparse structures that lack adaptability to diverse query contexts, while content-aware sparse attention methods depend on heuristic key-value selection, hindering full differentiability. We introduce a trainable dynamic mask sparse attention mechanism, a method that merges the advantages of both position-aware and content-aware approaches. Dynamic Mask Attention (DMA) achieves this through three key innovations: First, it leverages value vector representations to generate content-aware dynamic masks, enabling the model to adaptively identify and attend to critical information. Second, it…
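As a rough illustration of the first idea in the abstract (deriving a content-aware mask from the value vectors), here is a minimal NumPy sketch. It is not the authors' implementation: the abstract is truncated here, so the scoring projection `w_score`, the hard top-k selection with budget `keep`, the causal restriction, and the choice to add the importance score to the attention logits are all assumptions made for the example.

```python
import numpy as np

def dynamic_mask_attention(q, k, v, w_score, keep):
    """Single-head causal attention with a value-derived sparse mask.

    q, k, v : arrays of shape (seq_len, d)
    w_score : array of shape (d,), a hypothetical learned projection
    keep    : hypothetical budget of key positions kept per query
    """
    n, d = q.shape

    # Assumed form: one importance score per key position, computed from
    # its value vector.
    importance = v @ w_score                                  # (n,)

    # Causal constraint: query i may only see keys 0..i.
    causal = np.tril(np.ones((n, n), dtype=bool))
    bias = np.where(causal, importance[None, :], -np.inf)     # (n, n)

    # Keep the `keep` highest-scoring visible keys in each row.
    k_eff = min(keep, n)
    thresh = np.sort(bias, axis=1)[:, -k_eff][:, None]
    mask = causal & (bias >= thresh)

    # Assumption: the importance score is also added to the logits so that
    # it influences the attention weights of the kept positions.
    scores = q @ k.T / np.sqrt(d) + importance[None, :]
    scores = np.where(mask, scores, -np.inf)

    # Numerically stable softmax over the kept positions only.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, mask

# Usage with random inputs (shapes chosen only for the example).
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 6, 4))
w_score = rng.normal(size=4)
out, mask = dynamic_mask_attention(q, k, v, w_score, keep=2)
```

Each query row attends to at most `keep` earlier positions, so the work per query scales with the budget rather than with the sequence length. The hard top-k step used here is only a stand-in and does not reproduce the full differentiability the abstract claims for the method.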
Taxonomy
Topics: Industrial Vision Systems and Defect Detection · Visual Attention and Saliency Detection
