DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration

Hanzhi Zhang; Heng Fan; Kewei Sha; Yan Huang; Yunhe Feng

arXiv:2506.11104 · cs.CL · June 16, 2025

TL;DR

This paper introduces a dynamic attention mask mechanism for long-context LLMs that adaptively assigns attention patterns, reducing computational cost while maintaining performance without fine-tuning.

Contribution

It proposes a novel dynamic sparse attention method that learns context-aware masks, outperforming static methods and enabling efficient long-sequence processing without fine-tuning.

Findings

1. Achieves high alignment with full-attention models.
2. Reduces memory and compute overhead significantly.
3. Maintains retrieval accuracy in long-sequence tasks.

Abstract

Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined masks, failing to capture heterogeneous attention patterns. This results in suboptimal token interactions, limiting adaptability and retrieval accuracy in long-sequence tasks. This work introduces a dynamic sparse attention mechanism that assigns adaptive masks at the attention-map level, preserving heterogeneous patterns across layers and heads. Unlike existing approaches, our method eliminates the need for fine-tuning and predefined mask structures while maintaining computational efficiency. By learning context-aware attention structures, it achieves high alignment with full-attention models, ensuring minimal performance degradation while reducing…
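To make the contrast between static and dynamic masks concrete, the following is a minimal sketch in pure Python. It is not the authors' implementation: the abstract says DAM learns context-aware masks but does not give the procedure here, so `dynamic_mask` uses a simple probability threshold as a stand-in, and all function names, the threshold value, and the toy data are assumptions chosen for illustration. The sketch derives its mask from full attention, which is only reasonable for a toy demonstration.

```python
import math

def softmax(row):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(q, k):
    # Scaled dot-product scores, one row per query token.
    d = len(q[0])
    return [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
            for qi in q]

def masked_attention_probs(scores, mask):
    # Apply a boolean mask (True = keep) before the softmax.
    return [softmax([s if keep else -math.inf for s, keep in zip(srow, mrow)])
            for srow, mrow in zip(scores, mask)]

def sliding_window_mask(n, window):
    # Static causal mask: each query sees itself and the previous window-1 tokens.
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]

def dynamic_mask(full_probs, threshold):
    # Hypothetical stand-in for a learned mask: keep causal positions whose
    # full-attention probability reaches the threshold; always keep the diagonal.
    n = len(full_probs)
    return [[j <= i and (j == i or full_probs[i][j] >= threshold)
             for j in range(n)] for i in range(n)]

# Toy data (assumed): the last query strongly matches the first key,
# imitating a long-range retrieval pattern.
n = 6
k = [[4.0, 0.0]] + [[0.0, 1.0] for _ in range(n - 1)]
q = [[0.0, 1.0] for _ in range(n - 1)] + [[4.0, 0.0]]

scores = attention_scores(q, k)
causal = [[j <= i for j in range(n)] for i in range(n)]
full = masked_attention_probs(scores, causal)

static = sliding_window_mask(n, window=2)
dyn = dynamic_mask(full, threshold=0.1)
sparse = masked_attention_probs(scores, dyn)

print("static mask keeps (5, 0):", static[5][0])
print("dynamic mask keeps (5, 0):", dyn[5][0])
```

In this toy setup the fixed window drops the link from the last token back to the first, while the content-dependent mask keeps it. This is the kind of heterogeneous pattern the abstract says predefined masks fail to capture.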

Code & Models

Repositories

hanzhizhang-ulrica/dam (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis