TL;DR
DashAttention introduces a fully differentiable, adaptive sparse hierarchical attention mechanism that improves long-context modeling in large language models while maintaining efficiency and accuracy.
Contribution
It proposes a novel differentiable sparse attention method using $\boldsymbol{\alpha}$-entmax, enabling adaptive block selection and better long-context modeling.
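For readers unfamiliar with the transformation, the standard definition of $\alpha$-entmax from the sparse-attention literature is reproduced below. It is not restated in this summary, and the symbols $\mathbf{z}$ (score vector), $\triangle$ (probability simplex), and $\tau$ (normalizing threshold) are notation assumed here for illustration; how DashAttention applies the transformation to block scores is not specified in the text above.

$$
\alpha\text{-entmax}(\mathbf{z}) = \arg\max_{\mathbf{p} \in \triangle} \; \mathbf{p}^\top \mathbf{z} + H^{T}_{\alpha}(\mathbf{p}),
\qquad
H^{T}_{\alpha}(\mathbf{p}) = \frac{1}{\alpha(\alpha-1)} \sum_j \left(p_j - p_j^{\alpha}\right), \quad \alpha \neq 1
$$

$$
p_i = \left[(\alpha-1)\, z_i - \tau\right]_+^{1/(\alpha-1)}, \qquad \text{with } \tau \text{ chosen so that } \sum_i p_i = 1
$$

In the limit $\alpha \to 1$ this recovers softmax, which is always dense; $\alpha = 2$ gives sparsemax. For $\alpha > 1$ the clipping $[\cdot]_+$ assigns exactly zero probability to low-scoring entries, so the number of nonzero entries depends on the scores themselves.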
Findings
Achieves comparable accuracy to full attention with 75% sparsity.
Outperforms NSA and InfLLMv2 in Pareto efficiency, especially at high sparsity.
Provides an efficient GPU implementation with significant speedup.
Abstract
Current hierarchical attention methods, such as NSA and InfLLMv2, select the top-k relevant key-value (KV) blocks based on coarse attention scores and subsequently apply fine-grained softmax attention to the selected tokens. However, the top-k operation assumes that the number of relevant tokens is fixed for every query, and it precludes gradient flow between the sparse and dense stages. In this work, we propose DashAttention (Differentiable and Adaptive Sparse Hierarchical Attention), which leverages the adaptively sparse $\alpha$-entmax transformation to select a variable number of blocks according to the current query in the first stage. This in turn provides a prior for the second-stage softmax attention, keeping the entire hierarchy fully differentiable. Contrary to other hierarchical attention methods, we show that DashAttention is non-dispersive, translating to better long-context…
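The contrast between fixed top-k selection and adaptive selection can be illustrated with a small sketch. This is not the paper's implementation: it uses sparsemax (the $\alpha = 2$ special case of $\alpha$-entmax) on made-up block scores, and the names `sparsemax`, `topk_mask`, `peaked`, and `flat` are hypothetical, chosen only for this example.

```python
import numpy as np


def sparsemax(z):
    """Sparsemax, i.e. alpha-entmax with alpha = 2, via the sort-based algorithm."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]              # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    in_support = 1 + k * z_sorted > cumsum   # true exactly for the first k* sorted entries
    k_star = k[in_support][-1]               # size of the support
    tau = (cumsum[k_star - 1] - 1) / k_star  # threshold that makes the output sum to 1
    return np.maximum(z - tau, 0.0)


def topk_mask(z, k):
    """Boolean mask that keeps the k highest-scoring entries, whatever their values."""
    z = np.asarray(z, dtype=float)
    mask = np.zeros(z.size, dtype=bool)
    mask[np.argsort(z)[::-1][:k]] = True
    return mask


# Hypothetical coarse block scores for two queries (illustrative values only).
peaked = np.array([4.0, 0.5, 0.3, 0.1, 0.0, -0.2])  # one clearly dominant block
flat = np.array([1.0, 0.9, 0.8, 0.7, 0.0, -0.2])    # several comparably relevant blocks

for name, scores in [("peaked", peaked), ("flat", flat)]:
    p = sparsemax(scores)
    print(name,
          "| sparsemax keeps", int(np.count_nonzero(p)), "blocks",
          "| top-2 keeps", int(topk_mask(scores, 2).sum()), "blocks")
```

With these scores, sparsemax keeps a single block for the peaked query and four blocks for the flat one, while top-2 keeps two blocks in both cases. Because the sparsemax output is a continuous, piecewise-linear function of the scores, gradients can flow through the selected entries, whereas a hard top-k mask gives no gradient with respect to which blocks are chosen.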
