DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

Yuxiang Huang; Nuno M. T. Gonçalves; Federico Alvetreti; Lei Li; Xu Han; Edoardo M. Ponti; André F. T. Martins; and Marcos V. Treviso

arXiv:2605.18753·cs.CL·May 19, 2026



TL;DR

DashAttention introduces a fully differentiable, adaptive sparse hierarchical attention mechanism that improves long-context modeling in large language models while maintaining efficiency and accuracy.

Contribution

It proposes a novel differentiable sparse attention method using $\boldsymbol{\alpha}$-entmax, enabling adaptive block selection and better long-context modeling.
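For reference, the $\alpha$-entmax transformation can be written as follows. This is the standard definition from the sparse-attention literature, not a formula quoted from this paper, whose notation may differ; $\mathbf{z}$ denotes a score vector, $\triangle^{n-1}$ the probability simplex, and $\tau(\mathbf{z})$ the threshold that makes the output sum to one.

```latex
\alpha\text{-entmax}(\mathbf{z})
  = \operatorname*{arg\,max}_{\mathbf{p} \in \triangle^{n-1}}
    \; \mathbf{p}^\top \mathbf{z} + H^{T}_{\alpha}(\mathbf{p}),
\qquad
H^{T}_{\alpha}(\mathbf{p})
  = \frac{1}{\alpha(\alpha - 1)} \sum_{j} \left( p_j - p_j^{\alpha} \right),
  \quad \alpha \neq 1,

\left[ \alpha\text{-entmax}(\mathbf{z}) \right]_i
  = \left[ (\alpha - 1)\, z_i - \tau(\mathbf{z}) \right]_{+}^{1/(\alpha - 1)} .
```

In the limit $\alpha \to 1$ this recovers softmax, and $\alpha = 2$ gives sparsemax. For $\alpha > 1$, entries whose scaled score falls below $\tau(\mathbf{z})$ receive exactly zero probability, so the number of nonzero entries depends on the scores rather than on a preset $k$.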

Findings

01

Achieves comparable accuracy to full attention with 75% sparsity.

02

Outperforms NSA and InfLLMv2 in Pareto efficiency, especially at high sparsity.

03

Provides an efficient GPU implementation with significant speedup.

Abstract

Current hierarchical attention methods, such as NSA and InfLLMv2, select the top-k relevant key-value (KV) blocks based on coarse attention scores and subsequently apply fine-grained softmax attention on the selected tokens. However, the top-k operation assumes that the number of relevant tokens for any query is fixed, and it precludes gradient flow between the sparse and dense stages. In this work, we propose DashAttention (Differentiable and Adaptive Sparse Hierarchical Attention), which leverages the adaptively sparse $\alpha$-entmax transformation to select a variable number of blocks according to the current query in the first stage. This in turn provides a prior for the second-stage softmax attention, keeping the entire hierarchy fully differentiable. Contrary to other hierarchical attention methods, we show that DashAttention is non-dispersive, translating to better long-context…
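The two-stage idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation or their GPU kernel. The function names `entmax_bisect` and `two_stage_attention_sketch` are hypothetical, and two design choices are assumptions made only for illustration: block scores come from mean-pooled keys, and the block-level prior enters the second stage as an additive log-probability bias on the token logits.

```python
import numpy as np


def entmax_bisect(z, alpha=1.5, n_iter=60):
    """alpha-entmax of a 1-D score vector, found by bisection on the threshold tau."""
    z = np.asarray(z, dtype=np.float64) * (alpha - 1.0)
    n = z.shape[0]
    power = 1.0 / (alpha - 1.0)
    # p_i(tau) = max(z_i - tau, 0) ** power, and sum_i p_i(tau) decreases in tau.
    lo = z.max() - 1.0                          # here the sum is >= 1
    hi = z.max() - (1.0 / n) ** (alpha - 1.0)   # here the sum is <= 1
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        total = (np.clip(z - tau, 0.0, None) ** power).sum()
        if total >= 1.0:
            lo = tau
        else:
            hi = tau
    p = np.clip(z - lo, 0.0, None) ** power
    return p / p.sum()


def two_stage_attention_sketch(q, K, V, block_size, alpha=1.5):
    """Single-query sketch: entmax block selection, then softmax over kept tokens."""
    n, d = K.shape
    n_blocks = n // block_size
    # Drop any trailing tokens that do not fill a whole block.
    Kb = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    Vb = V[: n_blocks * block_size].reshape(n_blocks, block_size, -1)

    # Stage 1 (assumption: mean-pooled keys represent each block).
    block_scores = Kb.mean(axis=1) @ q / np.sqrt(d)
    p_block = entmax_bisect(block_scores, alpha)
    selected = np.nonzero(p_block > 0.0)[0]     # variable-size support

    # Stage 2 (assumption: log block probability acts as an additive prior).
    token_logits = Kb[selected] @ q / np.sqrt(d)            # (s, block_size)
    token_logits = token_logits + np.log(p_block[selected])[:, None]
    token_logits = token_logits.reshape(-1)
    w = np.exp(token_logits - token_logits.max())
    w = w / w.sum()
    out = w @ Vb[selected].reshape(len(w), -1)
    return out, selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=8)
    K = rng.normal(size=(32, 8))
    V = rng.normal(size=(32, 4))
    out, selected = two_stage_attention_sketch(q, K, V, block_size=4)
    print("blocks kept:", selected, "output shape:", out.shape)
```

Because the block probabilities multiply the second-stage weights instead of gating them through a hard top-k cut, the output in this sketch stays a differentiable function of the block scores on the selected support, and the number of kept blocks changes with the query.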
