AdaSplash: Adaptive Sparse Flash Attention

Nuno Gonçalves; Marcos Treviso; André F. T. Martins

arXiv:2502.12082 · cs.CL · June 10, 2025

TL;DR

AdaSplash is a GPU-efficient implementation of $\alpha$-entmax attention that exploits adaptive sparsity to improve the runtime and memory efficiency of transformer attention on long-context tasks.

Contribution

It presents a hybrid Halley-bisection algorithm and custom Triton kernels that enhance the efficiency of $\alpha$-entmax, enabling practical long-context transformer training.
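
For reference, the standard definition of $\alpha$-entmax (for $\alpha > 1$) and the root-finding problem that a Halley-bisection scheme addresses can be written as below. The symbols $s$ (attention scores), $\tau$ (threshold), and $f$ are notation chosen here for illustration; this page does not fix a notation, and the exact safeguarding rule used in the paper is not stated here.

```latex
\begin{align}
\alpha\text{-entmax}(s)_i &= \bigl[(\alpha-1)\,s_i - \tau\bigr]_+^{1/(\alpha-1)},
  \quad \text{with } \tau \text{ such that } \sum_i \alpha\text{-entmax}(s)_i = 1, \\
f(\tau) &= \sum_i \bigl[(\alpha-1)\,s_i - \tau\bigr]_+^{1/(\alpha-1)} - 1 = 0, \\
\tau_{\text{new}} &= \tau - \frac{2\,f(\tau)\,f'(\tau)}{2\,f'(\tau)^2 - f(\tau)\,f''(\tau)}
\end{align}
```

Scores with $(\alpha-1)\,s_i \le \tau$ receive exactly zero probability, which is the source of the sparsity. The last line is the generic Halley update; a natural hybrid, assumed here, keeps a bracket around the root and falls back to a bisection step whenever the Halley step leaves it.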

Findings

01

Achieves a 7-fold reduction in the number of iterations needed to compute the $\alpha$-entmax transformation.

02

Substantially improves runtime and memory efficiency over existing $\alpha$-entmax methods.

03

Approaches or surpasses the efficiency of optimized softmax implementations like FlashAttention-2.

Abstract

The computational cost of softmax-based attention in transformers limits their applicability to long-context tasks. Adaptive sparsity, of which $\alpha$-entmax attention is an example, offers a flexible data-dependent alternative, but existing implementations are inefficient and do not leverage the sparsity to obtain runtime and memory gains. In this work, we propose AdaSplash, which combines the efficiency of GPU-optimized algorithms with the sparsity benefits of $\alpha$-entmax. We first introduce a hybrid Halley-bisection algorithm, resulting in a 7-fold reduction in the number of iterations needed to compute the $\alpha$-entmax transformation. Then, we implement custom Triton kernels to efficiently handle adaptive sparsity. Experiments with RoBERTa and ModernBERT for text classification and single-vector retrieval, along with GPT-2 for language modeling, show that our method…
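
The threshold search described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' Triton implementation: the function names `entmax_bisect` and `entmax_halley_bisect`, the tolerance, the iteration caps, and the rule "use the Halley step only if it stays inside the current bracket" are all assumptions made for illustration. It operates on a single 1-D score vector and assumes $\alpha > 1$.

```python
import numpy as np

def _bracket(z, alpha):
    # Threshold tau lies in [max(z) - 1, max(z) - n^(1 - alpha)]:
    # at the lower end the total mass is >= 1, at the upper end it is <= 1.
    return z.max() - 1.0, z.max() - len(z) ** (1.0 - alpha)

def entmax_bisect(scores, alpha=1.5, n_iter=50):
    """Baseline: alpha-entmax of a 1-D vector via plain bisection on tau."""
    z = (alpha - 1.0) * np.asarray(scores, dtype=np.float64)
    expo = 1.0 / (alpha - 1.0)
    lo, hi = _bracket(z, alpha)
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        mass = np.sum(np.clip(z - tau, 0.0, None) ** expo)
        if mass >= 1.0:   # total mass too large, so raise the threshold
            lo = tau
        else:
            hi = tau
    p = np.clip(z - lo, 0.0, None) ** expo
    return p / p.sum()

def entmax_halley_bisect(scores, alpha=1.5, tol=1e-10, max_iter=50):
    """Hypothetical hybrid: Halley step, bisection fallback (illustrative only)."""
    z = (alpha - 1.0) * np.asarray(scores, dtype=np.float64)
    expo = 1.0 / (alpha - 1.0)
    lo, hi = _bracket(z, alpha)
    tau = 0.5 * (lo + hi)
    for it in range(1, max_iter + 1):
        d = z - tau
        d = d[d > 0.0]                      # current support (never empty)
        f = np.sum(d ** expo) - 1.0         # f(tau) = total mass - 1
        if abs(f) < tol:
            break
        if f > 0.0:                         # shrink the bracket around the root
            lo = tau
        else:
            hi = tau
        f1 = -expo * np.sum(d ** (expo - 1.0))                # f'(tau)
        f2 = expo * (expo - 1.0) * np.sum(d ** (expo - 2.0))  # f''(tau)
        denom = 2.0 * f1 * f1 - f * f2
        cand = tau - 2.0 * f * f1 / denom if denom != 0.0 else np.nan
        # Assumed safeguard: accept the Halley step only inside the bracket.
        tau = cand if lo < cand < hi else 0.5 * (lo + hi)
    p = np.clip(z - tau, 0.0, None) ** expo
    return p / p.sum(), it

scores = [2.0, 1.0, -3.0]
p_bisect = entmax_bisect(scores, alpha=1.5)
p_hybrid, n_steps = entmax_halley_bisect(scores, alpha=1.5)
print(p_bisect, p_hybrid, n_steps)
```

Both functions return a probability vector in which low-scoring entries are exactly zero; the hybrid version also returns the number of iterations it used, which can be compared against the fixed bisection budget. The sketch says nothing about the kernel-level handling of sparsity, which the abstract attributes to custom Triton kernels.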

Code & Models

Repositories

deep-spin/adasplash (PyTorch, official)

Videos

AdaSplash: Adaptive Sparse Flash Attention · SlidesLive

Taxonomy

Topics: Image and Video Quality Assessment

Methods: Attention Is All You Need · Cosine Annealing · Byte Pair Encoding · WordPiece · Layer Normalization · Residual Connection · Linear Layer · Linear Warmup With Linear Decay · Dense Connections