AdaSplash: Adaptive Sparse Flash Attention
Nuno Gonçalves, Marcos Treviso, André F. T. Martins

TL;DR
AdaSplash introduces an efficient GPU-based method combining adaptive sparsity and $\alpha$-entmax to significantly improve runtime and memory efficiency of attention mechanisms in transformers for long-context tasks.
Contribution
It presents a hybrid Halley-bisection algorithm and custom Triton kernels that enhance the efficiency of $\alpha$-entmax, enabling practical long-context transformer training.
Findings
Achieves a 7-fold reduction in the number of iterations needed to compute the $\alpha$-entmax transformation.
Substantially improves runtime and memory efficiency over existing $\alpha$-entmax methods.
Approaches or surpasses the efficiency of optimized softmax implementations like FlashAttention-2.
Abstract
The computational cost of softmax-based attention in transformers limits their applicability to long-context tasks. Adaptive sparsity, of which $\alpha$-entmax attention is an example, offers a flexible data-dependent alternative, but existing implementations are inefficient and do not leverage the sparsity to obtain runtime and memory gains. In this work, we propose AdaSplash, which combines the efficiency of GPU-optimized algorithms with the sparsity benefits of $\alpha$-entmax. We first introduce a hybrid Halley-bisection algorithm, resulting in a 7-fold reduction in the number of iterations needed to compute the $\alpha$-entmax transformation. Then, we implement custom Triton kernels to efficiently handle adaptive sparsity. Experiments with RoBERTa and ModernBERT for text classification and single-vector retrieval, along with GPT-2 for language modeling, show that our method…
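To make the idea of a hybrid Halley-bisection solver concrete, here is a minimal NumPy sketch. It is not the authors' Triton kernel and is not taken from the paper; it assumes the standard closed form of $\alpha$-entmax, $p_i = [(\alpha-1) z_i - \tau]_+^{1/(\alpha-1)}$, where the threshold $\tau$ is chosen so that the outputs sum to 1. The function name `entmax_halley_bisect`, its parameters, and the fallback rule (take a bisection step whenever the Halley step leaves the current bracket) are illustrative assumptions.

```python
import numpy as np

def entmax_halley_bisect(z, alpha=1.5, n_iter=20, tol=1e-10):
    """Illustrative alpha-entmax for a 1-D score vector (hypothetical helper).

    Finds tau with f(tau) = sum_i [(alpha-1) z_i - tau]_+^p - 1 = 0,
    where p = 1 / (alpha - 1), using Halley steps guarded by a bisection bracket.
    """
    z = np.asarray(z, dtype=np.float64)
    s = (alpha - 1.0) * z
    p = 1.0 / (alpha - 1.0)
    n = s.size

    # f is decreasing in tau; these bounds give f(lo) >= 0 and f(hi) <= 0.
    lo = s.max() - 1.0
    hi = s.max() - n ** (1.0 - alpha)
    tau = 0.5 * (lo + hi)

    for _ in range(n_iter):
        active = (s - tau)[s - tau > 0]   # entries in the current support
        f = np.sum(active ** p) - 1.0
        if abs(f) < tol:
            break

        # Shrink the bracket using the sign of f.
        if f > 0:
            lo = tau
        else:
            hi = tau

        # First and second derivatives of f with respect to tau.
        f1 = -p * np.sum(active ** (p - 1.0))
        f2 = p * (p - 1.0) * np.sum(active ** (p - 2.0))

        # Halley update; fall back to bisection if it is unusable.
        denom = 2.0 * f1 * f1 - f * f2
        cand = tau - 2.0 * f * f1 / denom if denom != 0 else np.nan
        if not (lo < cand < hi):          # also catches NaN
            cand = 0.5 * (lo + hi)
        tau = cand

    probs = np.clip(s - tau, 0.0, None) ** p
    return probs / probs.sum()            # renormalize to absorb residual error


if __name__ == "__main__":
    print(entmax_halley_bisect([2.0, 1.0, 0.1, -3.0], alpha=1.5))
```

Unlike softmax, the output assigns exactly zero probability to low-scoring entries, which is the sparsity that a GPU kernel could in principle exploit by skipping work for those entries.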
Taxonomy
Topics: Image and Video Quality Assessment
Methods: Attention Is All You Need · Cosine Annealing · Byte Pair Encoding · WordPiece · Layer Normalization · Residual Connection · Linear Layer · Linear Warmup With Linear Decay · Dense Connections
