TL;DR
ELSA is an exact, hardware-agnostic attention algorithm for vision transformers that improves speed and memory efficiency without sacrificing precision and runs on diverse hardware platforms.
Contribution
It reformulates softmax attention as an associative prefix scan, enabling exact, parallel, and hardware-independent implementation with provable accuracy bounds.
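To make the scan idea concrete, here is a minimal NumPy sketch for a single query. It is not the ELSA kernel. It assumes the usual online-softmax state (running max, rescaled sum, rescaled weighted values); the names `combine`, `tree_reduce`, and `scan_attention` are hypothetical and chosen only for illustration. The sketch shows only the final reduction; a full prefix scan would apply the same operator to produce every prefix.

```python
import numpy as np

def combine(a, b):
    """Merge two partial softmax states (m, s, o).

    m: running max of the scores seen so far
    s: sum of exp(score - m)
    o: sum of exp(score - m) * value
    The merge is associative, so states can be combined in any grouping.
    """
    m_a, s_a, o_a = a
    m_b, s_b, o_b = b
    m = max(m_a, m_b)
    w_a, w_b = np.exp(m_a - m), np.exp(m_b - m)  # rescale both sides to the new max
    return m, s_a * w_a + s_b * w_b, o_a * w_a + o_b * w_b

def tree_reduce(states):
    """Pairwise reduction; each level could run in parallel, giving log depth."""
    while len(states) > 1:
        merged = [combine(states[i], states[i + 1])
                  for i in range(0, len(states) - 1, 2)]
        if len(states) % 2:          # carry an unpaired last element forward
            merged.append(states[-1])
        states = merged
    return states[0]

def scan_attention(q, K, V):
    """Softmax attention for one query, computed through the monoid."""
    scores = K @ q / np.sqrt(q.shape[0])
    leaves = [(float(s), 1.0, V[i].astype(np.float64))
              for i, s in enumerate(scores)]
    m, s, o = tree_reduce(leaves)
    return o / s

def reference_attention(q, K, V):
    """Standard max-subtracted softmax attention, for comparison."""
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V
```

Because the merge only rescales by the difference of running maxima, the result equals ordinary softmax attention in real arithmetic; in floating point the two paths differ only by rounding.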
Findings
ELSA achieves a 1.3–3.5x speedup on A100 benchmarks.
ELSA outperforms existing memory-efficient methods on BERT and LLaMA-13B.
ELSA operates efficiently on resource-constrained devices like Jetson TX2.
Abstract
Existing attention accelerators often trade exact softmax semantics, depend on fused Tensor Core kernels, or incur sequential depth that limits FP32 throughput on long sequences. We present ELSA, an algorithmic reformulation of online softmax attention that (i) preserves exact softmax semantics in real arithmetic with a provable FP32 relative error bound; (ii) casts the online softmax update as a prefix scan over an associative monoid, yielding extra memory and parallel depth; and (iii) is Tensor-Core independent, implemented in Triton and CUDA C++, and deployable as a drop-in replacement requiring no retraining or weight modification. Unlike FlashAttention-2/3, which rely on HMMA/GMMA Tensor Core instructions and provide no compatible FP32 path, ELSA operates identically on A100s and resource-constrained edge…
