ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers

Chih-Chung Hsu, Xin-Di Ma, Wo-Ting Liao, and Chia-Ming Lee

arXiv:2604.23798 · cs.LG · April 28, 2026

TL;DR

ELSA is an exact, hardware-agnostic attention algorithm for vision transformers that improves speed and memory efficiency without sacrificing precision.

Contribution

It reformulates softmax attention as an associative prefix scan, enabling exact, parallel, and hardware-independent implementation with provable accuracy bounds.
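
As a rough illustration of what "associative prefix scan" means here, the merge rule below is the standard online-softmax combination of two partial results. The reading of the triple $(m, S, W)$ as running maximum, normalizer, and weighted value sum is an assumption: this page names the triple but does not define it, and the paper's exact operator may differ.

```latex
% Assumed partial state for a block B of keys with scores s_j and values v_j:
\[
m_B = \max_{j \in B} s_j, \qquad
S_B = \sum_{j \in B} e^{s_j - m_B}, \qquad
W_B = \sum_{j \in B} e^{s_j - m_B}\, v_j .
\]
% Merging two disjoint blocks rescales both to the shared maximum m:
\[
(m_1, S_1, W_1) \oplus (m_2, S_2, W_2)
= \bigl(m,\; e^{m_1 - m} S_1 + e^{m_2 - m} S_2,\; e^{m_1 - m} W_1 + e^{m_2 - m} W_2\bigr),
\qquad m = \max(m_1, m_2).
\]
% Identity element (with the convention e^{-\infty} = 0) and final output:
\[
e = (-\infty,\, 0,\, 0), \qquad \mathrm{Attn}(q) = W / S .
\]
```

Because $\oplus$ is associative, the partial states can be combined in any bracketing, which is what allows a tree-shaped evaluation of logarithmic depth in place of a left-to-right loop.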

Findings

01. ELSA achieves 1.3-3.5x speedup on A100 benchmarks.
02. ELSA outperforms existing memory-efficient methods on BERT and LLaMA-13B.
03. ELSA operates efficiently on resource-constrained devices like Jetson TX2.

Abstract

Existing attention accelerators often trade exact softmax semantics, depend on fused Tensor Core kernels, or incur sequential depth that limits FP32 throughput on long sequences. We present ELSA, an algorithmic reformulation of online softmax attention that (i) preserves exact softmax semantics in real arithmetic with a provable $O(u \log n)$ FP32 relative error bound; (ii) casts the online softmax update as a prefix scan over an associative monoid $(m, S, W)$, yielding $O(n)$ extra memory and $O(\log n)$ parallel depth; and (iii) is Tensor-Core independent, implemented in Triton and CUDA C++, and deployable as a drop-in replacement requiring no retraining or weight modification. Unlike FlashAttention-2/3, which rely on HMMA/GMMA Tensor Core instructions and provide no compatible FP32 path, ELSA operates identically on A100s and resource-constrained edge…
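
The prefix-scan formulation in (ii) can be sketched in a few lines of NumPy. This is a minimal CPU illustration under stated assumptions, not the authors' Triton or CUDA C++ kernels: the interpretation of $(m, S, W)$ as running maximum, normalizer, and weighted value sum is assumed, the names `combine`, `scan_attention`, and `reference_attention` are hypothetical, and the scan uses the simple Hillis-Steele scheme (logarithmic number of steps, but $O(n \log n)$ total work) rather than whatever scheme the paper uses.

```python
import numpy as np

def combine(left, right):
    """Merge two batches of partial softmax states (m, S, W).

    Assumed meaning: m = running max score, S = sum of exp(score - m),
    W = sum of exp(score - m) * value. Both sides are rescaled to the
    shared maximum, so every exponent is <= 0 and cannot overflow.
    """
    m1, S1, W1 = left
    m2, S2, W2 = right
    m = np.maximum(m1, m2)
    a1 = np.exp(m1 - m)
    a2 = np.exp(m2 - m)
    return m, S1 * a1 + S2 * a2, W1 * a1[:, None] + W2 * a2[:, None]

def scan_attention(q, K, V):
    """Inclusive scan over keys for one query q.

    Row t of the result is softmax attention over keys 0..t, so the
    last row is ordinary attention over all n keys.
    """
    n, d = K.shape
    scores = (K @ q) / np.sqrt(d)
    # Leaf state for key j: (s_j, 1, v_j).
    m = scores.astype(np.float64)
    S = np.ones(n)
    W = np.array(V, dtype=np.float64)
    shift = 1
    while shift < n:  # ceil(log2(n)) rounds
        # Element i absorbs the state held by element i - shift.
        mc, Sc, Wc = combine((m[:-shift], S[:-shift], W[:-shift]),
                             (m[shift:], S[shift:], W[shift:]))
        m = np.concatenate([m[:shift], mc])
        S = np.concatenate([S[:shift], Sc])
        W = np.concatenate([W[:shift], Wc])
        shift *= 2
    return W / S[:, None]

def reference_attention(q, K, V):
    """Plain max-subtracted softmax attention for comparison."""
    s = (K @ q) / np.sqrt(K.shape[1])
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=8)
    K = rng.normal(size=(37, 8))
    V = rng.normal(size=(37, 4))
    out = scan_attention(q, K, V)
    print(np.allclose(out[-1], reference_attention(q, K, V)))
```

Each round of the loop is a single vectorized merge, which stands in for one parallel step on an accelerator. The sketch only checks agreement with a reference softmax in float64; it says nothing about the FP32 error bound or the speedups reported above.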

Code & Models

Repositories

ming053l/ELSA (GitHub)
