
TL;DR
Exact Linear Attention (ELA) is a kernel-based linear attention mechanism that computes attention with linear complexity and no approximation error, reducing KV cache memory, speeding up decoding, and extending to efficient visual modeling.
Contribution
The paper proposes a novel exact linear attention mechanism with kernel constraints, along with engineering innovations like Hyper-Link, Memory Lobe, and routing bias, advancing efficient Transformer modeling.
Findings
ELA achieves up to 6x faster decoding.
ELA reduces KV cache memory usage by 75%.
ELA extends to vision models as YOLO-LAT, with a reported 4.3x GPU speedup.
Abstract
This paper introduces Exact Linear Attention (ELA), a mechanism that achieves linear computational complexity for Transformer attention by exploiting the exact decomposition property of kernel functions, thereby eliminating approximation error. We identify and address two key limitations of prior linear attention -- gradient explosion and token attention dilution -- by imposing kernel constraints that ensure non-negativity, discriminability, and geometric interpretability. Several kernel functions are proposed, including the Hadamard Exp Kernel, Summation Squared Euclidean Distance Kernel, and Subtraction Squared Euclidean Distance Kernel, each tailored for specific attention behaviors. Beyond the core attention formulation, the paper presents three engineering innovations: (1) a Hyper-Link structure that replaces traditional residual connections to mitigate gradient degradation; (2)…
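The exact-decomposition idea in the abstract can be illustrated with a minimal NumPy sketch. It assumes a degree-2 polynomial kernel, (q·k)^2, chosen only because its feature map is finite and exact; it is not one of the kernels proposed in the paper, and the function names (phi, quadratic_attention, linear_attention) are hypothetical. The sketch is non-causal and checks that the linear-time form matches the quadratic-time form up to floating-point error.

```python
import numpy as np

def phi(x):
    # Feature map for the degree-2 polynomial kernel (an illustrative choice,
    # not a kernel from the paper): phi(x) = vec(x x^T), so that
    # phi(q) . phi(k) == (q . k) ** 2 exactly.
    return np.einsum("ni,nj->nij", x, x).reshape(x.shape[0], -1)

def quadratic_attention(Q, K, V):
    # Reference O(n^2) form: build the full n x n kernel matrix,
    # then normalise each row by its sum.
    scores = (Q @ K.T) ** 2
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def linear_attention(Q, K, V):
    # Linear form: summarise keys and values once, reuse for every query.
    fq, fk = phi(Q), phi(K)
    kv = fk.T @ V          # shape (d*d, d_v): sum over keys of phi(k) v^T
    z = fk.sum(axis=0)     # shape (d*d,): sum over keys of phi(k)
    return (fq @ kv) / (fq @ z)[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))

out_quadratic = quadratic_attention(Q, K, V)
out_linear = linear_attention(Q, K, V)
max_diff = np.max(np.abs(out_quadratic - out_linear))
```

Because the kernel factorises exactly, the two outputs differ only by rounding. The summaries `kv` and `z` have a size that does not depend on sequence length, which is the general reason kernelised attention can decode with a fixed-size state in place of a growing KV cache.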
