Transformer Based Linear Attention with Optimized GPU Kernel Implementation
Armin Gerami, Ramani Duraiswami

TL;DR
This paper introduces a highly optimized GPU implementation of linear attention mechanisms in Transformers, significantly improving speed and memory efficiency while maintaining comparable accuracy to traditional softmax attention.
Contribution
The paper presents a novel CUDA-based implementation of linear attention's forward and backward passes, outperforming existing methods in speed and memory usage.
Findings
3.3x faster than state-of-the-art implementations
Memory consumption reduced by 3.6x
Maintains accuracy comparable to regular attention in large language models
Abstract
The original softmax-based attention mechanism (regular attention) in the extremely successful Transformer architecture computes attention between $N$ tokens, each embedded in a $D$-dimensional head, with a time complexity of $O(N^2 D)$. Given the success of Transformers, improving their runtime during both training and inference is a popular research area. One such approach is the linear attention (LA) mechanism, which offers a linear time complexity of $O(N D^2)$ and has demonstrated accuracy comparable to regular attention. However, LA in practice lags behind its theoretical efficiency. We propose a novel method for LA's forward and backward passes, along with a highly optimized CUDA implementation. Our approach outperforms the state-of-the-art by 3.3 times in speed and reduces memory consumption by 3.6 times. We validate these improvements in both single-layer…
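The complexity gap mentioned in the abstract comes from the order in which the matrix products are evaluated. The following is a minimal pure-Python sketch of that idea for N tokens with head dimension D. It is not the paper's method or its CUDA kernel: the feature map `phi` (elu(x) + 1), the non-causal formulation, and all function names are assumptions chosen for illustration. It shows that computing (phi(Q) phi(K)^T) V, which builds an N x N matrix, gives the same output as phi(Q) (phi(K)^T V), which only builds a D x D matrix.

```python
import math

def phi(x):
    # Assumed positive feature map: elu(x) + 1 (not taken from the paper).
    return x + 1.0 if x > 0 else math.exp(x)

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def feature_map(M):
    return [[phi(x) for x in row] for row in M]

def attention_quadratic(Q, K, V):
    """Kernelized attention evaluated as (phi(Q) phi(K)^T) V: O(N^2 D) work."""
    Qf, Kf = feature_map(Q), feature_map(K)
    scores = matmul(Qf, transpose(Kf))      # N x N similarity matrix
    numer = matmul(scores, V)               # N x D
    return [[x / sum(s_row) for x in n_row] for s_row, n_row in zip(scores, numer)]

def attention_linear(Q, K, V):
    """Same quantity evaluated as phi(Q) (phi(K)^T V): O(N D^2) work."""
    Qf, Kf = feature_map(Q), feature_map(K)
    kv = matmul(transpose(Kf), V)           # D x D summary, independent of N
    k_sum = [sum(col) for col in zip(*Kf)]  # length-D normalizer term
    numer = matmul(Qf, kv)                  # N x D
    return [
        [x / sum(q * k for q, k in zip(q_row, k_sum)) for x in n_row]
        for q_row, n_row in zip(Qf, numer)
    ]
```

Both functions return the same N x D output up to floating-point rounding; only the second avoids materializing the N x N matrix, which is where the linear scaling in N comes from.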
