Transformer Based Linear Attention with Optimized GPU Kernel Implementation

Armin Gerami; Ramani Duraiswami

arXiv:2510.21956·cs.LG·October 28, 2025

TL;DR

This paper introduces a highly optimized CUDA implementation of linear attention for Transformers that runs 3.3x faster and uses 3.6x less memory than the state of the art, while maintaining accuracy comparable to regular softmax attention.

Contribution

The paper presents a novel CUDA-based implementation of linear attention's forward and backward passes, outperforming existing methods in speed and memory usage.

Findings

01. 3.3x faster than state-of-the-art implementations

02. Memory consumption reduced by 3.6x

03. Maintains accuracy comparable to regular attention in large language models

Abstract

The original softmax-based attention mechanism (regular attention) in the extremely successful Transformer architecture computes attention between $N$ tokens, each embedded in a $D$-dimensional head, with a time complexity of $O(N^{2} D)$. Given the success of Transformers, improving their runtime during both training and inference is a popular research area. One such approach is the introduction of linear attention (LA) mechanisms, which offer a time complexity of $O(N D^{2})$, linear in $N$, and have demonstrated accuracy comparable to regular attention. However, LA in practice lags behind its theoretical efficiency. We propose a novel method for LA's forward and backward passes, along with a highly optimized CUDA implementation. Our approach outperforms the state of the art by 3.3 times in speed and reduces memory consumption by 3.6 times. We validate these improvements in both single-layer…
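The complexity gap described in the abstract comes from the order in which the matrix products are evaluated. The sketch below is only an illustration of that general idea, not the paper's method or its CUDA kernel: it assumes a non-causal setting and a generic positive feature map (`phi`, here elu(x) + 1), both of which are assumptions chosen for the example. All function names are hypothetical.

```python
import math

def phi(x):
    # elu(x) + 1, a positive feature map (an assumption for this sketch,
    # not taken from the paper).
    return x + 1.0 if x > 0 else math.exp(x)

def feature_map(X):
    return [[phi(x) for x in row] for row in X]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    # Plain nested-list matrix product, kept simple for readability.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def attention_quadratic(Q, K, V):
    # Builds the N x N score matrix explicitly: O(N^2 D) time.
    Qf, Kf = feature_map(Q), feature_map(K)
    S = matmul(Qf, transpose(Kf))            # N x N
    out = matmul(S, V)                       # N x D
    return [[o / sum(s_row) for o in o_row]  # normalize each row
            for o_row, s_row in zip(out, S)]

def attention_linear(Q, K, V):
    # Same result, but K^T V is formed first: O(N D^2) time,
    # and no N x N matrix is ever stored.
    Qf, Kf = feature_map(Q), feature_map(K)
    KV = matmul(transpose(Kf), V)            # D x D
    k_sum = [sum(col) for col in zip(*Kf)]   # length D
    out = matmul(Qf, KV)                     # N x D
    z = [sum(q * k for q, k in zip(row, k_sum)) for row in Qf]
    return [[o / zi for o in row] for row, zi in zip(out, z)]
```

Both functions compute the same normalized output; only the parenthesization differs. When $N$ is much larger than $D$, the second ordering avoids the $N \times N$ intermediate entirely, which is where the theoretical savings in time and memory come from.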
