LUNA: Linear Universal Neural Attention with Generalization Guarantees

Ashkan Shahbazi; Ping He; Ali Abbasi; Yikun Bai; Xinran Liu; Elaheh Akbari; Darian Salehi; Navid NaderiAlizadeh; Soheil Kolouri

arXiv:2512.08061 · cs.LG · December 10, 2025

Open Access

TL;DR

LUNA introduces a learnable kernelized linear attention mechanism that maintains linear computational complexity while matching or surpassing the accuracy of traditional quadratic attention methods, enabling efficient long-sequence processing.

Contribution

LUNA's key innovation is learning the kernel feature map, allowing data-specific adaptation and overcoming the limitations of fixed feature maps in linear attention models.

Findings

1. Achieves state-of-the-art accuracy on the Long Range Arena benchmark.
2. Outperforms fixed linearization methods in post-hoc conversion experiments.
3. Maintains linear time and memory scaling with sequence length.

Abstract

Scaling attention faces a critical bottleneck: the $O(n^2)$ quadratic computational cost of softmax attention, which limits its application in long-sequence domains. While linear attention mechanisms reduce this cost to $O(n)$, they typically rely on fixed random feature maps, such as random Fourier features or hand-crafted functions. This reliance on static, data-agnostic kernels creates a fundamental trade-off, forcing practitioners to sacrifice significant model accuracy for computational efficiency. We introduce LUNA, a kernelized linear attention mechanism that eliminates this trade-off, retaining linear cost while matching and surpassing the accuracy of quadratic attention. LUNA is built on the key insight that the kernel feature map itself should be learned rather than fixed a priori. By parameterizing the kernel, LUNA learns a…
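To make the cost argument in the abstract concrete, here is a minimal NumPy sketch of generic kernelized linear attention with a parameterized feature map. It is not LUNA's actual parameterization, which the truncated abstract does not specify. The names `feature_map`, `W`, and `b`, and the choice of a softplus over a linear projection, are assumptions made purely for illustration. The sketch shows how reordering the matrix products avoids forming the $n \times n$ attention matrix while giving the same output as the quadratic form.

```python
import numpy as np

def feature_map(X, W, b):
    # Hypothetical learnable feature map (an assumption, not LUNA's design):
    # a linear projection followed by softplus, which keeps features positive.
    return np.logaddexp(0.0, X @ W + b)  # shape (n, m)

def quadratic_kernel_attention(Q, K, V, W, b):
    # Reference form: builds the full n x n kernel matrix, so O(n^2) in n.
    A = feature_map(Q, W, b) @ feature_map(K, W, b).T  # (n, n)
    return (A / A.sum(axis=1, keepdims=True)) @ V

def linear_attention(Q, K, V, W, b):
    # Same output, but the products are reordered so no n x n matrix appears.
    phi_q = feature_map(Q, W, b)      # (n, m)
    phi_k = feature_map(K, W, b)      # (n, m)
    kv = phi_k.T @ V                  # (m, d_v), one pass over the sequence
    z = phi_k.sum(axis=0)             # (m,), normalizer statistics
    return (phi_q @ kv) / (phi_q @ z)[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, m = 128, 16, 32             # sequence length, head dim, feature dim
    Q, K, V = rng.normal(size=(3, n, d))
    W = rng.normal(size=(d, m)) / np.sqrt(d)  # stand-in for trained weights
    b = np.zeros(m)
    out_lin = linear_attention(Q, K, V, W, b)
    out_quad = quadratic_kernel_attention(Q, K, V, W, b)
    print(np.allclose(out_lin, out_quad))
```

In this sketch the per-sequence cost is dominated by `phi_k.T @ V` and `phi_q @ kv`, both of which grow linearly in `n` for fixed `m` and `d`. In a trained model, `W` and `b` would be optimized with the rest of the network rather than sampled at random, which is the sense in which the feature map is learned rather than fixed.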

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications