LUNA: Linear Universal Neural Attention with Generalization Guarantees
Ashkan Shahbazi, Ping He, Ali Abbasi, Yikun Bai, Xinran Liu, Elaheh Akbari, Darian Salehi, Navid NaderiAlizadeh, Soheil Kolouri

TL;DR
LUNA introduces a learnable kernelized linear attention mechanism that maintains linear computational complexity while matching or surpassing the accuracy of traditional quadratic attention methods, enabling efficient long-sequence processing.
Contribution
LUNA's key innovation is learning the kernel feature map, allowing data-specific adaptation and overcoming the limitations of fixed feature maps in linear attention models.
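As a rough illustration of what a learned feature map in linear attention can look like, here is a minimal NumPy sketch. It is not the authors' implementation: the softplus parameterization, the names `feature_map`, `W`, and `b`, and the non-causal setting are all assumptions made for this example. In a real model `W` and `b` would be trained by gradient descent; here they are just random values.

```python
import numpy as np

def feature_map(x, W, b):
    """Hypothetical learnable map phi(x) = softplus(x W + b).

    W and b stand in for trainable parameters (an assumption of this sketch).
    Softplus keeps every feature strictly positive, so the normalizer below
    is never zero.
    """
    return np.logaddexp(0.0, x @ W + b)

def linear_attention(Q, K, V, W, b):
    """Non-causal kernelized attention without forming the n x n matrix."""
    phi_q = feature_map(Q, W, b)   # (n, m)
    phi_k = feature_map(K, W, b)   # (n, m)
    kv = phi_k.T @ V               # (m, d_v), summed over keys once
    z = phi_k.sum(axis=0)          # (m,), normalizer statistics
    num = phi_q @ kv               # (n, d_v)
    den = phi_q @ z                # (n,)
    return num / den[:, None]

def quadratic_reference(Q, K, V, W, b):
    """Same kernel, computed the quadratic way, for checking only."""
    A = feature_map(Q, W, b) @ feature_map(K, W, b).T   # (n, n)
    A = A / A.sum(axis=1, keepdims=True)
    return A @ V

rng = np.random.default_rng(0)
n, d, m = 128, 16, 32              # sequence length, head dim, feature dim
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
W = rng.standard_normal((d, m)) / np.sqrt(d)
b = np.zeros(m)

out = linear_attention(Q, K, V, W, b)
ref = quadratic_reference(Q, K, V, W, b)
```

The two functions compute the same quantity, so `out` and `ref` should agree up to floating-point error; only `linear_attention` avoids the `(n, n)` intermediate.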
Findings
Achieves state-of-the-art accuracy on the Long Range Arena benchmark.
Outperforms fixed linearization methods in post-hoc conversion experiments.
Maintains linear time and memory scaling with sequence length.
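The linear scaling claimed above follows from the standard kernelized-attention identity, sketched here under assumed notation (the symbols $\phi$, $n$, $d$, and $m$ are not taken from the paper, and LUNA's exact formulation may differ). For a feature map $\phi:\mathbb{R}^{d}\to\mathbb{R}^{m}$ and sequence length $n$:

```latex
\[
\mathrm{out}_i
  = \frac{\sum_{j=1}^{n} \phi(q_i)^{\top}\phi(k_j)\, v_j}
         {\sum_{j=1}^{n} \phi(q_i)^{\top}\phi(k_j)}
  = \frac{\phi(q_i)^{\top} \Bigl(\sum_{j=1}^{n} \phi(k_j)\, v_j^{\top}\Bigr)}
         {\phi(q_i)^{\top} \Bigl(\sum_{j=1}^{n} \phi(k_j)\Bigr)}
\]
```

The two bracketed sums do not depend on $i$, so they can be computed once in $O(nmd)$ time and reused for every query, compared with $O(n^{2}d)$ for forming all pairwise scores.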
Abstract
Scaling attention faces a critical bottleneck: the quadratic computational cost of softmax attention, which limits its application in long-sequence domains. While linear attention mechanisms reduce this cost to linear in the sequence length, they typically rely on fixed random feature maps, such as random Fourier features or hand-crafted functions. This reliance on static, data-agnostic kernels creates a fundamental trade-off, forcing practitioners to sacrifice significant model accuracy for computational efficiency. We introduce LUNA, a kernelized linear attention mechanism that eliminates this trade-off, retaining linear cost while matching and surpassing the accuracy of quadratic attention. LUNA is built on the key insight that the kernel feature map itself should be learned rather than fixed a priori. By parameterizing the kernel, LUNA learns a…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
