Neural Attention Search Linear: Towards Adaptive Token-Level Hybrid Attention Models
Difan Deng, Andreas Bentzen Winje, Lukas Fehring, Marius Lindauer

TL;DR
This paper introduces NAtS-L, a hybrid attention framework that adaptively combines linear and softmax attention at the token level, improving efficiency and expressivity for long-context models.
Contribution
The paper proposes a novel token-level hybrid attention model that dynamically selects between linear and softmax attention, enhancing efficiency and expressivity in long-context transformers.
Findings
NAtS-L is more efficient than pure softmax models.
The hybrid approach maintains high expressivity for long-term dependencies.
Experimental results demonstrate improved performance on long-context tasks.
Abstract
The quadratic computational complexity of softmax transformers has become a bottleneck in long-context scenarios. In contrast, linear attention model families provide a promising direction towards a more efficient sequential model. These linear attention models compress past KV values into a single hidden state, thereby efficiently reducing complexity during both training and inference. However, their expressivity remains limited by the size of their hidden state. Previous work proposed interleaving softmax and linear attention layers to reduce computational complexity while preserving expressivity. Nevertheless, the efficiency of these models remains bottlenecked by their softmax attention layers. In this paper, we propose Neural Attention Search Linear (NAtS-L), a framework that applies both linear attention and softmax attention operations within the same layer on different tokens.…
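As a rough illustration of the contrast the abstract draws, the sketch below compares a causal linear attention layer, which folds all past keys and values into one fixed-size state, with causal softmax attention, which builds a full T x T score matrix. It is a minimal NumPy sketch under simplifying assumptions (identity feature map, no normalization, no gating, single head) and is not the NAtS-L method itself; all function names are hypothetical.

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    """Causal linear attention with a fixed-size state (assumed: identity
    feature map, no normalization). Memory for the state is O(d_k * d_v),
    independent of sequence length."""
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))            # compressed summary of all past (k, v) pairs
    out = np.empty((T, d_v))
    for t in range(T):
        S = S + np.outer(K[t], V[t])    # fold token t into the state
        out[t] = Q[t] @ S               # read out using only the state
    return out

def linear_attention_parallel(Q, K, V):
    """Same computation written as a masked matrix product, for checking."""
    scores = np.tril(Q @ K.T)           # causal mask: token t sees tokens <= t
    return scores @ V

def softmax_attention_causal(Q, K, V):
    """Standard causal softmax attention; the T x T score matrix is the
    source of the quadratic cost."""
    T, d_k = Q.shape
    scores = (Q @ K.T) / np.sqrt(d_k)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V
```

In the recurrent form, the only thing carried between tokens is `S`, whose size does not grow with the sequence. That is the source of both the efficiency and the expressivity limit the abstract mentions.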
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Advanced Graph Neural Networks
