Log-Linear Attention
Han Guo, Songlin Yang, Tarushii Goel, Eric P. Xing, Tri Dao, Yoon Kim

TL;DR
This paper introduces log-linear attention, a new mechanism that combines the efficiency of linear attention with the expressiveness of softmax attention by using a logarithmically growing set of hidden states, enabling scalable and effective sequence modeling.
Contribution
It proposes log-linear attention, a novel framework that enhances linear attention with a logarithmically increasing hidden state set, improving expressiveness while maintaining efficiency.
Findings
Log-linear attention achieves log-linear complexity in sequence length.
Applied to Mamba-2 and Gated DeltaNet, it performs well compared to linear-time variants.
The framework is compatible with existing linear attention models.
Abstract
The attention mechanism in Transformers is an important primitive for accurate and scalable sequence modeling. Its quadratic-compute and linear-memory complexity however remain significant bottlenecks. Linear attention and state-space models enable linear-time, constant-memory sequence modeling and can moreover be trained efficiently through matmul-rich parallelization across sequence length. However, at their core these models are still RNNs, and thus their use of a fixed-size hidden state to model the context is a fundamental limitation. This paper develops log-linear attention, an attention mechanism that balances linear attention's efficiency and the expressiveness of softmax attention. Log-linear attention replaces the fixed-size hidden state with a logarithmically growing set of hidden states. We show that with a particular growth function, log-linear attention admits a similarly…
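To make the idea of a logarithmically growing set of hidden states concrete, the sketch below splits a prefix of length `t` into contiguous power-of-two segments, one per set bit of `t`. At most floor(log2 t) + 1 segment summaries are then live at any position. This binary decomposition is an assumption chosen for illustration: the abstract only says that a particular growth function is used, and the name `prefix_buckets` is hypothetical.

```python
def prefix_buckets(t: int) -> list[tuple[int, int]]:
    """Split positions [0, t) into contiguous power-of-two-sized segments.

    Illustrative assumption: one segment per set bit of t, oldest and
    largest segment first. Returns half-open (start, end) index pairs.
    """
    if t < 0:
        raise ValueError("t must be non-negative")
    buckets = []
    start = 0
    # Walk the bits of t from most to least significant.
    for bit in reversed(range(t.bit_length())):
        size = 1 << bit
        if t & size:
            buckets.append((start, start + size))
            start += size
    return buckets


def max_states(t: int) -> int:
    """Upper bound on live segments for a prefix of length t >= 1."""
    return t.bit_length()  # equals floor(log2(t)) + 1


print(prefix_buckets(13))  # [(0, 8), (8, 12), (12, 13)]
```

In a model along these lines, each segment would be summarized by its own fixed-size state and a query would read from all of them, so the per-token cost grows with the number of segments rather than with `t` itself.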
Peer Reviews
Decision: ICLR 2026 Poster
* Potent sub-quadratic runtime alternatives to Transformers are an important open area of research, and this work provides a promising way to improve the modeling quality of such architectures, hopefully bringing us closer to an algorithm capable of fully replacing the Transformer in autoregressive language modeling.
* The proposed method is coherent and intuitive: if we want to increase long-range performance in comparison with pure linear-time algorithms such as Mamba and DeltaNet, it is plausible tha
My judgement is that the paper doesn't have major problems. There are some minor issues, mostly related to exposition/formatting, which I list below.
1. Did you perform any measurements of the memory footprint of the algorithm during inference (prefill, decode) and training workloads? A comparison for different sequence lengths with vanilla Mamba-2 and Gated DeltaNet, as well as with FlashAttention, would be helpful. I understand that it’s likely to be $O(\log(T))$ times greater than aforement
- Clear motivation
- Well written
- The overview of related variants and the view of efficient attention mechanisms as different parametrizations of structured masking matrices is great. It shows how this naturally results in the idea & implementation for log-linear attention.
- To the best of my knowledge log-linear attention is a novel method for expanding the state size.
- The paper provides simple pure PyTorch implementations and shows experiments with optimized kernels (even though code for
- Only small performance improvements over linear counterparts / base methods across several tasks (admitted by authors)
- The authors place log-linear attention as a middle ground between standard attention and linear attention in terms of memory state size. Hence I would expect an example calculation of the memory consumption of log-linear attention, standard linear attention, and a KV cache for various sequence lengths and reasonable model sizes
- No code provided for Mamba-2 and Gated DeltaNet
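The kind of calculation requested above can be roughed out as follows. This is a back-of-the-envelope sketch and not a measurement. It assumes a KV cache stores one key and one value vector per token, head, and layer; a linear-attention layer keeps one `d_k x d_v` state per head; and log-linear attention keeps at most floor(log2 T) + 1 such states. All function names and model sizes below are hypothetical.

```python
def kv_cache_bytes(T, n_layers, n_heads, d_head, bytes_per_el=2):
    """Softmax attention: keys and values for all T cached tokens."""
    return 2 * T * n_layers * n_heads * d_head * bytes_per_el


def linear_state_bytes(n_layers, n_heads, d_k, d_v, bytes_per_el=2):
    """Linear attention: one d_k x d_v state per head, independent of T."""
    return n_layers * n_heads * d_k * d_v * bytes_per_el


def log_linear_state_bytes(T, n_layers, n_heads, d_k, d_v, bytes_per_el=2):
    """Assumed bound: floor(log2(T)) + 1 states of linear-attention size."""
    n_states = T.bit_length()  # floor(log2(T)) + 1 for T >= 1
    return n_states * linear_state_bytes(n_layers, n_heads, d_k, d_v, bytes_per_el)


if __name__ == "__main__":
    # Hypothetical model: 24 layers, 16 heads, head dimension 64, 2-byte elements.
    cfg = dict(n_layers=24, n_heads=16)
    for T in (1024, 16384, 131072):
        kv = kv_cache_bytes(T, d_head=64, **cfg)
        lin = linear_state_bytes(d_k=64, d_v=64, **cfg)
        loglin = log_linear_state_bytes(T, d_k=64, d_v=64, **cfg)
        print(f"T={T:>7}  KV={kv / 2**20:9.1f} MiB  "
              f"linear={lin / 2**20:6.1f} MiB  log-linear={loglin / 2**20:6.1f} MiB")
```

Under these assumptions the KV cache grows linearly in `T`, the linear-attention state stays constant, and the log-linear state grows by one state-sized increment each time `T` doubles. Real implementations may differ in state dimensions, precision, and chunking buffers.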
- fast parallel and hardware-aware implementation
- captures typical inductive bias on shorter time-scales (recency bias)
- theoretically unclear why the extended memory can be effectively used, except for not "bloating" the long-term memory at the highest level with short time-scale information that can be stored in lower levels
- mild improvements on benchmarks
- unclear scaling behavior
Taxonomy
Topics: Topic Modeling · Parallel Computing and Optimization Techniques · Machine Learning in Healthcare
Methods: Attention Is All You Need · Softmax · Sparse Evolutionary Training
