Log-Linear Attention

Han Guo; Songlin Yang; Tarushii Goel; Eric P. Xing; Tri Dao; Yoon Kim

arXiv:2506.04761·cs.LG·March 3, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces log-linear attention, a new mechanism that combines the efficiency of linear attention with the expressiveness of softmax attention by using a logarithmically growing set of hidden states, enabling scalable and effective sequence modeling.

Contribution

It proposes log-linear attention, a novel framework that enhances linear attention with a logarithmically increasing hidden state set, improving expressiveness while maintaining efficiency.

Findings

01 · Log-linear attention achieves log-linear complexity in sequence length.

02 · Applied to Mamba-2 and Gated DeltaNet, it performs well compared to linear-time variants.

03 · The framework is compatible with existing linear attention models.

Abstract

The attention mechanism in Transformers is an important primitive for accurate and scalable sequence modeling. Its quadratic-compute and linear-memory complexity, however, remain significant bottlenecks. Linear attention and state-space models enable linear-time, constant-memory sequence modeling and can moreover be trained efficiently through matmul-rich parallelization across sequence length. However, at their core these models are still RNNs, and thus their use of a fixed-size hidden state to model the context is a fundamental limitation. This paper develops log-linear attention, an attention mechanism that balances linear attention's efficiency and the expressiveness of softmax attention. Log-linear attention replaces the fixed-size hidden state with a logarithmically growing set of hidden states. We show that with a particular growth function, log-linear attention admits a similarly…
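To make "a logarithmically growing set of hidden states" concrete, here is a minimal sketch of one way a prefix can be split into logarithmically many buckets, each of which could be summarized by its own state. It assumes a Fenwick-tree-style partition in which the current token gets its own bucket and older context is grouped into progressively larger power-of-two buckets. The function name `fenwick_buckets` and this exact scheme are illustrative assumptions; the excerpt above does not specify the growth function.

```python
def fenwick_buckets(t: int) -> list[tuple[int, int]]:
    """Split positions 0..t into half-open (start, end) buckets, newest first.

    Hypothetical helper: the current token t forms its own bucket, and the
    older prefix [0, t) is split into power-of-two-sized buckets that follow
    the set bits of t, so recent context is covered at finer granularity.
    """
    if t < 0:
        raise ValueError("t must be non-negative")
    buckets = [(t, t + 1)]          # the current token on its own
    end = t
    while end > 0:
        size = end & -end           # lowest set bit of `end`
        buckets.append((end - size, end))
        end -= size
    return buckets


# The bucket count is 1 + (number of set bits of t), which is at most
# 1 + t.bit_length(), i.e. it grows logarithmically with the position.
for t in (0, 6, 7, 1023):
    b = fenwick_buckets(t)
    print(f"t={t:5d}  buckets={len(b):2d}  first={b[:4]}")
```

For example, `fenwick_buckets(6)` returns `[(6, 7), (4, 6), (0, 4)]`: bucket sizes 1, 2, and 4, with the most recent tokens in the smallest buckets. A fixed-size recurrent state would instead fold all seven positions into a single summary.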

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

* Potent sub-quadratic runtime alternatives to Transformers are an important open area of research, and this work provides a promising way to improve the modeling quality of such architectures, hopefully bringing us closer to an algorithm capable of fully replacing the Transformer in autoregressive language modeling.
* The proposed method is coherent and intuitive: if we want to increase long-range performance in comparison with pure linear-time algorithms such as Mamba and DeltaNet, it is plausible tha

Weaknesses

My judgement is that the paper doesn't have major problems. There are some minor issues, mostly related to exposition/formatting, which I list below.

1. Did you perform any measurements of the memory footprint of the algorithm during inference (prefill, decode) and training workloads? A comparison for different sequence lengths with vanilla Mamba-2 and Gated DeltaNet, as well as with FlashAttention would be helpful. I understand that it’s likely to be $O(\log(T))$ times greater than aforement

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- Clear motivation
- Well written
- The overview of related variants and the view of efficient attention mechanisms as different parametrizations of structured masking matrices is great. It shows how this naturally results in the idea & implementation for log-linear attention.
- To the best of my knowledge log-linear attention is a novel method for expanding the state size.
- The paper provides simple pure PyTorch implementations and shows experiments with optimized kernels (even though code for

Weaknesses

- Only small performance improvements over linear counterparts / base methods across several tasks (admitted by authors)
- The authors place log-linear attention as a middle ground between standard attention and linear attention in terms of memory state size. Hence, I would expect an exemplary calculation of the memory consumption of log-linear attention, standard linear attention and KV-cache for various sequence lengths and reasonable model sizes
- No code provided for Mamba2 and Gated Delta Net
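The kind of calculation requested in the second point can be roughed out as follows. This is an illustrative sketch, not a measurement or a figure from the paper. It counts stored elements per head per layer and assumes that (a) a KV cache keeps one key and one value vector per token, (b) linear attention keeps a single `d_k × d_v` state, and (c) log-linear attention keeps at most `floor(log2(T)) + 1` states of that same size, in line with the $O(\log(T))$ factor mentioned by Reviewer 01. The dimensions `d_k = d_v = 128` are hypothetical, and gates, convolution states, and dtype width are ignored.

```python
def kv_cache_elems(T: int, d_k: int, d_v: int) -> int:
    """Elements held by a softmax-attention KV cache after T tokens."""
    return T * (d_k + d_v)


def linear_state_elems(d_k: int, d_v: int) -> int:
    """Elements in a single fixed-size linear-attention state."""
    return d_k * d_v


def log_linear_state_elems(T: int, d_k: int, d_v: int) -> int:
    """Assumed upper bound: floor(log2(T)) + 1 states of size d_k x d_v."""
    if T < 1:
        raise ValueError("T must be at least 1")
    num_states = T.bit_length()     # equals floor(log2(T)) + 1 for T >= 1
    return num_states * d_k * d_v


d_k = d_v = 128                     # hypothetical head dimensions
print(f"{'T':>8} {'KV cache':>12} {'linear':>10} {'log-linear':>12}")
for T in (1_024, 16_384, 262_144):
    print(f"{T:>8} {kv_cache_elems(T, d_k, d_v):>12} "
          f"{linear_state_elems(d_k, d_v):>10} "
          f"{log_linear_state_elems(T, d_k, d_v):>12}")
```

Under these assumptions the log-linear footprint grows by one state each time the sequence length doubles, while the KV cache itself doubles. At short sequence lengths the KV cache can be the smaller of the two, so the crossover point depends on the head dimensions chosen.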

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- fast parallel and hardware-aware implementation
- captures typical inductive bias on shorter time-scales (recency bias)

Weaknesses

- theoretically unclear why the extended memory can be effectively used, except for not "bloating" the long-term memory at the highest level with short time-scale information that can be stored in lower levels
- mild improvements on benchmarks
- unclear scaling behavior

Taxonomy

Topics: Topic Modeling · Parallel Computing and Optimization Techniques · Machine Learning in Healthcare

Methods: Attention Is All You Need · Softmax · Sparse Evolutionary Training