RecurFormer: Not All Transformer Heads Need Self-Attention
Ruiqing Yan, Linghan Zheng, Xingbo Du, Han Zou, Yufeng Guo, Jianfei Yang

TL;DR
RecurFormer introduces a novel architecture that replaces certain attention heads in Transformer models with recurrent neural networks, significantly reducing inference costs while maintaining performance on long-input tasks.
Contribution
It proposes a method to replace recency-aware attention heads with RNNs, enabling more efficient inference without sacrificing long-range dependency modeling.
Findings
RecurFormer matches original model performance.
Significant reduction in inference computational costs.
Maintains long-range dependency modeling.
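To make the cost claim concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a standard attention head caches one key and one value vector per past token, while a recurrent head keeps a fixed-size state regardless of sequence length. The function name `cache_elements`, the model dimensions, and the per-head state size are all hypothetical illustrations, not values from the paper.

```python
def cache_elements(seq_len, n_layers, n_heads, head_dim,
                   n_recurrent_heads=0, state_elems_per_head=0):
    """Count cached elements for a model with mixed head types (hypothetical sizing)."""
    attn_heads = n_heads - n_recurrent_heads
    # Each attention head stores a key and a value vector for every past token.
    kv = 2 * seq_len * head_dim * attn_heads
    # Each recurrent head stores a fixed-size state, independent of seq_len.
    rec = state_elems_per_head * n_recurrent_heads
    return n_layers * (kv + rec)

# Hypothetical configuration: 24 layers, 16 heads of dimension 64, 8192 tokens.
baseline = cache_elements(8192, 24, 16, 64)
# Assume half of the heads per layer are replaced, each with a 1024-element state.
hybrid = cache_elements(8192, 24, 16, 64,
                        n_recurrent_heads=8, state_elems_per_head=1024)
ratio = hybrid / baseline
```

Under these assumed numbers, the hybrid cache is roughly half the baseline, and the gap widens with sequence length because only the attention heads' share grows with the number of tokens.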
Abstract
Transformer-based large language models (LLMs) excel in modeling complex language patterns but face significant computational costs during inference, especially with long inputs due to the attention mechanism's memory overhead. We observe that certain attention heads exhibit a distribution where the attention weights concentrate on tokens near the query token, termed as recency aware, which focuses on local and short-range dependencies. Leveraging this insight, we propose RecurFormer, a novel architecture that replaces these attention heads with linear recurrent neural networks (RNNs), specifically the Mamba architecture. This replacement reduces the cache size without evicting tokens, thus maintaining generation quality. RecurFormer retains the ability to model long-range dependencies through the remaining attention heads and allows for reusing pre-trained Transformer-based LLMs…
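One way to picture the recency-aware observation is to measure, for each head, how much attention mass falls on the most recent tokens. The Python sketch below does this for toy attention patterns. The paper's actual selection criterion is not given in this excerpt, so the `window` size, the `threshold`, and all function names here are assumptions for illustration only.

```python
import math

def causal_softmax_rows(scores):
    """Row-wise softmax where query i only sees keys 0..i."""
    rows = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        m = max(visible)                      # subtract max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def recency_ratio(attn_rows, window):
    """Average share of attention mass on the `window` most recent keys."""
    shares = []
    for i, row in enumerate(attn_rows):
        start = max(0, i + 1 - window)
        shares.append(sum(row[start:]))
    return sum(shares) / len(shares)

def select_recency_aware(heads, window, threshold):
    """Indices of heads whose recency ratio reaches `threshold` (hypothetical rule)."""
    return [h for h, rows in enumerate(heads)
            if recency_ratio(rows, window) >= threshold]

n = 16
# Toy head 0: scores decay with distance, so attention is concentrated locally.
local_head = causal_softmax_rows([[-2.0 * abs(i - j) for j in range(n)] for i in range(n)])
# Toy head 1: uniform scores, so attention is spread over the whole prefix.
global_head = causal_softmax_rows([[0.0] * n for _ in range(n)])

selected = select_recency_aware([local_head, global_head], window=4, threshold=0.9)
```

In this toy setup only the locally focused head is flagged as a candidate for replacement, while the head that attends across the whole prefix would remain an attention head.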
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
This paper is relatively easy to understand, and the presentation of the method is clear. The idea of replacing only some of the attention heads with Mamba, rather than all of them, is worth exploring.
However, the experimental results are pretty weak. - The only evaluation metric used is HashHop, which gives a very limited picture of the finetuned model's performance. The authors should consider evaluating on more general, standard language tasks, such as the ones in lm-evaluation-harness. - To evaluate long-context reasoning ability, the authors should consider a standard evaluation on a passkey retrieval dataset. - The proposed approach can be considered as within-layer hybrid Mamba. It
The paper introduces RecurFormer, a practical solution that integrates linear RNNs into Transformer models, specifically replacing certain attention heads with the Mamba architecture to enhance inference efficiency. This approach effectively reduces inference cache size without evicting tokens, thereby maintaining generation quality while addressing the computational challenges associated with Transformer-based LLMs. The methodology is well documented, providing a detailed overview of the selectio
1. In Table 3, the loss comparison indeed shows a noticeable PPL gap between models of different scales compared to the original structure, particularly for the 0.5B parameter model where β was set to 0.5, yet the PPL still increased by more than 2. 2. The paper lacks a performance comparison with MQA, as well as comparisons of hybrid approaches between different heads and layers. This omission makes it difficult to assess RecurFormer's relative advantages in these areas. 3. The benchm
1. The idea presented in this paper of converting short-sighted attention heads into linear attention is innovative. 2. The presentation is clear.
1. The main issue with this paper is that the experiments are not comprehensive enough, as they are only conducted on two synthetic datasets. Additional experiments on a wider range of synthetic datasets and real-world tasks, such as InfiniteBench and LongBench, are needed. 2. The baselines used for comparison are insufficient. Some straightforward solutions, such as converting short-sighted attention heads into sliding window attention like Razor Attention [1], or using linear attention to focu
The motivation behind the paper is solid – cache reduction through token eviction is problematic, and the “linearization” of recency-aware heads is an intriguing and novel approach for non-eviction based cache reduction.
The major issue is the experimental validation, which is quite limited. Standard language benchmarks (Hellaswag, MMLU, etc.) that the base models were evaluated on are missing, so it is not possible to gauge the extent to which the proposed approach degrades (or doesn’t degrade) NLU performance. Though the MQAR ablations are suggestive of possible strengths relative to pure linear models for longer-context tasks, prior work has shown that linearized models [3] struggle at long context
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Softmax · Attention Is All You Need · Mamba: Linear-Time Sequence Modeling with Selective State Spaces
