ReAttention: Training-Free Infinite Context with Finite Attention Scope
Xiaoran Liu, Ruixiao Li, Qipeng Guo, Zhigeng Liu, Yuerong Song, Kai Lv, Hang Yan, Linlin Li, Qun Liu, Xipeng Qiu

TL;DR
ReAttention is a training-free method that extends the context length of large language models to infinite or multi-million-token contexts using a finite attention scope.
Contribution
It introduces a novel position-agnostic top-k attention mechanism that enables LLMs to handle infinite contexts efficiently without retraining.
Findings
Supports context lengths of at least 1 million tokens.
Enables existing LLMs to expand context length by over 100 times.
Maintains performance comparable to traditional methods.
Abstract
The long-context capability of Large Language Models (LLMs) has made significant breakthroughs, but the maximum supported context length in length extrapolation remains a critical bottleneck limiting their practical applications. The constraint of context length in LLMs arises from the self-attention mechanism, which cannot effectively and efficiently capture the semantic relationships within infinitely long contexts via the limited pre-trained positional information and attention scope. In this work, we propose ReAttention, a training-free approach enabling LLMs based on the self-attention mechanism to support an infinite context with a finite attention scope under sufficient memory resources. ReAttention performs the position-agnostic top-k attention before the ordinary position-aware self-attention, freeing LLMs from the length extrapolation issue. We validate the performance of…
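The two-stage idea described in the abstract (a position-agnostic top-k selection followed by ordinary attention over the selected entries) can be illustrated with a minimal single-query, single-head sketch. This is not the authors' implementation: the function names `select_top_k`, `attend`, and `two_stage_attention` are hypothetical, the selection uses raw dot-product scores as a stand-in for "attention score without positional embedding", and the second stage omits the positional embedding a real model would apply to the selected entries. Other details of the paper's cache management are not modeled.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_top_k(query: Vector, keys: Sequence[Vector], k: int) -> List[int]:
    """Stage 1 (position-agnostic): score every cached key against the query
    with a plain dot product (no positional embedding) and return the indices
    of the k highest-scoring keys, restored to their original order."""
    scores = [dot(query, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: Sequence[Vector],
           values: Sequence[Vector]) -> List[float]:
    """Stage 2: scaled dot-product attention over the selected entries only.
    Assumption: a real model would apply its positional embedding to the
    query and selected keys here; that step is omitted in this sketch."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, key) * scale for key in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def two_stage_attention(query: Vector, keys: Sequence[Vector],
                        values: Sequence[Vector], k: int) -> List[float]:
    """Select k cache entries without positions, then attend over them."""
    idx = select_top_k(query, keys, k)
    return attend(query, [keys[i] for i in idx], [values[i] for i in idx])
```

Because the second stage only ever sees `k` entries, its attention scope stays fixed no matter how many keys are cached, which is the sense in which a finite scope can cover an arbitrarily long context in this toy setting.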
Peer Reviews
Decision·ICLR 2025 Poster
1. The method is intuitive. ReAttention maintains KV caches of tokens that are more frequently attended to by the latest tokens, which reduces memory overhead and removes redundant contextual segments from being attended. 2. The evaluation is comprehensive. The authors provide evaluations for many models and benchmarks. 3. The performance improvement is notable, compared to InfLLM and full attention.
1. The two-round attention operation introduces non-trivial overhead, which may result in latency similar to full attention. 2. Efficiency in terms of latency and throughput may be the major drawback of this method. 3. The method requires sufficient memory resources where full attention is also feasible. The advantage over full attention lies in evicting irrelevant tokens, similar to token-dropping methods, which improves performance. However, efficiency in long-sequence tasks remains a
- The problem is highly relevant and timely, as context length remains a critical bottleneck for LLMs. The paper clearly articulates three key conditions for infinite context extension (lines 040-046). - The proposed method is simple and training-free. The two-stage attention approach (position-agnostic selection followed by regular attention) is intuitive and well-motivated. - The empirical results are impressive. They demonstrate: (1) Comparable or better performance vs full attention across
- The paper has several writing issues and typos that should be addressed: e.g., Llama vs LLaMA and the wrong quotation in line 067. - The related work section omits several relevant recent papers on efficient attention and KV-cache optimization, such as: SnapKV [1] for efficient cache management, PyramidKV [2,3] for hierarchical cache structures, GemFilter [4] for attention-filtering, and so on. These works tackle similar challenges and a comparison would strengthen the paper. It would be
The method is motivated by an interesting observation. “We also find that the third condition can be satisfied through the attention score without positional embedding.” This insight that attention scores without positional embeddings can effectively identify salient tokens is quite interesting and, to my knowledge, novel. Method works on pre-trained models. Unlike approaches like Mamba, which require training models from scratch with new architectures, ReAttention can be applied directly
Baselines in the long context evals are missing (Claim 2). Why is StreamingLLM included in the LongBench results but not the InfiniteBench results? And why is InfLLM included in the InfiniteBench results but not the LongBench results? Baselines in million-scale context seem to be missing (Claim 3). For the needle in a haystack experiments supporting Claim 3, the paper should compare against other methods capable of handling very long contexts like InfLLM and other recent approaches. Withou
Taxonomy
Topics: Neural Networks and Applications
Methods: Softmax · Attention Is All You Need
