Attention Needs to Focus: A Unified Perspective on Attention Allocation
Zichuan Fu, Wentao Song, Guojing Li, Yejing Wang, Xian Wu, Yimin Deng, Hanyu Yan, Yefeng Zheng, Xiangyu Zhao

TL;DR
This paper introduces Lazy Attention, a unified approach to improve attention mechanisms in Transformers by addressing overload and underload issues, leading to more focused attention and better performance.
Contribution
It proposes Lazy Attention, a novel mechanism that combines positional discrimination and Elastic-Softmax to enhance attention focus and mitigate common failures in Transformer models.
Findings
Reduces attention sink effectively.
Achieves up to 59.58% attention sparsity.
Maintains competitive performance across benchmarks.
Abstract
The Transformer architecture, a cornerstone of modern Large Language Models (LLMs), has achieved extraordinary success in sequence modeling, primarily due to its attention mechanism. However, despite its power, the standard attention mechanism is plagued by well-documented issues: representational collapse and attention sink. Although prior work has proposed approaches for these issues, they are often studied in isolation, obscuring their deeper connection. In this paper, we present a unified perspective, arguing that both can be traced to a common root -- improper attention allocation. We identify two failure modes: 1) Attention Overload, where tokens receive comparable high weights, blurring semantic features that lead to representational collapse; 2) Attention Underload, where no token is semantically relevant, yet attention is still forced to distribute, resulting in spurious focus…
Peer Reviews
Decision · Submitted to ICLR 2026
- **Clear conceptual framing**. The paper presents a coherent narrative linking two common attention failure modes as manifestations of improper allocation, supported by intuitive visual analyses.
- **Simplicity and modularity**. Both proposed components, including score-level positional bias and post-softmax filtering, are easy to implement within standard transformer codebases.
- **Comprehensive qualitative evidence**. Attention heatmaps, offset distributions, and learned bias profiles provi
- **Ambiguous core mechanism.** The paper defines Elastic-Softmax as applying a token-wise ReLU to `(softmax – τ_i^h)`, yet later sections describe thresholding as a fixed `τ_h / seq_len`. These two formulations are not equivalent and lead to different scaling behaviors. Moreover, in the reported ablations, the learned τ values often become negative, effectively *adding* probability mass instead of filtering low-attention entries, which is the opposite of the stated mechanism.
- **Unnormalized
1. The paper provides a unified perspective to explain two long-standing problems (representational collapse and attention sink) in Transformer architectures.
2. Extensive experiments on diverse benchmarks and different model scales demonstrate that Lazy Attention not only mitigates attention sink but also, in some cases, achieves performance competitive with state-of-the-art baselines.
1. The paper lacks clear definitions and explanations for key elements in Figure 2. For Figure 2a, the notation "Mask@2" is introduced when inserting a fixed [Mask] token during pre-training, but the paper fails to specify what "2" refers to.
2. In Line 254, the notations "d" and "D" are not explicitly defined.
3. Formula 4 introduces a learnable attention bias term b^(h)_{|i-j|}, which is a head-specific bias dependent on the relative distance |i-j| between tokens. The paper does not discuss the po
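The head-specific distance bias raised in point 3 can be illustrated with a minimal sketch. It assumes the bias is added to the raw query-key scores of one head before the softmax, as the "score-level positional bias" wording in the reviews suggests, and that attention is causal. The function names and the bias values are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_distance_bias(scores, bias):
    """Causal attention weights for one head.

    scores[i][j] is the raw query-key logit; bias[d] is this head's
    bias for relative distance d = |i - j| (assumed learnable).
    """
    weights = []
    for i, row in enumerate(scores):
        # Assumption: causal masking, so query i only sees keys j <= i.
        biased = [row[j] + bias[i - j] for j in range(i + 1)]
        weights.append(softmax(biased))
    return weights


# Uniform logits isolate the effect of the bias itself.
scores = [[0.0] * 4 for _ in range(4)]
bias = [1.0, 0.5, 0.0, -0.5]  # hypothetical values favoring nearby tokens
weights = attention_with_distance_bias(scores, bias)
```

With identical logits, the last row of `weights` places more mass on nearby keys than on distant ones, which is the kind of positional discrimination such a bias is meant to provide. Each row still sums to 1 because the bias is applied before the softmax.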
++ The paper connects collapse and sink via “allocation extremes,” then uses interventions to localize and characterize sink behavior and its dependence on positional encoding. This yields two crisp takeaways about variance footprints and the role of RPEs in shaping weights rather than embeddings.
++ The learnable head‑wise distance biases plus RoPE sharpen focus; the Elastic‑Softmax filter removes low‑relevance mass and alleviates sink. The layer‑wise offset patterns and bias curves are inform
-- The paper defines 𝛼=ReLU(softmax−𝜏_𝑖) and does not re‑normalize before applying 𝑉, so the sum of weights is no longer 1, making the output scale data‑/offset‑dependent. The metric density/sink is clear for softmax (sum=1) but less interpretable when mass is removed. In addition, there are sign/initialization inconsistencies: §4.2 sets 𝜏0^(ℎ)=1 and “divided evenly across the 𝑖 attended tokens,” whereas Table 2 reports the best variant using 𝜏_ℎ/seq_len; Fig. 5 shows negative offsets in early l
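The normalization and threshold concerns raised in the reviews can be made concrete with a minimal sketch. It assumes Elastic-Softmax is exactly the element-wise ReLU(softmax − τ) that the reviewers describe, with no re-normalization. The function names, the `scale_by_length` flag, and all numeric values are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def elastic_softmax(scores, tau, scale_by_length=False):
    """ReLU(softmax(scores) - offset), without re-normalization.

    scale_by_length=False subtracts tau directly;
    scale_by_length=True subtracts tau / seq_len instead.
    """
    probs = softmax(scores)
    offset = tau / len(scores) if scale_by_length else tau
    return [max(p - offset, 0.0) for p in probs]


scores = [2.0, 1.0, 0.0, 0.0]  # hypothetical logits for one query

filtered = elastic_softmax(scores, tau=0.1)
scaled = elastic_softmax(scores, tau=0.1, scale_by_length=True)
negative = elastic_softmax(scores, tau=-0.1)

print(sum(filtered), sum(scaled), sum(negative))
```

In this toy case the direct threshold zeroes the two weakest entries and leaves a row sum below 1, the length-scaled threshold removes less mass from the same logits, and a negative τ pushes the row sum above 1. This matches the reviewers' observations that the two formulations are not equivalent and that the output scale is no longer fixed.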
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling
