Selective Attention Improves Transformer

Yaniv Leviathan; Matan Kalman; Yossi Matias

arXiv:2410.02703·cs.CL·April 25, 2025·2 cites



Open Access 3 Reviews

TL;DR

This paper introduces Selective Attention, a simple, parameter-free modification to the standard attention mechanism that enhances language modeling performance and reduces memory and compute costs by focusing attention on relevant elements.

Contribution

The paper proposes Selective Attention, a novel, parameter-free method that improves transformer performance and efficiency by selectively reducing attention to unneeded elements.

Findings

01. Improves language modeling performance across various model sizes.

02. Reduces memory and compute requirements during inference.

03. Achieves comparable performance with fewer attention heads and smaller context sizes.

Abstract

Unneeded elements in the attention's context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism which reduces attention to unneeded elements. Selective attention consistently improves language modeling and downstream task performance in a variety of model sizes and context lengths. For example, transformers trained with the language modeling objective on C4 with selective attention perform language modeling equivalently to standard transformers with ~2X more heads and parameters in their attention modules. Selective attention also allows decreasing the size of the attention's context buffer, leading to meaningful reductions in the memory and compute requirements during inference. For example, transformers trained on C4 with context sizes of 512, 1,024, and 2,048 need 16X, 25X, and 47X less memory for their attention…
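The abstract does not spell out how attention to unneeded elements is reduced. The block below is a minimal sketch of one way to realize the idea, not the authors' implementation. It assumes that each token can assign a non-negative "unneeded" score to earlier tokens, that these scores accumulate over time, and that the accumulated penalty is subtracted from the attention logits of later queries before the softmax. The function name `selective_attention_weights`, the `selection` argument, and the choice never to penalize token 0 are illustrative assumptions.

```python
import math


def selective_attention_weights(logits, selection):
    """Causal attention weights for one head with a cumulative penalty.

    logits[i][j]    : raw attention logit of query i on key j (only j <= i is used).
    selection[i][j] : raw score with which token i marks earlier token j as unneeded
                      (hypothetical input; its origin is not specified here).
    """
    n = len(logits)
    # Keep only non-negative scores aimed at strictly earlier tokens.
    # Assumption: token 0 is never penalized and a token cannot penalize itself.
    s = [[max(selection[i][j], 0.0) if 0 < j < i else 0.0 for j in range(n)]
         for i in range(n)]
    # penalty[i][j] sums the scores issued by tokens strictly before i,
    # so a token's selection only affects later queries.
    penalty = [[0.0] * n for _ in range(n)]
    for i in range(1, n):
        for j in range(n):
            penalty[i][j] = penalty[i - 1][j] + s[i - 1][j]
    weights = []
    for i in range(n):
        row = [logits[i][j] - penalty[i][j] for j in range(i + 1)]  # causal mask
        m = max(row)  # subtract the max for a numerically stable softmax
        exps = [math.exp(x - m) for x in row]
        z = sum(exps)
        weights.append([e / z for e in exps] + [0.0] * (n - i - 1))
    return weights, penalty


# Usage: four tokens, uniform logits; token 2 marks token 1 as unneeded.
n = 4
logits = [[0.0] * n for _ in range(n)]
selection = [[0.0] * n for _ in range(n)]
selection[2][1] = 10.0
weights, penalty = selective_attention_weights(logits, selection)
print(weights[3])  # query 3 now places almost no weight on key 1
```

Under these assumptions, a key whose accumulated penalty is large contributes almost nothing to later queries, which is what would make it a candidate for removal from the context buffer.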

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. **Efficient Memory Management**: The Selective Attention mechanism effectively prunes unneeded tokens, significantly reducing memory usage during inference without degrading model performance. This efficiency gain is particularly valuable for scaling transformers in resource-constrained environments.
2. **No Additional Parameters**: Selective Attention operates without introducing new parameters or significantly increasing computational overhead, which preserves the simplicity of the transfor

Weaknesses

1. **Limited Scope of Model Architectures**: The experiments are primarily conducted on decoder-only transformer models. Further analysis is needed to verify if Selective Attention can similarly benefit encoder models or encoder-decoder models used in other tasks, such as translation or summarization.
2. **Potential Over-Reliance on Hyperparameter Tuning**: Selective Attention’s performance may depend on optimal memory budget settings per layer, which could complicate deployment in different tas

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- Given the simplicity of the proposed approach and the reported performance gains, it seems that selective attention could be a significant addition to the transformer architecture if it is further validated.
- Even without the performance gains, the efficiency gains (particularly without having to modify the pretraining loss) are quite relevant and make Selective Attention look like a viable alternative to other efficient attention mechanisms.

Weaknesses

- The main problem of this paper is weak experimental validation. The authors only show gains in the language modelling task (using the relatively noisy and deprecated C4 dataset) and on a single downstream task (where they show smaller gains). They don't compare to existing pretrained models or other efficient attention mechanisms. While the (limited) experimental results are promising, they are not enough to validate the approach. I suggest replicating the recipe of an existing (state-of-the-a

Reviewer 03 · Rating 5 · Confidence 4

Strengths

- The motivation is clear, and the execution of the idea to not attend to tokens that are already attended to is clever and clean
- The proposed method is easy to implement, adds no parameters, and saves memory overhead
- Interesting results and analysis

Weaknesses

- My main concern is about the limited empirical evaluation where only one downstream task is considered. As the paper argues, "different tasks have different requirements," it is crucial to explore whether selective attention is broadly applicable by evaluating it on a diverse set of tasks with different requirements. This can be complemented with synthetic evaluations that might require the model to store the entire sequence, e.g., counting the number of a certain token in a sequence.
- Besid
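The synthetic evaluation this reviewer suggests, counting occurrences of a token in a sequence, could be generated along the following lines. This is a hypothetical sketch: the function name `make_counting_example`, the uniform sampling, and the toy vocabulary are assumptions for illustration and do not come from the paper or the review.

```python
import random


def make_counting_example(length, vocab, target, seed=None):
    """Return (tokens, count): a random sequence and how often `target` occurs.

    Answering correctly requires information about every position, so a model
    that drops tokens from its context too aggressively should fail.
    """
    rng = random.Random(seed)  # local generator so examples are reproducible
    tokens = [rng.choice(vocab) for _ in range(length)]
    return tokens, tokens.count(target)


# Usage: one example over a toy three-symbol vocabulary.
tokens, count = make_counting_example(16, ["a", "b", "c"], "a", seed=0)
print(" ".join(tokens), "->", count)
```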


Taxonomy

Topics: Neuroscience, Education and Cognitive Function

Methods: Softmax · Attention Is All You Need