Selective Attention Improves Transformer
Yaniv Leviathan, Matan Kalman, Yossi Matias

TL;DR
This paper introduces Selective Attention, a simple, parameter-free modification to the standard attention mechanism that enhances language modeling performance and reduces memory and compute costs by focusing attention on relevant elements.
Contribution
The paper proposes Selective Attention, a novel, parameter-free method that improves transformer performance and efficiency by selectively reducing attention to unneeded elements.
Findings
Improves language modeling performance across various model sizes.
Reduces memory and compute requirements during inference.
Achieves comparable performance with fewer attention heads and smaller context sizes.
Abstract
Unneeded elements in the attention's context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism which reduces attention to unneeded elements. Selective attention consistently improves language modeling and downstream task performance in a variety of model sizes and context lengths. For example, transformers trained with the language modeling objective on C4 with selective attention perform language modeling equivalently to standard transformers with ~2X more heads and parameters in their attention modules. Selective attention also allows decreasing the size of the attention's context buffer, leading to meaningful reductions in the memory and compute requirements during inference. For example, transformers trained on C4 with context sizes of 512, 1,024, and 2,048 need 16X, 25X, and 47X less memory for their attention…
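The abstract only states that Selective Attention reduces attention to unneeded elements and lets the context buffer shrink; it does not spell out the mechanics. The following is a minimal single-head sketch of one way such a mechanism can work, not the authors' implementation. It assumes that each token emits non-negative "selection" scores against earlier tokens, that these scores accumulate into a per-key penalty subtracted from the logits of later queries, and that a token never masks itself or the first token. The names `selective_attention_weights`, `keep_indices`, `selection`, and `budget` are hypothetical and introduced only for illustration.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def selective_attention_weights(logits, selection):
    """Causal attention weights with an accumulated soft mask (illustrative).

    logits[i][j]    : attention logit of query i for key j (only j <= i is used).
    selection[i][j] : how strongly token i asks later queries to ignore token j
                      (assumed input; negative values are clipped to zero).
    Returns (weights, penalty), where penalty[j] is the total masking
    accumulated against key j after all tokens have been processed.
    """
    n = len(logits)
    penalty = [0.0] * n
    weights = []
    for i in range(n):
        # Query i sees only the penalties emitted by tokens before it.
        row = [logits[i][j] - penalty[j] for j in range(i + 1)]
        weights.append(softmax(row) + [0.0] * (n - i - 1))
        # Token i's selections affect queries i+1 onward.
        # range(1, i) skips the first token (j == 0) and token i itself.
        for j in range(1, i):
            penalty[j] += max(0.0, selection[i][j])
    return weights, penalty

def keep_indices(penalty, budget):
    # Assumed pruning rule: keep the `budget` least-penalized keys,
    # preferring more recent tokens when penalties tie.
    order = sorted(range(len(penalty)), key=lambda j: (penalty[j], -j))
    return sorted(order[:budget])

# Usage: four tokens with uniform logits; token 2 marks token 1 as unneeded.
n = 4
logits = [[0.0] * n for _ in range(n)]
selection = [[0.0] * n for _ in range(n)]
selection[2][1] = 10.0
weights, penalty = selective_attention_weights(logits, selection)
kept = keep_indices(penalty, budget=3)
```

In this toy run, the last query places almost no weight on token 1, and a budget of three keys evicts token 1 from the buffer. The sketch adds no learned parameters because the selection scores are taken as an input; where they come from in a real model is outside what this summary states.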
Peer Reviews
Decision: ICLR 2025 Poster
1. **Efficient Memory Management**: The Selective Attention mechanism effectively prunes unneeded tokens, significantly reducing memory usage during inference without degrading model performance. This efficiency gain is particularly valuable for scaling transformers in resource-constrained environments.
2. **No Additional Parameters**: Selective Attention operates without introducing new parameters or significantly increasing computational overhead, which preserves the simplicity of the transfor
1. **Limited Scope of Model Architectures**: The experiments are primarily conducted on decoder-only transformer models. Further analysis is needed to verify if Selective Attention can similarly benefit encoder models or encoder-decoder models used in other tasks, such as translation or summarization.
2. **Potential Over-Reliance on Hyperparameter Tuning**: Selective Attention’s performance may depend on optimal memory budget settings per layer, which could complicate deployment in different tas
- Given the simplicity of the proposed approach and the reported performance gains, it seems that selective attention could be a significant addition to the transformer architecture if it is further validated.
- Even without the performance gains, the efficiency gains (particularly without having to modify the pretraining loss) are quite relevant and make Selective Attention look like a viable alternative to other efficient attention mechanisms.
- The main problem of this paper is weak experimental validation. The authors only show gains in the language modelling task (using the relatively noisy and deprecated C4 dataset) and on a single downstream task (where they show smaller gains). They don't compare to existing pretrained models or other efficient attention mechanisms. While the (limited) experimental results are promising, they are not enough to validate the approach. I suggest replicating the recipe of an existing (state-of-the-a
- The motivation is clear, and the execution of the idea to not attend to tokens that are already attended to is clever and clean.
- The proposed method is easy to implement, adds no parameters, and saves memory overhead.
- Interesting results and analysis.
- My main concern is about the limited empirical evaluation where only one downstream task is considered. As the paper argues, "different tasks have different requirements," it is crucial to explore whether selective attention is broadly applicable by evaluating it on a diverse set of tasks with different requirements. This can be complemented with synthetic evaluations that might require the model to store the entire sequence, e.g., counting the number of a certain token in a sequence.
- Besid
Taxonomy
Topics: Neuroscience, Education and Cognitive Function
Methods: Softmax · Attention Is All You Need
