Selective Attention: Enhancing Transformer through Principled Context Control
Xuechen Zhang, Xiangyu Chang, Mingchen Li, Amit Roy-Chowdhury, Jiasi Chen, Samet Oymak

TL;DR
This paper introduces Selective Self-Attention (SSA), a lightweight enhancement to transformers that uses temperature scaling to control attention sparsity and relevance, improving language modeling performance.
Contribution
The paper proposes SSA, a novel method that applies principled temperature scaling to attention mechanisms, enabling better control over contextual relevance and sparsity in transformers.
Findings
SSA improves language modeling accuracy across benchmarks.
Temperature scaling reduces attention dilution and noise.
SSA is lightweight, adding less than 0.5% additional parameters.
Abstract
The attention mechanism within the transformer architecture enables the model to weigh and combine tokens based on their relevance to the query. While self-attention has enjoyed major success, it notably treats all queries in the same way by applying the mapping $V^\top \mathrm{softmax}(Kq)$, where $V, K$ are the value and key embeddings respectively. In this work, we argue that this uniform treatment hinders the ability to control contextual sparsity and relevance. As a solution, we introduce the Selective Self-Attention (SSA) layer that augments the softmax nonlinearity with a principled temperature scaling strategy. By controlling temperature, SSA adapts the contextual sparsity of the attention map to the query embedding and its position in the context window. Through theory and experiments, we demonstrate that this alleviates attention dilution, aids the optimization…
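To make the role of temperature concrete, the following is a minimal sketch of single-query attention with a temperature applied inside the softmax. It is not the authors' implementation: the abstract describes a temperature that adapts to the query embedding and its position, whereas here `temperature` is a hand-picked constant, and the function names (`softmax`, `attend`, `entropy`) and the toy keys, values, and query are all hypothetical.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax of scores / temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, temperature=1.0):
    """Single-query attention: V^T softmax(K q / temperature).

    `temperature` is a fixed scalar here (an assumption for illustration);
    SSA instead adapts it to the query and its position.
    """
    # One dot-product score per key.
    scores = [sum(k_i * q_i for k_i, q_i in zip(k, query)) for k in keys]
    weights = softmax(scores, temperature)
    dim = len(values[0])
    # Weighted sum of the value vectors.
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

def entropy(weights):
    """Shannon entropy of an attention distribution (lower means sparser)."""
    return -sum(w * math.log(w) for w in weights if w > 0)

# Toy data, invented for this example.
keys = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]

_, w_standard = attend(query, keys, values, temperature=1.0)
_, w_sharp = attend(query, keys, values, temperature=0.25)

print("entropy at temperature 1.0 :", entropy(w_standard))
print("entropy at temperature 0.25:", entropy(w_sharp))
```

With the lower temperature, the attention weights concentrate on the best-matching key and the entropy of the distribution drops, which is the sense in which temperature controls contextual sparsity in this sketch.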
Taxonomy
Topics: Emotion and Mood Recognition
Methods: Attention Is All You Need · Softmax
