Selective Attention: Enhancing Transformer through Principled Context Control

Xuechen Zhang; Xiangyu Chang; Mingchen Li; Amit Roy-Chowdhury; Jiasi Chen; Samet Oymak

arXiv:2411.12892·cs.LG·November 21, 2024

TL;DR

This paper introduces Selective Self-Attention (SSA), a lightweight enhancement to transformers that uses temperature scaling to control attention sparsity and relevance, improving language modeling performance.

Contribution

The paper proposes SSA, a novel method that applies principled temperature scaling to attention mechanisms, enabling better control over contextual relevance and sparsity in transformers.

Findings

01 SSA improves language modeling accuracy across benchmarks.

02 Temperature scaling reduces attention dilution and noise.

03 SSA is lightweight, adding less than 0.5% additional parameters.
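
To make the second finding concrete, here is a minimal sketch of attention dilution in a plain softmax, and of how a lower temperature counteracts it. The scores (2.0 for one relevant token, 0.0 for each distractor) and the temperature 0.25 are illustrative assumptions, not values from the paper.

```python
import math

def softmax(scores, temperature=1.0):
    """Softmax of scores / temperature, computed with max-subtraction for stability."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_weight(n_context, temperature):
    """Weight on one relevant token (score 2.0) among n_context - 1 distractors (score 0.0).

    The scores are made-up toy values chosen only to illustrate dilution.
    """
    scores = [2.0] + [0.0] * (n_context - 1)
    return softmax(scores, temperature)[0]

if __name__ == "__main__":
    # Columns: context length, weight at temperature 1.0, weight at temperature 0.25.
    for n in (10, 100, 1000):
        print(n, round(top_weight(n, 1.0), 4), round(top_weight(n, 0.25), 4))
```

At a fixed temperature the weight on the relevant token shrinks as the context grows; dividing the scores by a smaller temperature sharpens the distribution and restores much of that weight.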

Abstract

The attention mechanism within the transformer architecture enables the model to weigh and combine tokens based on their relevance to the query. While self-attention has enjoyed major success, it notably treats all queries $q$ in the same way by applying the mapping $V^{\top}\,\mathrm{softmax}(Kq)$, where $V$ and $K$ are the value and key embeddings respectively. In this work, we argue that this uniform treatment hinders the ability to control contextual sparsity and relevance. As a solution, we introduce the Selective Self-Attention (SSA) layer that augments the softmax nonlinearity with a principled temperature scaling strategy. By controlling temperature, SSA adapts the contextual sparsity of the attention map to the query embedding and its position in the context window. Through theory and experiments, we demonstrate that this alleviates attention dilution, aids the optimization…
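
The mapping in the abstract can be sketched for a single query as follows. This is a hedged illustration, not the authors' implementation: `attend` computes $V^{\top}\,\mathrm{softmax}(Kq/\tau)$ for a given temperature $\tau$, and `toy_temperature` is a hypothetical stand-in for a temperature that depends on the query embedding and its position. The abstract does not give the actual parametrization used by SSA, so the functional form here is an assumption.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(K, V, q, temperature=1.0):
    """Return V^T softmax(K q / temperature) for one query vector q.

    K and V are lists of row vectors (one key and one value per context token).
    temperature = 1.0 recovers the uniform treatment described in the abstract.
    """
    scores = [sum(k_i * q_i for k_i, q_i in zip(k, q)) / temperature for k in K]
    weights = softmax(scores)
    dim = len(V[0])
    return [sum(w * v[j] for w, v in zip(weights, V)) for j in range(dim)]

def toy_temperature(q, position):
    """Hypothetical temperature: positive, depends on the query and its position.

    This form is an assumption made for illustration; it is NOT taken from the paper.
    It shrinks with position so that later queries get sharper attention maps.
    """
    query_term = 0.5 + 1.0 / (1.0 + math.exp(-sum(q)))  # value in (0.5, 1.5)
    return query_term / (1.0 + math.log1p(position))

if __name__ == "__main__":
    K = [[1.0, 0.0], [0.0, 1.0]]
    V = [[1.0, 0.0], [0.0, 1.0]]
    q = [2.0, 0.0]
    print(attend(K, V, q))                                   # uniform temperature
    print(attend(K, V, q, temperature=toy_temperature(q, 50)))  # query- and position-dependent
```

Passing the temperature as an argument keeps the standard attention path unchanged, so the only extra work per query is evaluating the small temperature function.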

Code & Models

Repositories

umich-sota/selective_attention (PyTorch, official)

Videos

Selective Attention: Enhancing Transformer through Principled Context Control · SlidesLive

Taxonomy

Topics: Emotion and Mood Recognition

Methods: Attention Is All You Need · Softmax