Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention

Jeongin Bae; Baeseong Park; Gunho Park; Minsub Kim; Joonhyung Lee; Junhee Yoo; Sunghyeon Woo; Jiwon Ryu; Se Jung Kwon; Dongsoo Lee

arXiv:2602.23057 · cs.CL · February 27, 2026

TL;DR

This paper introduces Affine-Scaled Attention, a modification to standard transformer attention that allows for input-dependent scaling and bias, improving training stability and performance in language models.

Contribution

The paper presents a novel attention mechanism that relaxes softmax normalization constraints, enabling better control over attention magnitudes and enhancing model training and performance.

Findings

1. Improved training stability in large-scale language models.
2. Enhanced downstream task performance.
3. Consistent benefits over standard softmax attention.

Abstract

Transformer attention is typically implemented using softmax normalization, which constrains the attention weights for each query to sum to one. While effective in many settings, this constraint can limit flexibility in controlling attention magnitudes and may contribute to overly concentrated or unstable attention patterns during training. Prior work has explored modifications such as attention sinks or gating mechanisms, but these approaches provide only limited or indirect control over attention reweighting. We propose Affine-Scaled Attention, a simple extension to standard attention that introduces input-dependent scaling and a corresponding bias term applied to softmax-normalized attention weights. This design relaxes the strict normalization constraint while maintaining aggregation of value representations, allowing the model to adjust both the relative distribution and the scale of…
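
The abstract describes an affine transform (a scale and a bias) applied to the softmax-normalized attention weights, but the excerpt above is truncated and does not give the exact parameterization. The following is a minimal single-head NumPy sketch of the general idea under stated assumptions, not the authors' implementation: the function name `affine_scaled_attention`, the projection vectors `w_scale` and `w_bias`, the sigmoid-based scale in (0, 2), and the per-query (rather than per-key or per-head) granularity are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def affine_scaled_attention(q, k, v, w_scale, w_bias):
    """Single-head attention with an affine transform on the softmax weights.

    q, k, v: arrays of shape (seq_len, d).
    w_scale, w_bias: arrays of shape (d,). ASSUMPTION: hypothetical learned
    vectors that map each query to a scalar scale and a scalar bias.
    """
    d = q.shape[-1]
    probs = softmax(q @ k.T / np.sqrt(d), axis=-1)  # each row sums to 1

    # ASSUMPTION: input-dependent scale in (0, 2); equals 1 when w_scale is zero.
    scale = 2.0 / (1.0 + np.exp(-(q @ w_scale)))    # shape (seq_len,)
    # ASSUMPTION: input-dependent bias added uniformly to every key position.
    bias = q @ w_bias                               # shape (seq_len,)

    # Affine transform of the normalized weights; rows no longer need to sum to 1.
    weights = scale[:, None] * probs + bias[:, None]
    return weights @ v, weights
```

With `w_scale` and `w_bias` set to zero, the scale is 1 and the bias is 0, so the sketch reduces to standard softmax attention. Otherwise each row of `weights` sums to `scale + seq_len * bias` instead of 1, which is one way to read "relaxes the strict normalization constraint" while still aggregating value vectors with a weighted sum.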

Code & Models

Models

🤗 KitsuVp/NeoLLM · model · 2.9k downloads · ♡ 1

Taxonomy

Topics: Big Data and Digital Economy · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning