Controlling changes to attention logits

Ben Anson; Laurence Aitchison

arXiv:2511.21377 · cs.LG · November 27, 2025

TL;DR

This paper proposes assigning parameter-dependent learning rates to the query and key weights in order to control changes to attention logits. The aim is more stable transformer training, especially in settings where QK normalization cannot be applied, such as Multi Latent Attention (MLA).

Contribution

The paper introduces a simple, cheap intervention that stabilizes attention logits by assigning parameter-dependent learning rates to the query and key weights. It outperforms other methods in the MLA setting, where QK norm is not compatible.

Findings

01 Allows a higher base learning rate for the network

02 Outperforms other methods in the MLA setting

03 Achieves competitive performance in the multi-head attention setting

Abstract

Stability of neural network weights is critical when training transformer models. The query and key weights are particularly problematic, as they tend to grow large without any intervention. Applying normalization to queries and keys, known as "QK norm", fixes stability issues in practice, but is not always applicable. For example, QK norm is not compatible with Multi Latent Attention (MLA) because QK norm requires full materialization of queries and keys during inference, which is not done in MLA. In this paper we suggest that controlling the changes to logits is important for stability. We show that these changes are controllable by assigning parameter-dependent learning rates to the query and key weights. We find that our cheap intervention allows us to increase the base learning rate of the network, outperform other methods in the MLA setting, and achieve performance competitive…
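
To make the idea of a parameter-dependent learning rate concrete, here is a minimal pure-Python sketch. It is not the paper's rule, which the abstract does not spell out. It assumes a single attention logit of the form (W_Q x_q)·(W_K x_k)/sqrt(d), and a hypothetical scaling `scaled_lr` that divides the base learning rate for W_Q by the Frobenius norm of W_K. All names, matrices, and the gradient below are made up for illustration.

```python
import math


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def frobenius_norm(w):
    return math.sqrt(sum(v * v for row in w for v in row))


def logit(w_q, w_k, x_q, x_k):
    """Scaled dot-product attention logit for one query/key pair."""
    q = matvec(w_q, x_q)
    k = matvec(w_k, x_k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def scaled_lr(base_lr, other_weight):
    # ASSUMPTION: illustrative rule only, not taken from the paper.
    # The step on W_Q shrinks as W_K grows, because the logit change
    # caused by a step on W_Q is proportional to W_K.
    return base_lr / frobenius_norm(other_weight)


def logit_change(w_q, w_k, grad_q, lr, x_q, x_k):
    """Change in the logit after one gradient step on W_Q only."""
    new_w_q = [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(w_q, grad_q)
    ]
    return logit(new_w_q, w_k, x_q, x_k) - logit(w_q, w_k, x_q, x_k)


# Toy values, chosen arbitrarily.
w_q = [[0.5, -0.2], [0.1, 0.3]]
w_k = [[0.4, 0.1], [-0.3, 0.2]]
grad_q = [[1.0, 0.0], [0.0, 1.0]]
x_q, x_k = [1.0, 2.0], [0.5, -1.0]
base_lr = 0.1

# Same key weights, ten times larger.
big_w_k = [[10 * v for v in row] for row in w_k]

fixed_small = logit_change(w_q, w_k, grad_q, base_lr, x_q, x_k)
fixed_big = logit_change(w_q, big_w_k, grad_q, base_lr, x_q, x_k)
scaled_small = logit_change(
    w_q, w_k, grad_q, scaled_lr(base_lr, w_k), x_q, x_k
)
scaled_big = logit_change(
    w_q, big_w_k, grad_q, scaled_lr(base_lr, big_w_k), x_q, x_k
)

print("fixed lr: ", fixed_small, fixed_big)
print("scaled lr:", scaled_small, scaled_big)
```

In this toy setup a fixed learning rate lets the logit change grow in proportion to the key weights, while the assumed scaling keeps the change the same size regardless of how large W_K has become. A symmetric rule would apply to W_K using the norm of W_Q.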

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Advanced Graph Neural Networks