Attention Sinks Induce Gradient Sinks: Massive Activations as Gradient Regulators in Transformers

Yihong Chen; Zhouchen Lin; Quanming Yao

arXiv:2603.17771 · cs.LG · May 7, 2026

TL;DR

Under causal masking, attention sinks in Transformers can induce pronounced gradient concentration, which the authors call gradient sinks. Massive activations act as adaptive regulators of this gradient pressure during training, and the paper introduces V-scale to modulate the effect.

Contribution

The paper gives a theoretical and empirical analysis of gradient sinks and massive activations from the perspective of backpropagation, linking both to attention sinks and to the RMSNorm Jacobian in pre-norm Transformers.

Findings

01  Attention sinks induce gradient concentration under causal masking.

02  Massive activations act as regulators of gradient pressure via RMSNorm.

03  Modulating gradients with V-scale suppresses massive activations while preserving attention sinks.
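
One way to see why a sink can concentrate gradients under causal masking is the backward pass of plain softmax attention. This is a minimal sketch and not necessarily the paper's own derivation. The output at query t is o_t = sum_j a[t][j] * v_j, so the gradient reaching value v_j is sum_t a[t][j] * dL/do_t, and a key that every later query attends to strongly collects contributions from all of them. The toy logits, the +4.0 sink bias, and the identical upstream gradients are illustrative assumptions.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where query t only sees keys 0..t."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return weights

def value_grad_norms(weights, grad_out):
    """L2 norm of dL/dv_j = sum_t weights[t][j] * grad_out[t], per key j."""
    dim = len(grad_out[0])
    norms = []
    for j in range(len(weights)):
        grad_v = [sum(weights[t][j] * grad_out[t][k] for t in range(len(weights)))
                  for k in range(dim)]
        norms.append(math.sqrt(sum(g * g for g in grad_v)))
    return norms

seq_len = 6
# Hypothetical logits: key 0 gets a +4.0 bias from every query (the "sink").
scores = [[4.0 if j == 0 else 0.0 for j in range(seq_len)] for _ in range(seq_len)]
# Assumption: every query position receives the same upstream gradient.
grad_out = [[1.0, 0.0] for _ in range(seq_len)]

weights = causal_softmax(scores)
norms = value_grad_norms(weights, grad_out)
for j, n in enumerate(norms):
    print(f"key {j}: |dL/dv| = {n:.3f}")
```

In this toy setting the gradient norm at key 0 exceeds the combined norm at all other keys, because the causal mask lets every query contribute to position 0 while later keys are visible to fewer queries and receive little attention.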

Abstract

Attention sinks and massive activations are recurring and closely related phenomena in Transformer models. Existing explanations have largely focused on the forward pass, yet in pre-norm Transformers, large residual-stream norms play only an indirect forward role because sublayers operate on normalized inputs. We study this relationship from the perspective of backpropagation. Empirically and theoretically, we show that under causal masking, attention sinks can induce pronounced gradient concentration, which we term gradient sinks. Since the RMSNorm Jacobian attenuates gradients roughly in inverse proportion to input norm, massive activations can be understood as adaptive regulators of this localized gradient pressure during training. This interpretation predicts that attenuating sink-induced gradients should weaken massive activations. We test this prediction with V-scale, a…
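
The inverse-norm attenuation mentioned in the abstract can be checked on a toy RMSNorm. The sketch below assumes RMSNorm without a learned gain and with eps = 0, so y = x / rms(x); the vectors and the helper names `rmsnorm_backward` and `l2_norm` are hypothetical and are not taken from the paper or its repository. For this form the backward pass is grad_x = g / r - x * (x · g) / (d * r^3) with r = rms(x), so multiplying x by a constant c divides the returned gradient by c.

```python
import math

def rmsnorm_backward(x, grad_out, eps=0.0):
    """Gradient w.r.t. x of y = x / sqrt(mean(x^2) + eps), given dL/dy."""
    d = len(x)
    rms = math.sqrt(sum(v * v for v in x) / d + eps)
    dot = sum(xi * gi for xi, gi in zip(x, grad_out))
    return [gi / rms - xi * dot / (d * rms ** 3) for xi, gi in zip(x, grad_out)]

def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))

# Hypothetical residual-stream vector and upstream gradient.
x = [0.5, -1.0, 2.0, 0.25]
g = [1.0, 0.5, -0.5, 2.0]

base = l2_norm(rmsnorm_backward(x, g))
ratios = {}
for scale in (1.0, 10.0, 100.0):
    scaled_x = [scale * v for v in x]
    # Ratio of the gradient norm after scaling the input to the unscaled one.
    ratios[scale] = l2_norm(rmsnorm_backward(scaled_x, g)) / base
    print(f"input scale {scale:>5}: gradient norm ratio = {ratios[scale]:.4f}")
```

With a nonzero eps the relation is only approximate for small inputs, which is consistent with the abstract's "roughly in inverse proportion" wording.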

Code & Models

Repositories

https://anonymous.4open.science/r/GradientSinkCode-B309
