TL;DR
This paper shows that attention sinks in Transformers can cause gradient concentration under causal masking, interprets massive activations as adaptive regulators of that gradient pressure during training, and introduces V-scale to modulate the effect.
Contribution
It provides a novel theoretical and empirical analysis of gradient sinks and massive activations, linking them to attention sinks and RMSNorm in Transformers.
Findings
Attention sinks induce gradient concentration under causal masking.
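Massive activations act as adaptive regulators of this gradient pressure, because the RMSNorm Jacobian attenuates gradients roughly in inverse proportion to the input norm.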
Modulating gradients with V-scale suppresses massive activations while preserving attention sinks.
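The first finding can be illustrated with a toy calculation. The sketch below is not the paper's derivation. It assumes one simple route to gradient concentration: with attention output o = A V, the gradient with respect to the values is Aᵀ (∂L/∂o), so a token that every later query attends to accumulates gradient from all of them. The function names, the uniform spread of non-sink attention, and the random upstream gradients are all illustrative assumptions.

```python
import numpy as np

def causal_attention_with_sink(T, sink_mass):
    """Toy causal attention matrix (hypothetical construction).

    Query t places `sink_mass` on token 0 and spreads the remainder
    uniformly over tokens 1..t. Entries above the diagonal stay zero,
    which is the causal mask.
    """
    A = np.zeros((T, T))
    A[0, 0] = 1.0  # the first query can only see itself
    for t in range(1, T):
        A[t, 0] = sink_mass
        A[t, 1:t + 1] = (1.0 - sink_mass) / t
    return A

def value_grad_norms(A, upstream):
    """Per-token gradient norm on the values for o = A @ V.

    dL/dV = A.T @ dL/do, so column s of A decides how much of each
    query's gradient lands on token s.
    """
    grad_V = A.T @ upstream
    return np.linalg.norm(grad_V, axis=1)

rng = np.random.default_rng(0)
T, d = 64, 8
A = causal_attention_with_sink(T, sink_mass=0.8)
upstream = rng.normal(size=(T, d))  # stand-in for dL/do, assumed random
norms = value_grad_norms(A, upstream)
print("gradient norm at sink token:", norms[0])
print("largest gradient norm elsewhere:", norms[1:].max())
```

In this toy setting the sink token receives a weighted sum of every query's gradient, while any other token only sees a thin slice from the queries that come after it.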
Abstract
Attention sinks and massive activations are recurring and closely related phenomena in Transformer models. Existing explanations have largely focused on the forward pass, yet in pre-norm Transformers, large residual-stream norms play only an indirect forward role because sublayers operate on normalized inputs. We study this relationship from the perspective of backpropagation. Empirically and theoretically, we show that under causal masking, attention sinks can induce pronounced gradient concentration, which we term gradient sinks. Since the RMSNorm Jacobian attenuates gradients roughly in inverse proportion to input norm, massive activations can be understood as adaptive regulators of this localized gradient pressure during training. This interpretation predicts that attenuating sink-induced gradients should weaken massive activations. We test this prediction with V-scale, a…
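The claim that the RMSNorm Jacobian attenuates gradients roughly in inverse proportion to the input norm can be checked numerically. The following is a minimal sketch, assuming the standard RMSNorm form y = gain · x / sqrt(mean(x²) + eps). The names `rmsnorm`, `rmsnorm_input_grad`, `gain`, and `eps` are chosen here for illustration and do not come from the paper.

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-6):
    """Standard RMSNorm on a single vector (assumed form)."""
    rms = np.sqrt(np.mean(x * x) + eps)
    return gain * x / rms

def rmsnorm_input_grad(x, gain, upstream, eps=1e-6):
    """Gradient of sum(upstream * rmsnorm(x)) with respect to x.

    Derived by hand from y_i = gain_i * x_i / rms:
    dL/dx = (gu - x * (x . gu) / (d * rms^2)) / rms, with gu = gain * upstream.
    The leading 1/rms factor is what shrinks gradients for large inputs.
    """
    d = x.shape[0]
    rms = np.sqrt(np.mean(x * x) + eps)
    gu = gain * upstream
    return (gu - x * np.dot(x, gu) / (d * rms ** 2)) / rms

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=d)
gain = np.ones(d)
upstream = rng.normal(size=d)  # stand-in for the gradient arriving from the sublayer

# Scale the same direction up and watch the input gradient shrink.
for scale in (1.0, 10.0, 100.0):
    g = rmsnorm_input_grad(scale * x, gain, upstream)
    print(f"input scale {scale:6.1f} -> gradient norm {np.linalg.norm(g):.6f}")
```

Scaling the input by a factor c leaves the normalized output essentially unchanged but divides the gradient flowing back into the residual stream by about c, which is consistent with reading a massive activation as a way to damp gradient arriving at that token.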
