On the Existence and Behavior of Secondary Attention Sinks
Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu, Yiren Zhao

TL;DR
This paper investigates secondary attention sinks in neural networks, revealing their formation, properties, and impact across multiple models, and distinguishes them from primary sinks like the BOS token.
Contribution
It introduces the concept of secondary sinks, analyzes their formation in middle layers, and explores their influence on attention mechanisms across various model architectures.
Findings
Secondary sinks are formed by specific middle-layer MLP modules.
The sink score is determined by the $\ell_2$-norm of the MLP output vectors.
Larger models exhibit more deterministic and frequent sink levels.
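The two quantities named in these findings, the attention a token receives and the $\ell_2$-norm of an MLP output vector, can be illustrated with a minimal sketch. This is not the paper's procedure: the function names, the averaging over all query positions, the threshold of 0.15, and the toy numbers are all assumptions made for illustration, and the paper's exact sink-score definition is not given in this summary.

```python
import math


def attention_received(attn):
    """Mean attention mass each key position receives, averaged over queries.

    attn[q][k] is the attention weight from query q to key k; each row is
    assumed to sum to 1. Averaging over all queries is an illustrative choice.
    """
    n_queries = len(attn)
    n_keys = len(attn[0])
    return [sum(attn[q][k] for q in range(n_queries)) / n_queries
            for k in range(n_keys)]


def l2_norms(vectors):
    """l2-norm of each vector, e.g. one MLP output vector per token."""
    return [math.sqrt(sum(x * x for x in v)) for v in vectors]


def flag_sinks(scores, threshold):
    """Indices whose score exceeds a (hypothetical) threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Toy causal attention map for 4 tokens (made-up numbers, not from the paper).
toy_attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.7, 0.3, 0.0, 0.0],
    [0.5, 0.1, 0.4, 0.0],
    [0.4, 0.05, 0.35, 0.2],
]
# Toy per-token MLP outputs (also made up).
toy_mlp_out = [[3.0, 4.0], [0.3, 0.4], [1.5, 2.0], [0.1, 0.0]]

received = attention_received(toy_attn)
norms = l2_norms(toy_mlp_out)
candidate_sinks = flag_sinks(received, threshold=0.15)
```

In this toy example, position 0 plays the role of a BOS-like token that draws most of the attention, while position 2 draws a smaller but still noticeable share and has the second-largest output norm. Whether norm and attention mass track each other this way in a real model is what the finding above asserts; the sketch only shows how the two quantities would be computed.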
Abstract
Attention sinks are tokens, often the beginning-of-sequence (BOS) token, that receive disproportionately high attention despite limited semantic relevance. In this work, we identify a class of attention sinks, which we term secondary sinks, that differ fundamentally from the sinks studied in prior work, which we term primary sinks. While prior work has found that tokens other than BOS can sometimes become sinks, those tokens exhibit properties analogous to the BOS token: they emerge at the same layer, persist throughout the network, and draw a large amount of attention mass. In contrast, we find secondary sinks that arise primarily in middle layers, persist for a variable number of layers, and draw a smaller, but still significant, amount of attention mass. Through extensive experiments across 11 model families, we analyze where these…
Taxonomy
Topics: Visual Attention and Saliency Detection · Mind wandering and attention · Big Data and Digital Economy
