On the Role of Attention Masks and LayerNorm in Transformers
Xinyi Wu, Amir Ajorlou, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie

TL;DR
This paper analyzes how attention masks and LayerNorm affect rank collapse in transformers. It shows that sparse or local masks slow the collapse and that LayerNorm, with suitable value matrices, can prevent it, which preserves model expressivity as depth increases.
Contribution
It provides a general analysis of rank collapse under self-attention that accounts for attention masks and LayerNorm, showing how each can slow or prevent collapse and thereby preserve expressivity.
Findings
Sparse or local attention slows rank collapse.
LayerNorm can prevent exponential rank collapse with proper value matrices.
Transformers with LayerNorm have diverse equilibrium states.
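The first finding can be illustrated with a small numerical sketch. It is not the paper's experiment: the token count, dimension, seed, weight scale, and the choice to omit value matrices and LayerNorm are all assumptions made here for illustration. The sketch repeatedly applies masked softmax attention to random tokens and tracks the Frobenius distance of the token matrix from its row mean, a simple proxy for how far the tokens are from a rank-one state. It compares a full mask with a local mask in which each token attends only to itself and its immediate neighbors.

```python
import numpy as np

def masked_softmax(scores, mask):
    # Row-wise softmax restricted to the positions allowed by the boolean mask.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def collapse_residual(X):
    # Frobenius distance between X and the matrix whose rows all equal the mean token.
    return np.linalg.norm(X - X.mean(axis=0, keepdims=True))

def run_attention_layers(mask, n_layers=30, d=8, seed=0, weight_scale=0.1):
    # Pure masked attention, X <- A(X) X. Assumptions for this sketch:
    # no value matrix, no LayerNorm, and the same small random query/key
    # weights reused at every layer.
    n = mask.shape[0]
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    W_q = weight_scale * rng.standard_normal((d, d)) / np.sqrt(d)
    W_k = weight_scale * rng.standard_normal((d, d)) / np.sqrt(d)
    history = [collapse_residual(X)]
    for _ in range(n_layers):
        scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d)
        X = masked_softmax(scores, mask) @ X
        history.append(collapse_residual(X))
    return history

n_tokens = 16
idx = np.arange(n_tokens)
full_mask = np.ones((n_tokens, n_tokens), dtype=bool)
local_mask = np.abs(idx[:, None] - idx[None, :]) <= 1  # self plus adjacent tokens

full_history = run_attention_layers(full_mask)
local_history = run_attention_layers(local_mask)
print("full mask:  start %.3e, end %.3e" % (full_history[0], full_history[-1]))
print("local mask: start %.3e, end %.3e" % (local_history[0], local_history[-1]))
```

Under these assumptions the residual for the full mask shrinks to numerical noise within a few layers, while the local mask still collapses but much more slowly, because information has to travel along a chain of neighbors before the tokens can agree.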
Abstract
Self-attention is the key mechanism of transformers, which are the essential building blocks of modern foundation models. Recent studies have shown that pure self-attention suffers from an increasing degree of rank collapse as depth increases, limiting model expressivity and further utilization of model depth. The existing literature on rank collapse, however, has mostly overlooked other critical components in transformers that may alleviate the rank collapse issue. In this paper, we provide a general analysis of rank collapse under self-attention, taking into account the effects of attention masks and layer normalization (LayerNorm). In particular, we find that although pure masked attention still suffers from exponential collapse to a rank one subspace, sparse or local masked attention can provably slow down the collapse rate. In the case of self-attention with LayerNorm, we first…
Taxonomy
Topics: Visual perception and processing mechanisms · Infrared Target Detection Methodologies
Methods: Sparse Evolutionary Training · Layer Normalization
