On the Role of Attention Masks and LayerNorm in Transformers

Xinyi Wu; Amir Ajorlou; Yifei Wang; Stefanie Jegelka; Ali Jadbabaie

arXiv:2405.18781 · cs.LG · November 4, 2024

TL;DR

This paper analyzes how attention masks and LayerNorm affect rank collapse in transformers. It shows that sparse or local masks slow the collapse and that LayerNorm, with suitable value matrices, can prevent exponential collapse, which helps preserve model expressivity as depth grows.

Contribution

It provides a general analysis of rank collapse under self-attention that accounts for attention masks and LayerNorm, two components that earlier work on rank collapse has mostly overlooked, and shows how each can slow or prevent collapse.

Findings

01. Sparse or local attention slows rank collapse.

02. LayerNorm can prevent exponential rank collapse with proper value matrices.

03. Transformers with LayerNorm have diverse equilibrium states.

Abstract

Self-attention is the key mechanism of transformers, which are the essential building blocks of modern foundation models. Recent studies have shown that pure self-attention suffers from an increasing degree of rank collapse as depth increases, limiting model expressivity and further utilization of model depth. The existing literature on rank collapse, however, has mostly overlooked other critical components in transformers that may alleviate the rank collapse issue. In this paper, we provide a general analysis of rank collapse under self-attention, taking into account the effects of attention masks and layer normalization (LayerNorm). In particular, we find that although pure masked attention still suffers from exponential collapse to a rank one subspace, sparse or local masked attention can provably slow down the collapse rate. In the case of self-attention with LayerNorm, we first…
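
The collapse described in the abstract can be illustrated with a small numerical sketch. This is a toy setup and not the paper's experiment. Several choices are assumptions made only for illustration: the helper names (`softmax`, `residual`, `run`), the value matrix fixed to the identity, fresh random query and key matrices drawn at every layer, and the sizes (16 tokens, dimension 8, 20 layers). Collapse is measured as the Frobenius distance between the token matrix and the nearest matrix whose rows are all identical, so a value near zero means the tokens have collapsed to a rank-one configuration. The sketch contrasts a full attention mask with a local mask in which each token attends only to itself and its immediate neighbors.

```python
import numpy as np

def softmax(scores):
    # Numerically stable row-wise softmax; entries set to -inf get zero weight.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def residual(X):
    # Frobenius distance from X to the nearest matrix with identical rows
    # (that nearest matrix repeats the column-wise mean of X in every row).
    return np.linalg.norm(X - X.mean(axis=0, keepdims=True))

def run(X, mask, num_layers, seed):
    # Stack of pure self-attention layers (no residual connection, no LayerNorm).
    # Assumption for this sketch: the value matrix is the identity and the
    # query/key matrices are redrawn at random for every layer.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    history = [residual(X)]
    for _ in range(num_layers):
        W_q = rng.normal(scale=d ** -0.5, size=(d, d))
        W_k = rng.normal(scale=d ** -0.5, size=(d, d))
        scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d)
        scores = np.where(mask, scores, -np.inf)  # masked pairs get zero attention
        X = softmax(scores) @ X
        history.append(residual(X))
    return history

n, d, num_layers = 16, 8, 20
X0 = np.random.default_rng(0).normal(size=(n, d))

idx = np.arange(n)
full_mask = np.ones((n, n), dtype=bool)                   # every token sees every token
local_mask = np.abs(idx[:, None] - idx[None, :]) <= 1     # self plus immediate neighbors

full_hist = run(X0, full_mask, num_layers, seed=1)
local_hist = run(X0, local_mask, num_layers, seed=1)

for layer in (0, 5, 10, 20):
    print(f"layer {layer:2d}  full: {full_hist[layer]:.3e}  local: {local_hist[layer]:.3e}")
```

In this toy setting the residual under the full mask shrinks toward zero within a few layers, while the residual under the local mask decays more slowly because information has to diffuse along the chain of neighboring tokens. The sketch only illustrates the qualitative behavior; it does not reproduce the rates proved in the paper.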

Videos

On the Role of Attention Masks and LayerNorm in Transformers · SlidesLive

Taxonomy

Topics: Visual perception and processing mechanisms · Infrared Target Detection Methodologies

Methods: Sparse Evolutionary Training · Layer Normalization