Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

Yihe Dong; Jean-Baptiste Cordonnier; Andreas Loukas

arXiv:2103.03404 · cs.LG · August 2, 2023 · 71 cites

TL;DR

This paper shows that stacking pure self-attention layers, without skip connections or MLPs, drives the output toward a rank-1 matrix doubly exponentially fast with depth, and that skip connections and MLPs counteract this degeneration.

Contribution

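The work decomposes the output of a self-attention network into a sum of smaller terms, each involving a sequence of attention heads across layers. Using this decomposition, it proves that, without skip connections or MLPs, the output converges to a rank-1 matrix ("token uniformity"), which helps explain why skip connections and MLPs matter in transformers.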

Findings

01

Pure attention converges doubly exponentially to rank-1 matrices.

02

Skip connections and MLPs prevent output degeneration.

03

Experiments verify the convergence phenomena on different variants of standard transformer architectures.

Abstract

Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention possesses a strong inductive bias towards "token uniformity". Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the identified convergence phenomena on different variants of standard transformer architectures.
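
The token-uniformity effect described in the abstract can be illustrated with a small numerical sketch. This is not the authors' code or experimental setup: it assumes a single attention head per layer, random untrained Gaussian weights, and a Frobenius-norm relative residual (distance from the nearest matrix with identical rows), which is a simpler measure than the norm used in the paper. All function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def relative_residual(x):
    """||X - 1 x_bar^T||_F / ||X||_F, where x_bar is the mean token (row).

    The value is 0 exactly when all tokens are identical, i.e. X = 1 x_bar^T.
    Assumption: Frobenius norm, chosen here for simplicity.
    """
    n = len(x)
    mean = [sum(col) / n for col in zip(*x)]
    res = sum((v - m) ** 2 for row in x for v, m in zip(row, mean))
    total = sum(v * v for row in x for v in row)
    return math.sqrt(res / total)


def random_matrix(rng, rows, cols, scale):
    """Matrix with i.i.d. Gaussian entries of standard deviation `scale`."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def attention_layer(x, w_q, w_k, w_v):
    """Single-head self-attention: softmax(Q K^T / sqrt(d)) V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d = len(w_q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def residual_by_depth(depth=8, n_tokens=8, d_model=16, skip=False, seed=0):
    """Relative residual of the input and of the output of each layer."""
    rng = random.Random(seed)
    x = random_matrix(rng, n_tokens, d_model, 1.0)
    history = [relative_residual(x)]
    for _ in range(depth):
        # Fresh random (untrained) weights for every layer.
        w_q, w_k, w_v = (
            random_matrix(rng, d_model, d_model, d_model ** -0.5) for _ in range(3)
        )
        out = attention_layer(x, w_q, w_k, w_v)
        if skip:
            # Skip connection: add the layer input back to the attention output.
            x = [[a + b for a, b in zip(r, o)] for r, o in zip(x, out)]
        else:
            x = out
        history.append(relative_residual(x))
    return history


pure = residual_by_depth(skip=False)
with_skip = residual_by_depth(skip=True)
for layer, (p, s) in enumerate(zip(pure, with_skip)):
    print(f"layer {layer}: pure={p:.3e}  skip={s:.3e}")
```

Both runs share the same seed, so they start from the same input and use the same weights; the only difference is the skip connection. Comparing the two printed columns gives a rough view of how quickly the tokens of the pure-attention stack become identical relative to the stack with skip connections.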

Code & Models

Repositories

twistedcubic/attention-rank-collapse
PyTorch · Official

Videos

Attention is not all you need: pure attention loses rank doubly exponentially with depth · SlidesLive

Taxonomy

Topics: Neural Networks and Applications · Neural Networks and Reservoir Computing · Advanced Memory and Neural Computing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax