Always Skip Attention

Yiping Ji; Hemanth Saratchandran; Peyman Moghadam; Simon Lucey

arXiv:2505.01996 · cs.LG · August 20, 2025

TL;DR

This paper reveals that self-attention in Vision Transformers requires skip connections for effective training due to its ill-conditioned nature, and introduces Token Graying as a complementary technique to enhance training stability.

Contribution

It provides a theoretical analysis of why skip connections are essential for self-attention in ViTs and proposes Token Graying to improve input conditioning.

Findings

01

Self-attention fails to train without skip connections.

02

Token Graying improves training stability and performance.

03

The critical dependence on skip connections is relatively new: earlier deep architectures such as CNNs perform well without them.

Abstract

We highlight a curious empirical result within modern Vision Transformers (ViTs). Specifically, self-attention catastrophically fails to train unless it is used in conjunction with a skip connection. This is in contrast to other elements of a ViT that continue to exhibit good performance (albeit suboptimal) when skip connections are removed. Further, we show that this critical dependence on skip connections is a relatively new phenomenon, with previous deep architectures (e.g., CNNs) exhibiting good performance in their absence. In this paper, we theoretically characterize that the self-attention mechanism is fundamentally ill-conditioned and is, therefore, uniquely dependent on skip connections for regularization. Additionally, we propose Token Graying, a simple yet effective complement (to skip connections) that further improves the condition of input tokens. We validate our approach…
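The conditioning claim in the abstract can be probed numerically. The following is a minimal sketch, not the authors' code or experimental setup: it builds a single-head self-attention block with random Gaussian weights and compares the condition number (largest over smallest singular value) of the block output with and without a skip connection. The function names, the token count, the embedding dimension, and the weight initialization are all assumptions chosen for illustration, and the printed values depend on the random draw.

```python
import numpy as np


def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention over tokens x (n_tokens x dim).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v


def condition_number(m):
    # Ratio of the largest to the smallest singular value of m.
    s = np.linalg.svd(m, compute_uv=False)
    return s[0] / s[-1]


def compare_conditioning(n_tokens=32, dim=16, seed=0):
    # Hypothetical setup: random tokens and random projection weights (assumed,
    # not taken from the paper).
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_tokens, dim))
    w_q, w_k, w_v = (
        rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)
    )
    attn_out = self_attention(x, w_q, w_k, w_v)
    # Return the condition number without and with the skip connection.
    return condition_number(attn_out), condition_number(x + attn_out)


if __name__ == "__main__":
    no_skip, with_skip = compare_conditioning()
    print(f"cond(SA(X))     = {no_skip:.2f}")
    print(f"cond(X + SA(X)) = {with_skip:.2f}")
```

Varying `seed`, `n_tokens`, and `dim` gives a rough feel for how the two quantities behave on random inputs. It is only a toy probe and says nothing about trained models or about Token Graying itself.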

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · Visual Attention and Saliency Detection