Always Skip Attention
Yiping Ji, Hemanth Saratchandran, Peyman Moghadam, Simon Lucey

TL;DR
This paper reveals that self-attention in Vision Transformers requires skip connections for effective training due to its ill-conditioned nature, and introduces Token Graying as a complementary technique to enhance training stability.
Contribution
It provides a theoretical analysis of why skip connections are essential for self-attention in ViTs and proposes Token Graying to improve input conditioning.
Findings
Self-attention fails to train without skip connections.
Token Graying improves training stability and performance.
The critical dependence on skip connections is a relatively recent phenomenon: earlier deep architectures such as CNNs still perform well without them.
Abstract
We highlight a curious empirical result within modern Vision Transformers (ViTs). Specifically, self-attention catastrophically fails to train unless it is used in conjunction with a skip connection. This is in contrast to other elements of a ViT that continue to exhibit good performance (albeit suboptimal) when skip connections are removed. Further, we show that this critical dependence on skip connections is a relatively new phenomenon, with previous deep architectures (e.g., CNNs) exhibiting good performance in their absence. In this paper, we show theoretically that the self-attention mechanism is fundamentally ill-conditioned and is, therefore, uniquely dependent on skip connections for regularization. Additionally, we propose Token Graying, a simple yet effective complement to skip connections that further improves the conditioning of input tokens. We validate our approach…
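The conditioning claim in the abstract can be illustrated with a small numerical sketch. This is not the authors' code or experimental setup: the single-head attention layer, the token count, the embedding width, and the small random weight scale (0.02) are all assumptions chosen for illustration. The sketch compares the condition number (largest over smallest singular value) of a self-attention output with and without the input added back as a skip connection.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax, shifted by the row max for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Single-head self-attention with no skip connection (illustrative only).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return attn @ v

def condition_number(m):
    # Ratio of the largest to the smallest singular value of m.
    s = np.linalg.svd(m, compute_uv=False)
    return s[0] / s[-1]

# Hypothetical setup: 32 random tokens of width 16, small random weights.
rng = np.random.default_rng(0)
n_tokens, dim = 32, 16
x = rng.standard_normal((n_tokens, dim))
w_q, w_k, w_v = (0.02 * rng.standard_normal((dim, dim)) for _ in range(3))

out = self_attention(x, w_q, w_k, w_v)
cond_no_skip = condition_number(out)        # attention output alone
cond_with_skip = condition_number(x + out)  # output plus skip connection

print(f"condition number without skip: {cond_no_skip:.3e}")
print(f"condition number with skip:    {cond_with_skip:.3e}")
```

Under these assumptions the attention weights are close to uniform, so every output token is nearly the same average of the value vectors and the output matrix is close to rank one. Adding the input back restores the spread of singular values, which is one way to read the abstract's statement that skip connections act as a regularizer for self-attention.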
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · Visual Attention and Saliency Detection
