On the Expressivity Role of LayerNorm in Transformers' Attention
Shaked Brody, Uri Alon, Eran Yahav

TL;DR
This paper shows that LayerNorm increases the expressivity of the attention layer that follows it in Transformers: it projects the input vectors and scales them to a common norm, and the authors report that both effects help in language modeling and in computing simple functions.
Contribution
The paper gives a geometric interpretation of LayerNorm and shows that its two components let the attention mechanism attend to all keys uniformly and keep keys from being unselectable. This role goes beyond the commonly assumed one of normalizing activations and gradients.
Findings
LayerNorm's projection component allows uniform attention to all keys.
Scaling in LayerNorm prevents keys from being unselectable.
Transformers benefit from both properties in language modeling and in computing simple functions.
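The first two findings can be illustrated with a small sketch in plain Python. It is not the authors' code: the helper names (`center`, `rescale`, `argmax_winners`), the toy keys, and the one-degree query sweep are all assumptions made for illustration. Part 1 applies only the projection step and uses a query along the all-ones direction. Part 2 applies only the scaling step, in two dimensions, to a key that lies strictly inside the convex hull of the other keys.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def center(x):
    # Projection step: subtract the mean, removing the all-ones component.
    mean = sum(x) / len(x)
    return [v - mean for v in x]

def rescale(x, target_norm):
    # Scaling step: bring the vector to a fixed norm.
    norm = math.sqrt(dot(x, x))
    return [target_norm * v / norm for v in x]

def argmax_winners(keys):
    # Sweep unit queries over 360 directions (hypothetical test grid) and
    # record which key index receives the highest score for each query.
    winners = set()
    for deg in range(360):
        q = [math.cos(math.radians(deg)), math.sin(math.radians(deg))]
        scores = [dot(q, k) for k in keys]
        winners.add(max(range(len(keys)), key=lambda i: scores[i]))
    return winners

# 1) Projection: every centered key is orthogonal to the all-ones vector,
#    so a query along that direction scores all keys equally.
raw_keys = [[1.0, 2.0, 6.0], [-3.0, 0.5, 0.1], [4.0, 4.0, -2.0]]
projected_keys = [center(k) for k in raw_keys]
ones_query = [5.0, 5.0, 5.0]
uniform_weights = softmax([dot(ones_query, k) for k in projected_keys])

# 2) Scaling: key 0 lies strictly inside the convex hull of the other keys,
#    so no query ranks it first until all keys share the same norm.
keys_2d = [[0.3, 0.3], [1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
winners_raw = argmax_winners(keys_2d)
winners_scaled = argmax_winners([rescale(k, math.sqrt(2)) for k in keys_2d])
```

In this toy setup `uniform_weights` comes out at roughly one third per key, index 0 never appears in `winners_raw`, and it does appear in `winners_scaled`.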
Abstract
Layer Normalization (LayerNorm) is an inherent component in all Transformer-based models. In this paper, we show that LayerNorm is crucial to the expressivity of the multi-head attention layer that follows it. This is in contrast to the common belief that LayerNorm's only role is to normalize the activations during the forward pass, and their gradients during the backward pass. We consider a geometric interpretation of LayerNorm and show that it consists of two components: (a) projection of the input vectors to a d−1 space that is orthogonal to the [1, 1, ..., 1] vector, and (b) scaling of all vectors to the same norm of √d. We show that each of these components is important for the attention layer that follows it in Transformers: (a) projection allows the attention mechanism to create an attention query that attends to all keys equally, offloading the need to learn…
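The two-component view of LayerNorm can be checked numerically. The sketch below is a minimal illustration, not the authors' implementation. It assumes LayerNorm without the learnable gain and bias and with the epsilon term set to zero; the function names and the sample vector are hypothetical. Under those assumptions, standard LayerNorm and "project, then scale" give the same output, which is orthogonal to the all-ones vector and has norm √d.

```python
import math

def layer_norm(x):
    # Standard formulation: subtract the mean, divide by the (biased) std.
    # Assumption: no learnable gain/bias and eps = 0.
    d = len(x)
    mean = sum(x) / d
    centered = [v - mean for v in x]
    std = math.sqrt(sum(c * c for c in centered) / d)
    return [c / std for c in centered]

def project_and_scale(x):
    # Geometric formulation of the same map.
    d = len(x)
    # (a) Projection: remove the component of x along the all-ones vector.
    coeff = sum(x) / d  # equals (x . 1) / (1 . 1)
    projected = [v - coeff for v in x]
    # (b) Scaling: rescale the projected vector to norm sqrt(d).
    norm = math.sqrt(sum(p * p for p in projected))
    return [math.sqrt(d) * p / norm for p in projected]

x = [2.0, -1.0, 0.5, 4.0]  # arbitrary sample input with d = 4
y = layer_norm(x)
z = project_and_scale(x)

dot_with_ones = sum(y)                      # inner product with [1, 1, 1, 1]
norm_y = math.sqrt(sum(v * v for v in y))   # should equal sqrt(4)
```

With the learnable gain and bias included, the output is further scaled and shifted elementwise, so the norm and orthogonality properties hold only before those parameters are applied.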
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Natural Language Processing Techniques
Methods: Softmax · Linear Layer
