On the Expressivity Role of LayerNorm in Transformers' Attention

Shaked Brody; Uri Alon; Eran Yahav

arXiv:2305.02582 · cs.LG · May 12, 2023 · 1 cites

TL;DR

This paper shows that LayerNorm contributes to the expressivity of the Transformer attention layer that follows it. LayerNorm projects the input vectors onto a hyperplane and scales them to a common norm, and the authors report that Transformers benefit from both properties in language modeling and in computing simple functions.

Contribution

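It provides a geometric interpretation of LayerNorm and argues that both of its components matter for the attention layer that follows: projection lets a query attend to all keys uniformly, and scaling avoids "unselectable" keys. This goes beyond the common view that LayerNorm only normalizes activations and gradients.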

Findings

01  LayerNorm's projection component allows uniform attention to all keys.

02  Scaling in LayerNorm prevents keys from being unselectable.

03  Transformers benefit from LayerNorm's properties in language modeling and simple functions.
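
The first two findings can be illustrated with a small NumPy sketch. It is not the authors' code: it assumes LayerNorm without learnable gain or bias and with epsilon set to 0, and it uses the normalized vectors directly as keys (identity key projection). The names `normalize`, `q_uniform`, and `selected` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def normalize(X):
    """Row-wise LayerNorm (assumption: no gain/bias, epsilon = 0)."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)  # population std (ddof=0)
    return (X - mean) / std

rng = np.random.default_rng(0)
n, d = 5, 8
# Assumption: keys are the normalized vectors themselves (identity projection).
keys = normalize(rng.normal(size=(n, d)))

# Finding 01: every key is orthogonal to the all-ones vector, so a query
# along that vector gives every key the same score (zero), and softmax
# then spreads the attention evenly.
q_uniform = 3.0 * np.ones(d)
uniform_weights = softmax(keys @ q_uniform)

# Finding 02: all keys share the same norm, so using key i itself as the
# query gives key i the strictly largest score (Cauchy-Schwarz).
selected = [int(np.argmax(keys @ k)) for k in keys]
```

Under these assumptions, `uniform_weights` is the same for every key and `selected` lists each key's own index, meaning every key can win the attention for some query.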

Abstract

Layer Normalization (LayerNorm) is an inherent component in all Transformer-based models. In this paper, we show that LayerNorm is crucial to the expressivity of the multi-head attention layer that follows it. This is in contrast to the common belief that LayerNorm's only role is to normalize the activations during the forward pass, and their gradients during the backward pass. We consider a geometric interpretation of LayerNorm and show that it consists of two components: (a) projection of the input vectors to a $(d-1)$-dimensional space that is orthogonal to the $[1, 1, ..., 1]$ vector, and (b) scaling of all vectors to the same norm of $\sqrt{d}$. We show that each of these components is important for the attention layer that follows it in Transformers: (a) projection allows the attention mechanism to create an attention query that attends to all keys equally, offloading the need to learn…
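
The two-component view in the abstract can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes LayerNorm without learnable gain or bias and with epsilon set to 0, and the function names `layer_norm` and `project_then_scale` are hypothetical.

```python
import numpy as np

def layer_norm(x):
    """Standard LayerNorm formula (assumption: no gain/bias, epsilon = 0)."""
    return (x - x.mean()) / x.std()  # np.std uses the population std (ddof=0)

def project_then_scale(x):
    """Geometric view: project away the all-ones direction, then rescale."""
    d = x.shape[0]
    ones = np.ones(d)
    # (a) Projection onto the hyperplane orthogonal to [1, 1, ..., 1].
    projected = x - (x @ ones) / d * ones
    # (b) Scaling so that the result has norm sqrt(d).
    return np.sqrt(d) * projected / np.linalg.norm(projected)

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)

ln = layer_norm(x)
geo = project_then_scale(x)
```

Here `ln` and `geo` agree up to floating-point error, the entries of `ln` sum to zero (orthogonality to the all-ones vector), and its Euclidean norm equals $\sqrt{d}$.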

Code & Models

Repositories

tech-srl/layer_norm_expressivity_role
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Natural Language Processing Techniques

Methods: Softmax · Linear Layer