TL;DR
This paper studies how parameter norms grow during transformer training. As the norm grows, the network approaches a saturated, discretized network with reduced capacity, and this acts as an inductive bias that shapes the network's emergent representations and attention patterns.
Contribution
It documents parameter norm growth empirically in transformer language models and proves that, as the parameters grow in magnitude, the network approximates a saturated network, giving a theoretical account of one implicit inductive bias of gradient descent training.
Findings
Parameter norms grow during training of transformers.
Saturation leads to simplified, discrete network structures.
Different attention heads specialize in local or global computations.
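The first finding refers to the overall magnitude of the parameters. A minimal sketch of how that quantity could be measured, assuming the parameters are available as plain lists of floats (the name global_l2_norm and the toy values are hypothetical, not taken from the paper):

```python
import math

def global_l2_norm(params):
    """l2 norm over every entry of a list of flat parameter vectors."""
    return math.sqrt(sum(w * w for vec in params for w in vec))

# Toy "parameters": two flat vectors standing in for weight tensors.
params = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0, 2.0]]

norm = global_l2_norm(params)

# Scaling every parameter by 3 scales the global norm by 3 as well.
scaled_norm = global_l2_norm([[3.0 * w for w in vec] for vec in params])

print(norm, scaled_norm)
```

Logging this value at intervals during training would give the kind of norm-over-time curve the findings describe.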
Abstract
The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically a variant of gradient descent (GD). To better understand this bias, we study the tendency for transformer parameters to grow in magnitude ($\ell_2$ norm) during training, and its implications for the emergent representations within self attention layers. Empirically, we document norm growth in the training of transformer language models, including T5 during its pretraining. As the parameters grow in magnitude, we prove that the network approximates a discretized network with saturated activation functions. Such "saturated" networks are known to have a reduced capacity compared to the full network family that can be described in terms of formal languages and automata. Our results suggest…
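The saturation effect in the abstract can be illustrated on a single softmax. The sketch below assumes that multiplying the attention scores by a constant c is a reasonable stand-in for growing parameter magnitude; the function names and the score values are illustrative and not taken from the paper. As c grows, the softmax output approaches a uniform distribution over the positions that hold the maximum score.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def saturated_softmax(scores):
    """Limit of softmax(c * scores) as c grows: uniform over the argmax set."""
    m = max(scores)
    mask = [1.0 if s == m else 0.0 for s in scores]
    total = sum(mask)
    return [v / total for v in mask]

# Hypothetical attention scores with a tie for the maximum.
scores = [1.0, 3.0, 3.0, 0.5]

# Larger c plays the role of larger parameter magnitude in this sketch.
for c in (1.0, 10.0, 100.0):
    print(c, [round(p, 4) for p in softmax([c * s for s in scores])])

limit = saturated_softmax(scores)
print("saturated:", limit)
```

In this toy case the tied maximum scores split the attention weight evenly, while the remaining positions receive weight that shrinks toward zero as c increases.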
Taxonomy
Methods: Linear Layer · Gated Linear Unit · Attention Is All You Need · Inverse Square Root Schedule · Byte Pair Encoding · Softmax · Layer Normalization · Adafactor · Dense Connections · Multi-Head Attention
