Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent

William Merrill; Vivek Ramanujan; Yoav Goldberg; Roy Schwartz; and Noah Smith

arXiv:2010.09697·cs.LG·March 9, 2023

TL;DR

This paper investigates how parameter norms grow during transformer training, revealing that this growth leads to saturated networks with reduced capacity, which introduces an inductive bias affecting the network's emergent representations and attention mechanisms.

Contribution

It documents empirically that parameter norms grow during transformer training and proves that, as the norms grow, the network approximates a saturated, discretized network. Together these results characterize an implicit inductive bias of gradient descent training.

Findings

01

Parameter norms grow during training of transformers.

02

Saturation leads to simplified, discrete network structures.

03

Different attention heads specialize in local or global computations.

Abstract

The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically a variant of gradient descent (GD). To better understand this bias, we study the tendency for transformer parameters to grow in magnitude ($\ell_2$ norm) during training, and its implications for the emergent representations within self attention layers. Empirically, we document norm growth in the training of transformer language models, including T5 during its pretraining. As the parameters grow in magnitude, we prove that the network approximates a discretized network with saturated activation functions. Such "saturated" networks are known to have a reduced capacity compared to the full network family that can be described in terms of formal languages and automata. Our results suggest…
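
To make the saturation claim in the abstract concrete, here is a minimal sketch in plain Python. It is not code from the paper or its repository. It assumes that scaling all parameters by a factor `c` scales the attention scores by roughly `c`, and it compares `softmax(c * scores)` with its limit as `c` grows, which is a uniform distribution over the highest-scoring positions. The function names and the example scores are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def saturated_softmax(scores):
    """Limit of softmax(c * scores) as c grows: uniform over the argmax positions."""
    m = max(scores)
    hits = [1.0 if s == m else 0.0 for s in scores]
    n = sum(hits)
    return [h / n for h in hits]

def saturation_gap(scores, c):
    """Largest elementwise difference between softmax(c * scores) and its limit."""
    probs = softmax([c * s for s in scores])
    limit = saturated_softmax(scores)
    return max(abs(p - q) for p, q in zip(probs, limit))

# Hypothetical attention scores with a tie for the maximum.
scores = [2.0, 2.0, 1.0, -1.0]

# Assumption: a larger c stands in for a larger parameter norm.
for c in (1.0, 10.0, 100.0):
    print(f"c = {c:>5}: gap to saturated attention = {saturation_gap(scores, c):.2e}")
```

The gap shrinks quickly as `c` increases, so under this assumption the attention weights become effectively discrete: each head attends evenly to its top-scoring positions and ignores the rest.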

Code & Models

Repositories

viking-sudo-rm/norm-growth (PyTorch, official)

Taxonomy

Methods: Linear Layer · Gated Linear Unit · Attention Is All You Need · Inverse Square Root Schedule · Byte Pair Encoding · Softmax · Layer Normalization · Adafactor · Dense Connections · Multi-Head Attention