Out-of-distribution generalization via composition: a lens through induction heads in Transformers
Jiajun Song, Zhuoyan Xu, Yiqiao Zhong

TL;DR
This paper investigates how large language models generalize to out-of-distribution tasks by composing self-attention layers, revealing a shared latent space that facilitates rule inference and OOD generalization.
Contribution
It uncovers the role of induction heads and a shared latent subspace in enabling OOD generalization through compositional mechanisms in Transformers.
Findings
Models learn rules by composing self-attention layers.
A shared latent subspace aligns early and later layers.
Composition via induction heads improves OOD generalization.
Abstract
Large language models (LLMs) such as GPT-4 sometimes appear to be creative, solving novel tasks often with a few demonstrations in the prompt. These tasks require the models to generalize on distributions different from those from training data -- which is known as out-of-distribution (OOD) generalization. Despite the tremendous success of LLMs, how they approach OOD generalization remains an open and underexplored question. We examine OOD generalization in settings where instances are generated according to hidden rules, including in-context learning with symbolic reasoning. Models are required to infer the hidden rules behind input prompts without any fine-tuning. We empirically examined the training dynamics of Transformers on a synthetic example and conducted extensive experiments on a variety of pretrained LLMs, focusing on a type of components known as induction heads. We found…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsPlasma Diagnostics and Applications
MethodsLinear Layer · Residual Connection · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax
