Out-of-distribution generalization via composition: a lens through induction heads in Transformers

Jiajun Song; Zhuoyan Xu; Yiqiao Zhong

arXiv:2408.09503·cs.CL·December 31, 2024

TL;DR

This paper investigates how large language models generalize to out-of-distribution tasks by composing self-attention layers, and identifies a shared latent subspace that supports rule inference and OOD generalization.

Contribution

It uncovers the role of induction heads and a shared latent subspace in enabling OOD generalization through compositional mechanisms in Transformers.

Findings

01 Models learn rules by composing self-attention layers.
02 A shared latent subspace aligns early and later layers.
03 Composition via induction heads improves OOD generalization.

Abstract

Large language models (LLMs) such as GPT-4 sometimes appear to be creative, solving novel tasks often with a few demonstrations in the prompt. These tasks require the models to generalize to distributions different from those of the training data, which is known as out-of-distribution (OOD) generalization. Despite the tremendous success of LLMs, how they approach OOD generalization remains an open and underexplored question. We examine OOD generalization in settings where instances are generated according to hidden rules, including in-context learning with symbolic reasoning. Models are required to infer the hidden rules behind input prompts without any fine-tuning. We empirically examined the training dynamics of Transformers on a synthetic example and conducted extensive experiments on a variety of pretrained LLMs, focusing on a type of component known as induction heads. We found…
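The abstract names induction heads without describing what they do. As commonly described in the interpretability literature, an induction head looks back for an earlier occurrence of the current token and copies the token that followed it, a rule that does not depend on which symbols are involved. The following is a minimal sketch of that input-output behavior only, in plain Python. It is not the authors' code, it performs no attention computation, and the function name `induction_predict` is a hypothetical name chosen for illustration.

```python
from typing import Optional, Sequence


def induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Hypothetical helper: mimic the copy rule attributed to induction heads.

    Given a context [..., A, B, ..., A], return B: the token that followed
    the most recent earlier occurrence of the final token. Return None when
    the final token has not appeared earlier in the context.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions from right to left, skipping the final position.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # Copy whatever came right after the earlier match.
            return tokens[i + 1]
    return None


# Familiar-looking symbols.
print(induction_predict(["A", "B", "C", "A"]))      # B
# Arbitrary symbols: the rule only uses equality between tokens.
print(induction_predict(["@#", "!!", "zz", "@#"]))  # !!
```

Because the sketch only compares tokens for equality, it behaves the same on symbols it has never encountered, which is one intuition for why this kind of copying could help with inputs far from the training distribution.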

Code & Models

Repositories

jiajunsong629/ood-generalization-via-composition (PyTorch, official)

Taxonomy

Topics: Plasma Diagnostics and Applications

Methods: Linear Layer · Residual Connection · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax