Data Distributional Properties Drive Emergent In-Context Learning in Transformers

Stephanie C.Y. Chan; Adam Santoro; Andrew K. Lampinen; Jane X. Wang; Aaditya Singh; Pierre H. Richemond; Jay McClelland; Felix Hill

arXiv:2205.05055 · cs.LG · November 18, 2022 · 52 cites

Open Access · 4 Repos · 1 Video

TL;DR

This paper investigates how distributional properties of the training data, such as burstiness and skewness, give rise to emergent in-context learning in transformers, and how these data properties interact with model architecture.

Contribution

The paper shows that naturalistic data distributions, especially skewed and bursty ones, are crucial for the emergence of in-context learning in transformers, in contrast to the uniform, i.i.d. distributions assumed in standard training.

Findings

01

In-context learning emerges with bursty and skewed data distributions.

02

Models trained on skewed data can exhibit both in-context learning and weight-based learning.

03

Naturalistic data distributions elicit in-context learning only in transformers, not in recurrent models.

Abstract

Large transformer-based models are able to perform in-context few-shot learning, without being explicitly trained for it. This observation raises the question: what aspects of the training regime lead to this emergent behavior? Here, we show that this behavior is driven by the distributions of the training data itself. In-context learning emerges when the training data exhibits particular distributional properties such as burstiness (items appear in clusters rather than being uniformly distributed over time) and having large numbers of rarely occurring classes. In-context learning also emerges more strongly when item meanings or interpretations are dynamic rather than fixed. These properties are exemplified by natural language, but are also inherent to naturalistic data in a wide range of other domains. They also depart significantly from the uniform, i.i.d. training distributions…
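As a rough illustration of the two data properties named in the abstract, the Python sketch below contrasts an i.i.d. context with a bursty one, both drawn from skewed class marginals that have many rare classes. It is not the paper's data pipeline: the function names, the Zipf-style weighting with exponent 1.0, the context length of 8, and the burst size of 4 are all assumptions chosen only to make the idea concrete.

```python
import random
from collections import Counter


def zipf_weights(num_classes, exponent=1.0):
    """Normalized class weights with p(k) proportional to 1 / k**exponent.

    Hypothetical helper: a few classes are frequent, most are rare.
    """
    raw = [1.0 / (rank ** exponent) for rank in range(1, num_classes + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_iid_context(weights, context_len, rng):
    """Baseline: draw every context item independently from the marginals."""
    classes = range(len(weights))
    return rng.choices(classes, weights=weights, k=context_len)


def sample_bursty_context(weights, context_len, burst_size, rng):
    """Bursty variant: draw a few classes, then repeat each burst_size times.

    The query is taken from the context, so its class is always present
    in the sequence (an assumption of this sketch).
    """
    if context_len % burst_size != 0:
        raise ValueError("context_len must be a multiple of burst_size")
    classes = range(len(weights))
    num_bursts = context_len // burst_size
    burst_classes = rng.choices(classes, weights=weights, k=num_bursts)
    context = [c for c in burst_classes for _ in range(burst_size)]
    rng.shuffle(context)  # clusters stay within the sequence, order is mixed
    query = rng.choice(context)
    return context, query


if __name__ == "__main__":
    rng = random.Random(0)
    weights = zipf_weights(num_classes=1000, exponent=1.0)
    iid_context = sample_iid_context(weights, context_len=8, rng=rng)
    bursty_context, query = sample_bursty_context(
        weights, context_len=8, burst_size=4, rng=rng
    )
    print("i.i.d. class counts:", Counter(iid_context))
    print("bursty class counts:", Counter(bursty_context), "query:", query)
```

In the bursty sequence each sampled class appears several times, so the context itself carries enough information to label the query; in the i.i.d. sequence repeats happen only by chance and mostly for the frequent classes.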

Code & Models

Videos

Data Distributional Properties Drive Emergent In-Context Learning in Transformers · slideslive

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques