In-Context Learning Strategies Emerge Rationally

Daniel Wurgaft; Ekdeep Singh Lubana; Core Francisco Park; Hidenori Tanaka; Gautam Reddy; Noah D. Goodman

arXiv:2506.17859 · cs.LG · June 27, 2025

TL;DR

This paper presents a hierarchical Bayesian framework that explains in-context learning strategies in transformers as rational adaptations balancing data fit and complexity, unifying diverse observed behaviors.

Contribution

It introduces a normative Bayesian model that predicts ICL behavior without relying on model weights, linking strategy selection to complexity and loss tradeoffs.
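To make the loss and complexity tradeoff concrete, here is a minimal sketch in Python. It is not the paper's model: the additive score `-(n_seen * loss) - complexity`, the function name `memorizing_posterior`, and every numeric value below are assumptions chosen only to illustrate how a simpler strategy can be preferred early and a better-fitting but more complex one later.

```python
import math

def memorizing_posterior(n_seen, loss_mem, loss_gen, complexity_mem, complexity_gen):
    """Posterior weight on a 'memorizing' strategy versus a 'generalizing' one.

    Hypothetical scoring rule (an assumption, not taken from the source):
        log_score = -(n_seen * per_example_loss) - complexity
    The posterior is the two-way softmax of the log-scores.
    """
    log_mem = -n_seen * loss_mem - complexity_mem
    log_gen = -n_seen * loss_gen - complexity_gen
    d = log_mem - log_gen
    # Numerically stable logistic function of the log-score difference.
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)

# Illustrative numbers only: the memorizing strategy fits slightly better
# per example but pays a larger fixed complexity cost.
for n in (0, 1800, 4000):
    p = memorizing_posterior(n, loss_mem=0.50, loss_gen=0.55,
                             complexity_mem=100.0, complexity_gen=10.0)
    print(n, round(p, 3))
```

Under these assumed numbers the complexity penalty dominates when little data has been seen, and the small per-example loss advantage accumulates until the preference flips near `n_seen = 1800`.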

Findings

01. Accurately predicts the transformer's outputs throughout training

02. Explains the emergence of memorization and generalization strategies

03. Shows that transition times increase superlinearly with task diversity

Abstract

Recent work analyzing in-context learning (ICL) has identified a broad set of strategies that describe model behavior in different experimental conditions. We aim to unify these findings by asking why a model learns these disparate strategies in the first place. Specifically, we start with the observation that when trained to learn a mixture of tasks, as is popular in the literature, the strategies learned by a model for performing ICL can be captured by a family of Bayesian predictors: a memorizing predictor, which assumes a discrete prior on the set of seen tasks, and a generalizing predictor, where the prior matches the underlying task distribution. Adopting the normative lens of rational analysis, where a learner's behavior is explained as an optimal adaptation to data given computational constraints, we develop a hierarchical Bayesian framework that almost perfectly predicts…
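The two predictors named in the abstract can be sketched on a toy problem. The snippet below is an illustration under stated assumptions, not the paper's experimental setup: each task is a coin with an unknown bias, the seen tasks are a short list of biases, and the underlying task distribution is assumed to be Uniform(0, 1). The function names `memorizing_predictor` and `generalizing_predictor` are hypothetical.

```python
import numpy as np

def memorizing_predictor(seen_biases, heads, tails):
    """P(next flip is heads) under a discrete uniform prior on the seen tasks.

    seen_biases: coin biases encountered in training, each strictly in (0, 1).
    """
    seen = np.asarray(seen_biases, dtype=float)
    # Log-likelihood of the in-context flips under each seen bias.
    log_lik = heads * np.log(seen) + tails * np.log1p(-seen)
    # Posterior over seen tasks (the uniform prior cancels on normalization).
    w = np.exp(log_lik - log_lik.max())
    w /= w.sum()
    # Posterior-weighted average of each task's prediction.
    return float(w @ seen)

def generalizing_predictor(heads, tails):
    """P(next flip is heads) when the prior matches the task distribution.

    Assumes the task distribution is Uniform(0, 1), i.e. Beta(1, 1), so the
    posterior predictive is Laplace's rule of succession.
    """
    return (heads + 1) / (heads + tails + 2)

seen = [0.2, 0.8]  # hypothetical training tasks
print(memorizing_predictor(seen, heads=20, tails=0))
print(generalizing_predictor(heads=20, tails=0))
```

With a long run of heads in context, the memorizing predictor cannot exceed the largest seen bias (0.8 here), while the generalizing predictor keeps moving toward the empirical frequency. This is one way the two strategies can be told apart on tasks outside the training set.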

Taxonomy

Methods: Layer Normalization · Dropout · Absolute Position Encodings · Dense Connections · Byte Pair Encoding · Softmax · Label Smoothing · Transformer · Sparse Evolutionary Training