Asymptotic theory of in-context learning by linear attention

Yue M. Lu; Mary I. Letey; Jacob A. Zavatone-Veth; Anindita Maiti; Cengiz Pehlevan

arXiv:2405.11751 · stat.ML · October 6, 2025 · 1 cites

TL;DR

This paper gives a theoretical analysis of in-context learning (ICL) in an exactly solvable model of linear regression by linear attention, showing how pretraining task diversity, context length, and the number of pretraining examples shape the learning curve, with empirical validation on linear and nonlinear Transformer architectures.

Contribution

It introduces an exactly solvable model of ICL with linear attention, deriving asymptotics and identifying a phase transition between memorization and genuine learning regimes.

Findings

01. Double-descent learning curve observed with increasing pretraining examples.

02. Phase transition between memorization and true in-context learning regimes.

03. Empirical validation on linear and nonlinear Transformer architectures.

Abstract

Transformers have a remarkable ability to learn and execute tasks based on examples provided within the input itself, without explicit prior training. It has been argued that this capability, known as in-context learning (ICL), is a cornerstone of Transformers' success, yet questions about the necessary sample complexity, pretraining task diversity, and context length for successful ICL remain unresolved. Here, we provide a precise answer to these questions in an exactly solvable model of ICL of a linear regression task by linear attention. We derive sharp asymptotics for the learning curve in a phenomenologically-rich scaling regime where the token dimension is taken to infinity; the context length and pretraining task diversity scale proportionally with the token dimension; and the number of pretraining examples scales quadratically. We demonstrate a double-descent learning curve with…
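The scaling regime named in the abstract can be made concrete with a small data-generation sketch. It is only an illustration under stated assumptions: the ratio names `alpha`, `kappa`, and `tau`, the Gaussian tokens, the task prior, and the label noise are hypothetical choices that do not appear on this page, and the code is not taken from the authors' repository. It shows only how the context length and task diversity grow linearly with the token dimension `d` while the number of pretraining examples grows quadratically.

```python
import math
import random


def regime_sizes(d, alpha, kappa, tau):
    """Turn assumed ratios into integer sizes for token dimension d."""
    ell = max(1, round(alpha * d))   # context length, proportional to d
    k = max(1, round(kappa * d))     # pretraining task diversity, proportional to d
    n = max(1, round(tau * d * d))   # number of pretraining examples, quadratic in d
    return ell, k, n


def sample_pretraining_set(d, alpha, kappa, tau, noise_std=0.1, seed=0):
    """Sample n linear-regression sequences that share a finite pool of k tasks."""
    rng = random.Random(seed)
    ell, k, n = regime_sizes(d, alpha, kappa, tau)
    # Assumed task prior: k task vectors with i.i.d. entries of variance 1/d.
    tasks = [[rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)] for _ in range(k)]
    sequences = []
    for _ in range(n):
        w = rng.choice(tasks)  # each sequence reuses one task from the pool
        xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(ell + 1)]
        ys = [sum(wi * xi for wi, xi in zip(w, x)) + rng.gauss(0.0, noise_std)
              for x in xs]
        # The first ell pairs form the context; the last pair is the query and target.
        sequences.append({
            "context_x": xs[:ell],
            "context_y": ys[:ell],
            "query_x": xs[ell],
            "target_y": ys[ell],
        })
    return sequences


if __name__ == "__main__":
    for d in (4, 8, 16):
        # Doubling d doubles ell and k but quadruples n.
        print(d, regime_sizes(d, alpha=2.0, kappa=0.5, tau=1.5))
```

With the ratios held fixed, doubling `d` doubles the context length and the task pool but quadruples the number of pretraining sequences, which is the proportional-and-quadratic scaling the abstract describes.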

Code & Models

Repositories

Pehlevan-Group/icl-asymptotic (Official)

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Dense Connections · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding · Adam · Dropout