On the Emergence of Induction Heads for In-Context Learning

Tiberiu Musat; Tiago Pimentel; Lorenzo Noci; Alessandro Stolfo; Mrinmaya Sachan; Thomas Hofmann

arXiv:2511.01033·cs.AI·January 12, 2026

Open Access · 4 Reviews

TL;DR

This paper investigates how induction heads emerge in transformer models, revealing their interpretable structure, theoretical origins, and the dynamics governing their development during training, which enhances understanding of in-context learning mechanisms.

Contribution

It provides a theoretical and empirical analysis of induction head emergence, including a minimal task formulation, a proof of constrained training dynamics, and insights into the dimensionality of the process.

Findings

01

Induction heads have a simple, interpretable structure.

02

Training dynamics are constrained to a 19-dimensional subspace.

03

Emergence time of induction heads scales quadratically with input length.
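
As a rough worked reading of the third finding, suppose the emergence time behaves like a pure quadratic in the input length. Here $t_{ICL}$ denotes the emergence time and $N$ the input length (the notation the reviews use), and $c$ is an unspecified constant introduced only for illustration. Under that assumption, doubling the input length multiplies the emergence time by about four:

```latex
t_{ICL}(N) \approx c\,N^2
\quad\Longrightarrow\quad
\frac{t_{ICL}(2N)}{t_{ICL}(N)} \approx \frac{c\,(2N)^2}{c\,N^2} = 4
```

This is only an asymptotic reading of a $\Theta(N^2)$ statement; lower-order terms and the actual constants are not given here.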

Abstract

Transformers have become the dominant architecture for natural language processing. Part of their success is owed to a remarkable capability known as in-context learning (ICL): they can acquire and apply novel associations solely from their input context, without any updates to their weights. In this work, we study the emergence of induction heads, a previously identified mechanism in two-layer transformers that is particularly important for in-context learning. We uncover a relatively simple and interpretable structure of the weight matrices implementing the induction head. We theoretically explain the origin of this structure using a minimal ICL task formulation and a modified transformer architecture. We give a formal proof that the training dynamics remain constrained to a 19-dimensional subspace of the parameter space. Empirically, we validate this constraint while observing that…
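
For readers new to the term, the behavior an induction head implements can be sketched as a hard-matching rule: find an earlier occurrence of the current token and copy the token that followed it. The snippet below is an idealized illustration only, not the paper's model or task; the function name `induction_predict` is hypothetical, and a real induction head performs this lookup with soft attention across two layers rather than an exact search.

```python
from typing import Optional, Sequence


def induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Idealized induction rule (illustrative, not the paper's architecture).

    Returns the token that followed the most recent earlier occurrence
    of the last token, or None if the last token has not appeared before.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions from right to left, skipping the current position.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # Copy the token that came right after the earlier match.
            return tokens[i + 1]
    return None


if __name__ == "__main__":
    # "A" was followed by "B" earlier in the context, so "B" is predicted.
    print(induction_predict(["A", "B", "C", "A"]))
```

The association between "A" and "B" exists only in the context, which is what makes this a toy instance of in-context learning: nothing about the pair is stored in weights.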

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 2

Strengths

The three-parameter dynamics, the emergence order $\gamma \to \beta \to \alpha$, and the quadratic scaling are well motivated inside the minimal setup, with closed-form losses and derivatives (Appendix B) and consistent empirical plots (Fig. 5).

Weaknesses

The paper relies too heavily on strong, non-standard assumptions about the Transformer. The key results hinge on concatenated residuals, isotropic data, and even a ban on self-attention to the current position, none of which is standard practice. Since the proofs and most experiments apply to disentangled residuals under isotropy and other constraints, the claims and title should explicitly limit their scope; otherwise readers may infer general statements about standard transformers that are not actual

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Fundamental Problem: The paper tackles a core question in mechanistic interpretability: not just what circuits (like induction heads) exist, but how they are learned by gradient descent. Moving from a static to a dynamic analysis is a valuable contribution.
2. Novel Theoretical Result: Theorem 1, which proves that training dynamics are constrained to a 19-dimensional subspace, is a strong and elegant result. The proof technique using data isotropy and rotational symmetry is clever.
3. Concr

Weaknesses

My primary concerns relate to the significant gap between the highly simplified model used for the theory and a standard transformer, as well as several restrictive assumptions and internal inconsistencies that call the generality of the main results into question.
1. Gap Between Standard and Minimal Architectures: The paper begins by motivating the problem with a "standard" attention-only transformer (§2, Fig. 2) but then immediately pivots to a highly artificial, minimal architecture for all

Reviewer 03 · Rating 4 · Confidence 4

Strengths

The analysis of this minimal model is interesting and convincing. Theorems 1 and 2 show what's going on in this little model on this induction head learning task.

Weaknesses

For this reviewer, it is worth remembering that we are examining the emergence of a particular behavior from a task that was designed to induce this behavior. Thus the results have to be treated with caution. First, Olsson et al. (2022) and Reddy (2023) give suggestive evidence that ICL patterns with a certain learnable behavior (induction head behavior and with a certain loss across token indices). But it's not terribly surprising that even very simple models can learn this copying behavior. Af

Reviewer 04 · Rating 6 · Confidence 2

Strengths

1. **Theoretical Rigor:** The paper provides a complete, end-to-end theoretical analysis of its minimal model, from subspace constraints (Theorem 1) to the final scaling laws (Theorem 2).
2. **Clear Identification of Mechanism:** The empirical reduction from the 19-dimensional theoretical subspace to a 3-dimensional functional subspace ($\alpha_3, \beta_2, \gamma_3$) is a clean and insightful result.
3. **Novel Scaling Law:** The derivation of the $t_{ICL} = \Theta(N^2)$ scaling law is a conc

Weaknesses

1. **Overly-Strong Assumptions:** The primary weakness is the reliance on a set of highly non-standard and simplifying assumptions, most critically **zero initialization**. This assumption is key to the symmetry argument of the main proof (Theorem 1) but is not representative of standard transformer training, thus limiting the applicability of all subsequent results.
2. **Artificial Model and Data:** The theoretical analysis is not on a standard transformer, but a "disentangled" model. The dat

Taxonomy

Topics: Machine Learning and Algorithms · Language and cultural evolution · Ferroelectric and Negative Capacitance Devices