How Transformers Get Rich: Approximation and Dynamics Analysis
Mingze Wang, Ruoxi Yu, Weinan E, Lei Wu

TL;DR
This paper provides approximation and training-dynamics analyses of how transformers implement induction heads, revealing an abrupt transition from the lazy (4-gram) mechanism to the rich (induction head) in-context mechanism during training.
Contribution
It formalizes induction head mechanisms and analyzes their implementation and training dynamics, offering new insights into transformer in-context learning.
Findings
Transformers can efficiently implement induction heads.
Training exhibits an abrupt transition from lazy to rich mechanisms.
Dynamics analysis reveals the training process of induction heads.
Abstract
Transformers have demonstrated exceptional in-context learning capabilities, yet the theoretical understanding of the underlying mechanisms remains limited. A recent work (Elhage et al., 2021) identified a "rich" in-context mechanism known as induction head, contrasting with "lazy" $n$-gram models that overlook long-range dependencies. In this work, we provide both approximation and dynamics analyses of how transformers implement induction heads. In the *approximation* analysis, we formalize both standard and generalized induction head mechanisms, and examine how transformers can efficiently implement them, with an emphasis on the distinct role of each transformer submodule. For the *dynamics* analysis, we study the training dynamics on a synthetic mixed target, composed of a 4-gram and an in-context 2-gram component. This controlled setting allows us to precisely…
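The contrast between the two mechanisms named in the abstract can be illustrated on discrete tokens. The following is a minimal sketch, not the paper's formal definition: it assumes a hard-matching copy rule (find the most recent earlier occurrence of the current token and return its successor) as a stand-in for the in-context 2-gram induction head, and a fixed-window context as a stand-in for an $n$-gram model. The function names `induction_head_predict` and `ngram_context` are hypothetical and chosen only for illustration.

```python
from typing import Hashable, Optional, Sequence, Tuple


def induction_head_predict(tokens: Sequence[Hashable]) -> Optional[Hashable]:
    """Illustrative copy rule (an assumption, not the paper's construction).

    Looks for the most recent earlier occurrence of the last token and
    returns the token that followed it. Returns None if no earlier match exists.
    """
    if len(tokens) < 2:
        return None
    current = tokens[-1]
    # Scan earlier positions from right to left, excluding the last position.
    for s in range(len(tokens) - 2, -1, -1):
        if tokens[s] == current:
            return tokens[s + 1]
    return None


def ngram_context(tokens: Sequence[Hashable], n: int) -> Tuple[Hashable, ...]:
    """Context visible to an n-gram predictor: only the last n-1 tokens."""
    if n <= 1:
        return ()
    return tuple(tokens[-(n - 1):])


if __name__ == "__main__":
    seq = list("abcxyzab")
    # The copy rule can reach back to the earlier "b", however far away it is.
    print(induction_head_predict(seq))
    # A 4-gram context only ever sees the last three tokens.
    print(ngram_context(seq, 4))
```

In this toy setting, the copy rule depends on a match that can sit arbitrarily far back in the sequence, while the fixed-window context never looks further than `n - 1` tokens, which is the sense in which the abstract describes $n$-gram models as overlooking long-range dependencies.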
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
Given that induction heads are widely assumed to be critical for in-context learning, their formation dynamics have become a focal point of recent research. This paper contributes to this area by establishing grounding definitions for induction heads and studying how these mechanisms are represented within the Transformer architecture. The proposed simplified model architectures and specialized task make it easier to isolate and analyze their formation.
**Validity of Theorems:** There appears to be an issue with Theorems 3.3 and 3.4 (a specific question regarding this is detailed below). **Mischaracterization of "Lazy" Learning:** I disagree with the paper's description of the task dynamics as a "lazy" phenomenon. The learning of $f_{G_4}^*$ is still a form of feature learning and does not align with the formal definitions of "lazy" (or kernel-regime) learning established in prior work (e.g., Chizat et al., 2018; Woodworth et al., 2020). **Depe
1. The induction head is an interesting and important mechanism in transformer research, and this paper constructs a comprehensive theoretical framework for it. The modeling of the induction head is intuitively reasonable. Its progressive analysis is logical and supported by rigorous theoretical proof. 2. Many proofs in transformer theory research involve artificially constructing the model's weights for subsequent analysis. While the first part of this paper also utilizes this technique, the st
1. I think the approximation analysis of induction heads seems to have appeared in previous work, such as [1]. The difference between this work and previous papers may be that it studies the training dynamics from the 4-gram to the induction head. This may weaken the contribution of this paper. 2. This paper demonstrates in the approximation part that the transformer can achieve induction heads by constructing parameters. Although there is an analysis of the training dynamics later, this does
Though I did not carefully check the proof, the technical details seem sound to me. The presentation of the paper is clear and logical. The work also proposes a clean, analyzable setting, and the experimental evidence is aligned with the theory. The paper gives a unified conceptual bridge between two areas that are usually disconnected: (i) mechanistic accounts of “induction heads” and (ii) the actual temporal trajectory of training under gradient-based optimization. Even if individual ingredi
However, I have some major concerns about the current work: **1. Realism / motivation of the mixed target** The core dynamics result is proved in a very specific “mixed target’’ setting where the ground truth is a convex combination of (i) a handpicked 4-gram rule and (ii) a vanilla 2-gram induction-head-style copying rule. It’s not obvious when this exact mixture arises in real next-token prediction. The paper justifies 4-gram instead of 2/3-gram mainly to avoid trivial cases where the model
- The paper provides a theoretical understanding of the working of induction heads in transformers. First, in approximation analysis, it shows that a transformer with 2 attention layers without FFNs can achieve induction heads by proving it by construction. - For analyzing the training dynamics, the paper proposes a target function consisting of 2-gram and 4-gram components, then a layerwise training is done to show that the model first learns the induction head, followed by the second stage of
- The results in the approximation analysis are one way of constructing and explaining the working of transformers. Can the authors provide further evidence that the constructions used in the proofs are indeed how 2-layer transformers work? E.g., moving from Thm 3.3 to Thm 3.4, in line 277, it is discussed that the FFN is used for approximation, which is intuitively correct, but is there a way to check if this actually is what is happening in the transformer? There are several other such constr
