How Transformers Learn In-Context Recall Tasks? Optimality, Training Dynamics and Generalization
Quan Nguyen, Thanh Nguyen-Tang

TL;DR
This paper investigates how transformers trained on in-context recall tasks learn, converge, and generalize, providing theoretical proofs of optimality, convergence rates, and out-of-distribution generalization, supported by empirical validation.
Contribution
It offers the first formal analysis of the training dynamics and generalization of transformers on in-context recall tasks, proving Bayes-optimality and convergence properties.
Findings
Transformers with linear, ReLU, or softmax attention are Bayes-optimal for recall tasks.
Expected loss converges linearly to Bayes risk during training.
Trained transformers can generalize out-of-distribution, but larger models may fail without proper parameterization.
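For readers unfamiliar with the term, "linear convergence" in the second finding is usually meant in the optimization sense of a geometrically shrinking gap. As a sketch only: the symbols $\lambda_t$ (the trainable parameter after $t$ gradient steps), $\mathcal{L}$ (expected loss), $\mathcal{L}^\star$ (Bayes risk) and the rate $\rho$ are notation assumed here for illustration, not taken from the paper, whose exact constants and conditions are not given on this page.

$$
\mathcal{L}(\lambda_t) - \mathcal{L}^\star \;\le\; \rho^{\,t}\,\bigl(\mathcal{L}(\lambda_0) - \mathcal{L}^\star\bigr), \qquad 0 < \rho < 1 .
$$

Under such a bound, reaching an excess risk of at most $\varepsilon$ takes on the order of $\log(1/\varepsilon)$ gradient steps.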
Abstract
We study the approximation capabilities, convergence speeds and on-convergence behaviors of transformers trained on in-context recall tasks, which require recognizing the \emph{positional} association between a pair of tokens from in-context examples. Existing theoretical results only focus on the in-context reasoning behavior of transformers after being trained for the \emph{one} gradient descent step. It remains unclear what the on-convergence behavior of transformers trained by gradient descent is and how fast they converge. In addition, the generalization of transformers in one-step in-context reasoning has not been formally investigated. This work addresses these gaps. We first show that a class of transformers with either linear, ReLU or softmax attention is provably Bayes-optimal for an in-context recall task. When being trained with gradient descent, we…
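To make "positional association between a pair of tokens" concrete, here is a minimal toy sketch of a recall rule: given a context and a query token, return the token that immediately followed the query's most recent earlier occurrence. This is an assumption-laden illustration, not the paper's task definition; the function name `recall` and the example tokens are hypothetical.

```python
from typing import Optional, Sequence


def recall(context: Sequence[str], query: str) -> Optional[str]:
    """Return the token right after the most recent occurrence of `query`.

    Toy illustration only (an assumption, not the paper's formal task):
    the answer depends on where tokens sit relative to each other,
    not on what the tokens mean.
    """
    # Scan right to left; the last position is skipped because it has no successor.
    for i in range(len(context) - 2, -1, -1):
        if context[i] == query:
            return context[i + 1]
    return None  # the query never appeared with a successor


context = ["a", "x", "b", "y", "c", "z"]
print(recall(context, "b"))  # y
```

A model that solves such a task in context has to locate the earlier occurrence of the query and copy the token at the next position, which is the kind of behavior the abstract refers to.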
Peer Reviews
Decision·Submitted to ICLR 2026
- Clear theoretical guarantees across three attention types. Linear/ReLU (Lemma 3.1, Thm. 3.2) and softmax (Lemma 4.1, Thm. 4.2) get explicit parameterizations with linear convergence proofs.
- OOD to unseen outputs is formalized and proved in both noiseless and noisy settings (Thm. 3.3, Thm. 5.5).
- Mechanistic interpretability hook: theorem showing attention predicts outputs while FFN handles noise after enough steps (Thm. 5.6).
1. The abstract states: Existing theoretical results only focus on the in-context reasoning behavior of transformers after being trained for the one gradient descent step. This is not correct. Several papers analyze full training dynamics over many GD steps and prove convergence (often linear/finite‐time), not merely “one step” (you cited the first one, and didn't cite the last two), e.g.: [1] Huang, Cheng & Liang (2023). In-context convergence of transformers. and In-context learning with repr
1. The setting of in-context reasoning is an important and interesting problem to study.
2. The analysis is comprehensive and covers many aspects of the theory.
1. The analysis is simplified to consider only $\lambda$ as the trainable parameter. This is too restrictive.
2. The writing can be improved. Why not put real-world examples right after Definition 2.1?
3. The experiments only show the necessity of reparameterization. However, I think it only applies to synthetic experiments with two-layer models. It is unclear whether reparameterization is important in real-world experiments.
1. This paper extends prior studies that only analyzed the first training step or infinite-sample limits by providing finite-sample analyses, explicit reparameterizations, and empirical validations demonstrating when proper parameterization is crucial for OOD generalization.
2. This paper theoretically characterizes the different behaviors of the feed-forward layer and the attention layer in in-context recall tasks, which is insightful.
1. Limited architecture depth: Results are confined to one-layer, single-head transformers, far from the multi-layer, residual, or multi-head dynamics that dominate real LLMs.
2. The experiments in this paper focus on synthetic tasks; it would be better if the authors could consider real-world language tasks.
