How Transformers Learn In-Context Recall Tasks? Optimality, Training Dynamics and Generalization

Quan Nguyen; Thanh Nguyen-Tang

arXiv:2505.15009·cs.LG·October 22, 2025

TL;DR

This paper investigates how transformers trained on in-context recall tasks learn, converge, and generalize, providing theoretical proofs of optimality, convergence rates, and out-of-distribution generalization, supported by empirical validation.

Contribution

It offers the first formal analysis of the training dynamics and generalization of transformers on in-context recall tasks, proving Bayes-optimality and convergence properties.

Findings

1. Transformers with linear, ReLU, or softmax attention are Bayes-optimal for recall tasks.
2. Expected loss converges linearly to Bayes risk during training.
3. Trained transformers can generalize out-of-distribution, but larger models may fail without proper parameterization.
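
As a reading aid for the convergence finding, the standard meaning of "linear convergence to Bayes risk" can be written out. The symbols below ($\theta_t$ for the parameters after $t$ gradient steps, $L$ for the expected loss, $L^\star$ for the Bayes risk, and the constants $C$ and $\rho$) are generic notation assumed here for illustration; they are not taken from the paper, whose exact constants and conditions are not shown on this page.

$$
L(\theta_t) - L^\star \le C\,\rho^{t}, \qquad C > 0,\ \rho \in (0, 1)
\quad\Longrightarrow\quad
L(\theta_t) - L^\star \le \varepsilon \ \text{ whenever } \ t \ge \frac{\log(C/\varepsilon)}{\log(1/\rho)}.
$$

In words, the gap to the Bayes risk shrinks by at least a constant factor per step, so the number of steps needed grows only logarithmically in $1/\varepsilon$.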

Abstract

We study the approximation capabilities, convergence speeds and on-convergence behaviors of transformers trained on in-context recall tasks, which require recognizing the positional association between a pair of tokens from in-context examples. Existing theoretical results only focus on the in-context reasoning behavior of transformers after being trained for the one gradient descent step. It remains unclear what the on-convergence behavior of transformers trained by gradient descent is and how fast they converge. In addition, the generalization of transformers in one-step in-context reasoning has not been formally investigated. This work addresses these gaps. We first show that a class of transformers with either linear, ReLU or softmax attention is provably Bayes-optimal for an in-context recall task. When being trained with gradient descent, we…
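
To make the notion of a positional association concrete, here is a minimal Python sketch of a toy recall task. It is an illustration under stated assumptions, not the paper's data model: the generator `make_recall_sequence`, the baseline `recall_by_lookup`, the single-trigger layout, and all parameter values are hypothetical. The sequence contains a trigger token followed by an output token, it ends with the trigger again, and the correct next token is the output that followed the trigger earlier in the context.

```python
import random


def make_recall_sequence(vocab_size, length, trigger, rng):
    """Build one toy sequence (hypothetical layout, not the paper's exact task).

    The trigger appears once in the context, immediately followed by an
    output token, and once more as the final (query) token. The target is
    the output token that followed the first trigger occurrence.
    """
    if length < 3:
        raise ValueError("length must be at least 3")
    others = [t for t in range(vocab_size) if t != trigger]
    output = rng.choice(others)
    # Filler tokens exclude the trigger so the association is unambiguous.
    seq = [rng.choice(others) for _ in range(length)]
    pos = rng.randrange(0, length - 2)  # leaves room for the output and the query
    seq[pos] = trigger
    seq[pos + 1] = output
    seq[-1] = trigger  # the query repeats the trigger
    return seq, output


def recall_by_lookup(seq):
    """Non-learned baseline: return the token that followed the query earlier."""
    query = seq[-1]
    for i in range(len(seq) - 1):
        if seq[i] == query:
            return seq[i + 1]
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    seq, target = make_recall_sequence(vocab_size=10, length=8, trigger=0, rng=rng)
    print(seq, "-> target", target, "| lookup", recall_by_lookup(seq))
```

The lookup baseline solves this noiseless toy version exactly, which is the behavior a trained model would need to reproduce from the context alone, since the output token is resampled for every sequence.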

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Clear theoretical guarantees across three attention types. Linear/ReLU (Lemma 3.1, Thm. 3.2) and softmax (Lemma 4.1, Thm. 4.2) get explicit parameterizations with linear convergence proofs.
- OOD generalization to unseen outputs is formalized and proved in both noiseless and noisy settings (Thm. 3.3, Thm. 5.5).
- Mechanistic interpretability hook: a theorem showing that attention predicts outputs while the FFN handles noise after enough steps (Thm. 5.6).

Weaknesses

1. The abstract states: "Existing theoretical results only focus on the in-context reasoning behavior of transformers after being trained for the one gradient descent step." This is not correct. Several papers analyze full training dynamics over many GD steps and prove convergence (often linear or finite-time), not merely "one step" (you cited the first one and didn't cite the last two), e.g.: [1] Huang, Cheng & Liang (2023). In-context convergence of transformers. and In-context learning with repr

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The setting of in-context reasoning is an important and interesting problem to study.
2. The analysis is comprehensive and covers many aspects of the theory.

Weaknesses

1. The analysis is simplified to consider only $\lambda$ as the trainable parameter. This is too restrictive.
2. The writing can be improved. Why not put real-world examples right after Definition 2.1?
3. The experiments only show the necessity of reparameterization. However, I think it only applies to synthetic experiments with two-layer models. It is unclear whether reparameterization is important in real-world experiments.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. This paper extends prior studies that only analyzed the first training step or infinite-sample limits by providing finite-sample analyses, explicit reparameterizations, and empirical validations demonstrating when proper parameterization is crucial for OOD generalization.
2. This paper theoretically characterizes the different behaviors of the feed-forward layer and the attention layer in in-context recall tasks, which is insightful.

Weaknesses

1. Limited architecture depth: Results are confined to one-layer, single-head transformers, far from the multi-layer, residual, or multi-head dynamics that dominate real LLMs.
2. The experiments in this paper focus on synthetic tasks; it would be better if the authors could consider real-world language tasks.

Code & Models

Repositories

ngmq/onelayer-transformer-ICR-DA-NTP
PyTorch · Official

Taxonomy

Topics: Advanced Graph Neural Networks · Machine Learning and Data Classification · Topic Modeling