TL;DR
This paper provides a theoretical analysis of how transformer models learn in-context nonlinear regression tasks, revealing the dynamics of attention and conditions for successful learning based on the Lipschitz constant of target functions.
Contribution
It introduces new proof techniques to analyze attention dynamics in nonlinear regression, establishing how Lipschitz constants influence convergence and in-context learning capabilities.
Findings
Attention scores between a query token and its target features grow rapidly early in training and converge to one
Lipschitz constant L governs convergence speed
Transformers attend to relevant features at convergence
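The findings above can be illustrated with a small numerical sketch. It is not the paper's model or training procedure. It assumes a toy setting in which prompt features come from a small set of unit-norm, non-orthogonal vectors, the prediction is a softmax-attention-weighted average of the prompt labels, and a single hypothetical scale parameter `beta` stands in for the learned attention weights. The target function `target` and all names below are illustrative assumptions. Sweeping `beta` upward mimics attention sharpening: the total attention mass on prompt tokens that share the query's feature moves from the uniform fraction toward one, and the prediction approaches the target value at the query.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_weights(features, prompt_ids, query_id, beta):
    """Softmax attention of the query over prompt tokens.
    Logits are beta * <x_i, x_query>; beta is a hypothetical stand-in
    for the trained attention parameters (an assumption of this sketch)."""
    q = features[query_id]
    return softmax([beta * dot(features[k], q) for k in prompt_ids])

def matching_attention_mass(features, prompt_ids, query_id, beta):
    """Total attention placed on prompt tokens whose feature equals the query's."""
    w = attention_weights(features, prompt_ids, query_id, beta)
    return sum(wi for wi, k in zip(w, prompt_ids) if k == query_id)

def predict(features, prompt_ids, labels, query_id, beta):
    """Attention-weighted average of the prompt labels."""
    w = attention_weights(features, prompt_ids, query_id, beta)
    return sum(wi * y for wi, y in zip(w, labels))

# Four unit-norm, non-orthogonal features in R^2 (illustrative choice).
angles = [0.0, 0.7, 1.4, 2.1]
features = [(math.cos(a), math.sin(a)) for a in angles]

def target(x):
    """Hypothetical nonlinear, Lipschitz target function (assumption)."""
    return math.tanh(3.0 * x[0])

# Deterministic prompt: each feature appears three times.
prompt_ids = [i % 4 for i in range(12)]
labels = [target(features[k]) for k in prompt_ids]
query_id = 0

for beta in (0.0, 1.0, 10.0, 100.0):
    mass = matching_attention_mass(features, prompt_ids, query_id, beta)
    pred = predict(features, prompt_ids, labels, query_id, beta)
    print(f"beta={beta:6.1f}  matching mass={mass:.6f}  prediction={pred:.6f}")
```

At `beta = 0` the attention is uniform, so the matching mass equals the fraction of prompt tokens that share the query's feature. The sketch only shows the end state that sharper attention produces. It says nothing about how gradient descent reaches that state or how the Lipschitz constant affects the speed, which is the subject of the paper's analysis.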
Abstract
The transformer architecture, which processes sequences of input tokens to produce outputs for query tokens, has revolutionized numerous areas of machine learning. A defining feature of transformers is their ability to perform previously unseen tasks using task-specific prompts without updating parameters, a phenomenon known as in-context learning (ICL). Recent research has actively explored the training dynamics behind ICL, with much of the focus on relatively simple tasks such as linear regression and binary classification. To advance the theoretical understanding of ICL, this paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities in these settings. We analyze the stage-wise dynamics of attention during training: attention scores between a query token and its target features grow rapidly in the early…
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper studies the convergence of transformers during the pretraining stage, which is an interesting topic.
2. The discovery of the sharp and flat regimes, along with the two-phase transition observed in both the training loss and the attention scores during training, is particularly intriguing.
3. Experimental results show the two-phase transition of the pretraining stage, which coincides with the theory.
1. The paper focuses only on a single-layer Transformer, and extending the analysis to a multi-layer setting would provide valuable insights.
2. The technical approach appears to overlap with that of [1], which somewhat diminishes the paper’s original contribution.
[1] Huang, Y., Cheng, Y., & Liang, Y. (2023). In-context convergence of transformers. arXiv preprint arXiv:2310.05249.
This is a training dynamics analysis, as opposed to characterization of global OPT. The paper is clear to read.
Features come from a discrete set? The optimal attention pattern is to attend to only exactly identical vectors.
- The paper removes strong assumptions made in prior work (such as linearity or orthogonal feature bases) and establishes theory for a broader class of problems.
- The comparison with previous studies is clear. Existing results on optimization in in-context learning are well summarized, and the paper clearly explains how its contributions differ from them.
1. The experiments in Section 6 are conducted on a simple case. It would be desirable to evaluate the theory in more practical settings to confirm whether the assumptions and results hold for real-world problems.
2. In addition to data realism, empirical validation on more realistic architectures (e.g., deeper Transformers) would also strengthen the paper.
3. The paper focuses on optimizing population risk, without discussing how the finite-sample training loss landscape might differ or how gene
