Provable In-Context Learning of Nonlinear Regression with Transformers

Hongbo Li; Lingjie Duan; Yingbin Liang

arXiv:2507.20443·cs.LG·October 2, 2025

3 Reviews

TL;DR

This paper provides a theoretical analysis of how transformer models learn in-context nonlinear regression tasks, revealing the dynamics of attention and conditions for successful learning based on the Lipschitz constant of target functions.

Contribution

It introduces new proof techniques to analyze attention dynamics in nonlinear regression, establishing how Lipschitz constants influence convergence and in-context learning capabilities.
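For reference, the Lipschitz condition mentioned here can be written as below. This is the standard textbook definition; the Euclidean norm and the domain symbol \mathcal{X} are assumptions, since this summary does not state which norm or domain the paper uses.

```latex
% f is L-Lipschitz on its domain \mathcal{X} (norm and domain assumed here)
\[
  |f(x) - f(x')| \;\le\; L \,\lVert x - x' \rVert_2
  \qquad \text{for all } x, x' \in \mathcal{X}.
\]
```

A smaller L means the target function varies more slowly between nearby inputs, and a larger L means it can change more sharply.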

Findings

01

Attention scores grow rapidly early and converge to one

02

Lipschitz constant L governs convergence speed

03

Transformers attend to relevant features at convergence
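The first and third findings can be illustrated with a minimal sketch. It is not the paper's model or training procedure: the feature set, the target function `target`, and the scalar `scale` (a stand-in for the magnitude of a learned attention parameter) are all hypothetical. It only shows that, for softmax attention over unit-norm, non-orthogonal features, the attention mass on prompt tokens matching the query approaches one as the score scale grows, and the prediction approaches the query's label.

```python
import numpy as np

def attention_predict(X, y, x_query, scale):
    """Softmax attention over prompt tokens; returns the weighted label average and the weights."""
    scores = scale * (X @ x_query)
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return float(weights @ y), weights

# Hypothetical discrete feature set: four unit vectors in 2-D, not orthogonal.
angles = np.deg2rad([0.0, 50.0, 110.0, 200.0])
features = np.stack([np.cos(angles), np.sin(angles)], axis=1)

def target(v):
    """Hypothetical nonlinear target function (an assumption for illustration)."""
    return np.sin(3.0 * v[0]) + v[1] ** 2

# Prompt of 32 tokens cycling through the feature set, with their labels.
N = 32
idx = np.arange(N) % len(features)
X = features[idx]
y = np.array([target(x) for x in X])

q = 1  # index of the query's feature
x_query = features[q]

scales = [0.0, 1.0, 5.0, 50.0]
masses, errors = [], []
for s in scales:
    pred, w = attention_predict(X, y, x_query, s)
    masses.append(float(w[idx == q].sum()))          # attention mass on matching tokens
    errors.append(abs(pred - target(x_query)))       # prediction error for the query
    print(f"scale={s:5.1f}  mass on matching tokens={masses[-1]:.6f}  error={errors[-1]:.6f}")
```

At `scale = 0` the attention is uniform, so the prediction is just the mean label. Because each unit vector has a strictly larger inner product with itself than with any other feature, increasing `scale` shifts the weight toward the matching tokens.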

Abstract

The transformer architecture, which processes sequences of input tokens to produce outputs for query tokens, has revolutionized numerous areas of machine learning. A defining feature of transformers is their ability to perform previously unseen tasks using task specific prompts without updating parameters, a phenomenon known as in-context learning (ICL). Recent research has actively explored the training dynamics behind ICL, with much of the focus on relatively simple tasks such as linear regression and binary classification. To advance the theoretical understanding of ICL, this paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities in these settings. We analyze the stage-wise dynamics of attention during training: attention scores between a query token and its target features grow rapidly in the early…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper studies the convergence of transformers during the pretraining stage, which is an interesting topic.
2. The discovery of the sharp and flat regimes, along with the two-phase transition observed in both the training loss and the attention scores during training, is particularly intriguing.
3. Experimental results show the two-phase transition of the pretraining stage, which coincides with the theory.

Weaknesses

1. The paper focuses only on a single-layer Transformer, and extending the analysis to a multi-layer setting would provide valuable insights.
2. The technical approach appears to overlap with that of [1], which somewhat diminishes the paper's original contribution.

[1] Huang, Y., Cheng, Y., & Liang, Y. (2023). In-context convergence of transformers. arXiv preprint arXiv:2310.05249.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

This is a training dynamics analysis, as opposed to characterization of global OPT. The paper is clear to read.

Weaknesses

Do the features come from a discrete set? If so, the optimal attention pattern is to attend only to vectors that are exactly identical to the query.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The paper removes strong assumptions made in prior work (such as linearity or orthogonal feature bases) and establishes theory for a broader class of problems.
- The comparison with previous studies is clear. Existing results on optimization in in-context learning are well summarized, and the paper clearly explains how its contributions differ from them.

Weaknesses

1. The experiments in Section 6 are conducted on a simple case. It would be desirable to evaluate the theory in more practical settings to confirm whether the assumptions and results hold for real-world problems.
2. In addition to data realism, empirical validation on more realistic architectures (e.g., deeper Transformers) would also strengthen the paper.
3. The paper focuses on optimizing population risk, without discussing how the finite-sample training loss landscape might differ or how gene
