TL;DR
This paper provides a theoretical framework linking in-context learning in transformers to kernel methods, especially for structured geometric data on manifolds, and analyzes their generalization capabilities.
Contribution
It establishes a novel connection between attention mechanisms and kernel methods, deriving error bounds and minimax rates for in-context learning on manifolds.
Findings
Transformers perform kernel-based predictions via attention on structured data.
Learned prompt-query scores correlate with Gaussian kernels for Hölder functions.
Generalization error scales exponentially with prompt length, depending on manifold dimension.
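The link between attention and kernel prediction in the findings above can be illustrated with a small NumPy sketch. It is not the paper's 5-block construction: the function names, the bandwidth `h`, the circle-shaped data, and the target function are all assumptions chosen for illustration. The sketch only checks the standard identity that a softmax over dot-product scores, with each key augmented by its squared norm, reproduces the Gaussian Nadaraya-Watson weights.

```python
import numpy as np

def gaussian_nw(x_query, X, y, h):
    """Gaussian Nadaraya-Watson estimate at x_query from prompt pairs (X, y)."""
    sq_dist = np.sum((X - x_query) ** 2, axis=1)
    weights = np.exp(-sq_dist / (2.0 * h ** 2))
    return np.dot(weights, y) / np.sum(weights)

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def attention_nw(x_query, X, y, h):
    """One softmax-attention read-out using dot-product scores only.

    Keys are [x_i, ||x_i||^2] and the query is [x_q / h^2, -1 / (2 h^2)], so
    score_i = <x_i, x_q> / h^2 - ||x_i||^2 / (2 h^2).
    This differs from -||x_q - x_i||^2 / (2 h^2) only by a term that is
    constant in i, which the softmax normalisation cancels.
    """
    keys = np.hstack([X, np.sum(X ** 2, axis=1, keepdims=True)])
    query = np.concatenate([x_query / h ** 2, [-1.0 / (2.0 * h ** 2)]])
    scores = keys @ query
    return np.dot(softmax(scores), y)

# Hypothetical prompt: points on the unit circle (a 1-D manifold in R^2).
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=64)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)
y = np.sin(3.0 * theta)  # assumed smooth target, for illustration only
x_query = np.array([np.cos(0.7), np.sin(0.7)])
h = 0.2  # assumed kernel bandwidth

nw_pred = gaussian_nw(x_query, X, y, h)
attn_pred = attention_nw(x_query, X, y, h)
print(np.allclose(nw_pred, attn_pred))
```

The two predictions agree up to floating-point error for any bandwidth, because the softmax denominator plays the role of the kernel normaliser.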
Abstract
While in-context learning (ICL) has achieved remarkable success in natural language and vision domains, its theoretical understanding, particularly in the context of structured geometric data, remains unexplored. This paper initiates a theoretical study of ICL for regression of Hölder functions on manifolds. We establish a novel connection between the attention mechanism and classical kernel methods, demonstrating that transformers effectively perform kernel-based prediction at a new query through its interaction with the prompt. This connection is validated by numerical experiments, revealing that the learned query-prompt scores for Hölder functions are highly correlated with the Gaussian kernel. Building on this insight, we derive generalization error bounds in terms of the prompt length and the number of training tasks. When a sufficient number of training tasks are observed,…
Peer Reviews
Decision: ICLR 2026 Poster
- The paper is well-written and generally easy to follow with novel contributions.
- Lemma 1, where the authors explicitly construct a 5-block, multi-head transformer that exactly equals the Gaussian Nadaraya–Watson estimator on the prompt (no approximation error), is easy to follow and clearly makes the connection between the styles of methods and regression.
- The task-/prompt-level generalization analysis is interesting and very relevant for the field. The decomposition isolates (a) learning
- The assumptions for the generalization error bound in Section 5 are quite idealised and strong. For example, the assumption that $p_x$ is uniform is unrealistic, but understandable for the work. A discussion of where this assumption might hold and where this framework may break down due to these assumptions would be helpful to add to the Appendix.
- Although the paper's main contributions are the theoretical bounds and connection to kernel methods, the empirical experiments are fairly weak to
- This is an important direction that theoretically studies the representational strengths of transformer-based ICL and relates them to the geometry of the data.
- The approximation bounds provided are tight.
- This opens several avenues for future work, where learning techniques can be studied using similar methods to relate to the structure of the data.
- The proofs are very nicely structured despite being long, which makes them easy to understand.
- Since the main purpose of the work is to provide a theoretical analysis, more emphasis should be given to the results and the proof techniques in the main part of the paper. In particular, a proof/technique overview along with the first part of Section C would make more sense.
- Section 3: Please define the input tokens more precisely, e.g., we are given (n+1) tokens where the first n tokens contain (x_i, y_i) and the last token contains x_{n+1}, we expect the final output to be present in
+ The paper aims to establish a connection between attention scores and kernel estimators (Nadaraya-Watson). This yields a new perspective on Transformers as in-context kernel learners, which conceptually advances and generalizes beyond prior linear-model analyses.
+ The construction in Lemma 1, which shows that a Transformer network exactly realizes kernel regression with zero approximation error, is nontrivial. The explicit architectural specification further concretizes its theoretical claim.
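For readers unfamiliar with the estimator these reviews refer to, the standard Gaussian Nadaraya-Watson form and its rewriting as a softmax-weighted average are sketched below. The symbols $h$ (bandwidth), $K_h$, and $s_i$ are notation assumed here for illustration; the paper's own construction and equation numbering may differ.

```latex
% Prompt pairs (x_i, y_i), i = 1, ..., n; query point x; assumed bandwidth h > 0.
\hat f_h(x)
  = \frac{\sum_{i=1}^{n} K_h(x, x_i)\, y_i}{\sum_{j=1}^{n} K_h(x, x_j)},
\qquad
K_h(x, x') = \exp\!\left(-\frac{\lVert x - x' \rVert^2}{2h^2}\right).

% Setting s_i(x) = -\lVert x - x_i \rVert^2 / (2h^2) gives a softmax-weighted average:
\hat f_h(x)
  = \sum_{i=1}^{n} \frac{\exp\bigl(s_i(x)\bigr)}{\sum_{j=1}^{n} \exp\bigl(s_j(x)\bigr)}\, y_i
  = \sum_{i=1}^{n} \operatorname{softmax}_i\bigl(s_1(x), \dots, s_n(x)\bigr)\, y_i .
```

Read this way, the kernel weights are attention weights whose scores are negative scaled squared distances between the query and each prompt input.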
- The theoretical results rely on exact kernel implementation via attention and perfect manifold sampling. These assumptions may obscure how robust the results remain under approximate or noisy conditions typical in practice.
- Some key proof components, particularly how the softmax attention with masking produces the Gaussian weights in Eqs. (12)-(14), are deferred to appendices. The main text should outline at least the constructive steps more transparently for better readability.
- While t
Taxonomy
Topics: Topological and Geometric Data Analysis · Child and Animal Learning Development · Advanced Graph Neural Networks
Methods: Softmax · Attention Is All You Need
