TL;DR
This paper provides a theoretical framework showing that continuum transformers perform in-context operator learning via gradient descent in an operator RKHS, extending understanding of their capabilities to infinite-dimensional function inputs.
Contribution
It introduces a novel theoretical analysis of continuum transformers, showing that their in-context learning is gradient descent in an operator RKHS and establishing their optimality in the infinite-depth limit.
Findings
Continuum transformers perform in-context operator learning via gradient descent in an operator RKHS.
The learned operator is the Bayes-optimal predictor in the infinite-depth limit.
Empirical results validate the theoretical optimality and parameter recovery.
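The first two findings can be illustrated with a much simpler analog. The sketch below is not the paper's construction: it assumes a scalar-valued RBF kernel and finite-dimensional inputs instead of an operator-valued kernel on function inputs, and the names `rbf_kernel` and `functional_gd_predictions` are hypothetical. It runs functional gradient descent on the squared loss over the context pairs, treating each step as one "layer", and compares the query predictions against the noise-free Gaussian-process posterior mean for the same kernel, which plays the role of the BLUP in this toy setting.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=0.5):
    """Scalar RBF kernel matrix between 1-D point sets a and b (illustrative choice)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)


def functional_gd_predictions(x_ctx, y_ctx, x_query, n_steps, step_size=None):
    """Predict at x_query after n_steps of functional gradient descent in the RKHS.

    The iterate is f = sum_i alpha_i k(x_i, .). For the loss
    0.5 * sum_i (y_i - f(x_i))^2, the RKHS gradient is
    -sum_i (y_i - f(x_i)) k(x_i, .), so one step adds step_size * residual to alpha.
    """
    K = rbf_kernel(x_ctx, x_ctx)
    k_query = rbf_kernel(x_query, x_ctx)
    if step_size is None:
        # Assumption: 1 / lambda_max(K) keeps the iteration stable.
        step_size = 1.0 / np.linalg.eigvalsh(K)[-1]
    alpha = np.zeros_like(y_ctx)
    for _ in range(n_steps):  # one step per "layer" in this analogy
        residual = y_ctx - K @ alpha
        alpha = alpha + step_size * residual
    return k_query @ alpha


# Toy in-context task: four context pairs, three query points.
x_ctx = np.array([0.0, 1.0, 2.0, 3.0])
y_ctx = np.sin(x_ctx)
x_query = np.array([0.5, 1.5, 2.5])

# Noise-free GP posterior mean with the same kernel (the BLUP in this toy setting).
blup = rbf_kernel(x_query, x_ctx) @ np.linalg.solve(rbf_kernel(x_ctx, x_ctx), y_ctx)

for depth in (1, 5, 50):
    gap = np.max(np.abs(functional_gd_predictions(x_ctx, y_ctx, x_query, depth) - blup))
    print(f"depth={depth:3d}  max |prediction - BLUP| = {gap:.3e}")
```

In this toy setup the gap to the BLUP shrinks as the number of steps grows, which mirrors the infinite-depth statement only qualitatively. The well-separated context points keep the kernel matrix well conditioned; with closely spaced points convergence can be far slower.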
Abstract
Transformers robustly exhibit the ability to perform in-context learning, whereby their predictive accuracy on a task can increase not by parameter updates but merely with the placement of training samples in their context windows. Recent works have shown that transformers achieve this by implementing gradient descent in their forward passes. Such results, however, are restricted to standard transformer architectures, which handle finite-dimensional inputs. In the space of PDE surrogate modeling, a generalization of transformers to handle infinite-dimensional function inputs, known as "continuum transformers," has been proposed and similarly observed to exhibit in-context learning. Despite impressive empirical performance, such in-context learning has yet to be theoretically characterized. We herein demonstrate that continuum transformers perform in-context operator learning by…
Peer Reviews
Decision: ICLR 2026 Poster
- The paper is novel and investigates an important question, probing the fundamentals of continuum Transformers for in-context operator learning. This is a problem that may be of interest to folks in the in-context learning theory and operator learning/PDE foundation model communities.
- The paper is well-written and clear. The theoretical results are interesting and well done, and the authors validate their theoretical results empirically.
As far as I understand, the work does not analyze how many functional gradient descent steps (that is, how many continuum Transformer layers) are needed to achieve convergence on the in-context operator learning problem. In my opinion, the work as it stands already provides enough results to recommend acceptance, but I would be curious to know if such results are possible (see questions).
Minor:
- The authors could more clearly emphasize the significance and limitation of the theoret
- The paper presents a non-trivial extension of in-context learning to an operator-valued, continuous setting. The math is elegant, has technical substance, and can be used for further analysis.
- Considering operator-valued kernels for this problem, together with their representer theorem, is elegant and makes defining each layer as one gradient descent step clean. The analysis for the fixed-point pretraining and the connection to BLUP is nice despite the strong assumptions.
- The attention, the values a
- The Bayes optimality/BLUP limit requires a $\kappa$-Gaussian data distribution and an exact kernel match. In practice, the input functions come from non-Gaussian distributions and the kernels might be misspecified.
- Rotational symmetry and equivariance are strong assumptions. Correct me if I am wrong, but I believe that a global invariance in whitened coordinates might not hold due to anisotropies or boundary conditions.
- All the proofs are performed in the function space
- Originality: Extends the “transformer = gradient descent” theory from vector spaces to operator-valued RKHSs, a nontrivial and conceptually elegant generalization.
- The mathematical treatment is rigorous and self-contained, including a generalized Representer Theorem for operator-valued kernels.
- The results provide a clear mechanistic interpretation of in-context learning for continuum transformers.
- Toy experiments (BLUP matching and operator-alignment studies) convincingly illustrate
- Lack of practical tokenization discussion. It remains unclear how tokenization is defined in PDE-learning settings. While a “next-token prediction” formulation is appropriate for fixed time-stepping tasks (e.g., 6-hour weather forecasts), such an in-context learning paradigm does not naturally extend to continuous-in-time scenarios where predictions are required at arbitrary temporal resolutions. Moreover, most PDE and computer-vision problems require some form of spatial tokenization to capt
