Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression
Mingsong Yan, Dongyang Li, Charles Kulick, Sui Tang

TL;DR
This paper shows that a standard softmax-attention transformer can implement preconditioned Richardson iteration to solve Gaussian kernel ridge regression in context, with a provable prediction-error guarantee, giving a mechanistic account of in-context learning.
Contribution
The paper gives a theoretical construction showing that softmax-attention transformers can run a convergent iterative solver for Gaussian kernel ridge regression, and supports the construction with empirical validation.
Findings
Transformers can approximate the kernel ridge regression predictor during the forward pass.
Layer-wise error profiles align with preconditioned Richardson iteration.
Empirical results support the theoretical interpretation of the transformer mechanism.
Abstract
Mechanistic accounts of in-context learning (ICL) have identified iterative algorithms for linear regression and related linear prediction tasks, often using linear or ReLU attention variants. For nonlinear ICL, prior work has related softmax and kernelized attention to functional-gradient-type dynamics, but it remains unclear whether a standard transformer with softmax attention can implement a convergent solver with an end-to-end prediction-error guarantee. In this paper, we study in-context kernel ridge regression (KRR) with Gaussian kernels and show that a standard softmax-attention transformer can approximate the KRR predictor during its forward pass by implementing preconditioned Richardson iteration on the associated kernel linear system. Under bounded-data assumptions, we construct a single-head transformer with blocks and MLP width …
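To make the iteration named in the abstract concrete, here is a minimal NumPy sketch of preconditioned Richardson iteration applied to the Gaussian KRR linear system. It is an illustration under stated assumptions, not the paper's transformer construction: the system is taken to be (K + ridge * I) alpha = y, the preconditioner is a diagonal inverse-row-sum matrix chosen here only because it makes convergence easy to verify, and the function names, bandwidth, ridge value, and toy data are all hypothetical.

```python
import numpy as np


def gaussian_kernel(X, Z, bandwidth):
    """Gaussian kernel matrix with entries exp(-||x - z||^2 / (2 * bandwidth^2))."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def richardson_krr(X, y, ridge, bandwidth, num_steps):
    """Approximate the solution of (K + ridge * I) alpha = y by preconditioned Richardson.

    Assumption: the preconditioner is diag(1 / row sums of A). Since A is
    symmetric positive definite with nonnegative entries, the eigenvalues of
    the preconditioned matrix lie in (0, 1], so the iteration converges.
    """
    A = gaussian_kernel(X, X, bandwidth) + ridge * np.eye(len(X))
    precond = 1.0 / A.sum(axis=1)  # diagonal preconditioner stored as a vector
    alpha = np.zeros_like(y)
    for _ in range(num_steps):
        residual = y - A @ alpha
        alpha = alpha + precond * residual  # one Richardson update
    return alpha


def predict(X_train, alpha, X_query, bandwidth):
    """KRR prediction f(x) = sum_i alpha_i * k(x, x_i)."""
    return gaussian_kernel(X_query, X_train, bandwidth) @ alpha


# Toy in-context problem (hypothetical data): 8 bounded 1-D inputs.
X = np.linspace(-1.0, 1.0, 8).reshape(-1, 1)
y = np.sin(3.0 * X[:, 0])
ridge, bandwidth = 0.5, 0.3

alpha_iter = richardson_krr(X, y, ridge, bandwidth, num_steps=200)
alpha_exact = np.linalg.solve(
    gaussian_kernel(X, X, bandwidth) + ridge * np.eye(len(X)), y
)

X_query = np.array([[0.25]])
pred_iter = predict(X, alpha_iter, X_query, bandwidth)
pred_exact = predict(X, alpha_exact, X_query, bandwidth)
```

Each pass through the loop plays the role of one solver step, which is the quantity the layer-wise error profiles in the findings are compared against. The direct solve with `np.linalg.solve` is included only as a reference for checking the iterate.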
