Transformers Meet In-Context Learning: A Universal Approximation Theory
Gen Li, Yuchen Jiao, Yu Huang, Yuting Wei, Yuxin Chen

TL;DR
This paper presents a universal approximation theory for transformers, explaining how they enable in-context learning for a broad class of functions without weight updates, extending beyond convex optimization problems.
Contribution
It introduces a theoretical framework that combines Barron's universal function approximation theory with the algorithm-approximator viewpoint to explain transformers' in-context learning capabilities.
Findings
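For a general class of functions, a transformer can be constructed that predicts from a few noisy in-context examples with vanishingly small risk and without weight updates.
A transformer can be constructed to find a linear representation in a manner similar to the Lasso.
The guarantees are not limited to convex problems such as linear regression.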
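To make the Lasso finding concrete, the following is a minimal sketch of predicting from in-context examples by fitting a sparse linear model over a fixed feature map. It is an illustration under stated assumptions, not the paper's transformer construction: the random cosine feature map, the ISTA solver, and all function names (`random_features`, `lasso_ista`, `in_context_predict`) are hypothetical choices made here for exposition.

```python
import numpy as np

def soft_threshold(z, tau):
    # Proximal operator of tau * ||.||_1 (elementwise shrinkage toward zero).
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(Phi, y, lam, n_iters=500):
    # Minimizes (1/(2n)) * ||Phi w - y||^2 + lam * ||w||_1 by proximal
    # gradient descent (ISTA) with the constant step size 1/L.
    n, p = Phi.shape
    L = np.linalg.norm(Phi, 2) ** 2 / n  # Lipschitz constant of the smooth part
    w = np.zeros(p)
    for _ in range(n_iters):
        grad = Phi.T @ (Phi @ w - y) / n
        w = soft_threshold(w - grad / L, lam / L)
    return w

def random_features(X, W, b):
    # Assumption: a fixed cosine feature map shared by all tasks. It stands in
    # for whatever task-independent features a real construction would use.
    return np.cos(X @ W + b)

def in_context_predict(X_ctx, y_ctx, x_query, W, b, lam=0.01):
    # Fit a sparse linear model on the in-context examples only, then
    # evaluate it at the query point. W and b are never updated.
    Phi = random_features(X_ctx, W, b)
    w = lasso_ista(Phi, y_ctx, lam)
    return random_features(x_query[None, :], W, b) @ w
```

In this sketch the feature parameters `W` and `b` play the role of frozen weights: each new task changes only the context pairs `(X_ctx, y_ctx)`, and the task-specific coefficients are recomputed from the context at prediction time.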
Abstract
Large language models are capable of in-context learning, the ability to perform new tasks at test time using a handful of input-output examples, without parameter updates. We develop a universal approximation theory to elucidate how transformers enable in-context learning. For a general class of functions (each representing a distinct task), we demonstrate how to construct a transformer that, without any further weight updates, can predict based on a few noisy in-context examples with vanishingly small risk. Unlike prior work that frames transformers as approximators of optimization algorithms (e.g., gradient descent) for statistical learning tasks, we integrate Barron's universal function approximation theory with the algorithm approximator viewpoint. Our approach yields approximation guarantees that are not constrained by the effectiveness of the optimization algorithms being…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Machine Learning and Algorithms
