Optimal In-context Adaptivity and Distributional Robustness of Transformers
Tianyi Ma, Tengyao Wang, Richard J. Samworth

TL;DR
This paper analyzes how pretrained Transformers perform when the test distribution differs from the pretraining distribution, showing with theoretical guarantees that they adapt optimally to task difficulty and remain robust to the shift.
Contribution
It provides a theoretical framework showing that pretrained Transformers achieve optimal convergence rates under distribution shift, with an optimality guarantee that is sharper than a standard minimax lower bound.
Findings
Transformers pretrained on sufficient data adapt to task difficulty levels.
They maintain the optimal convergence rate for test distributions whose chi-squared divergence from the pretraining distribution is bounded.
Even an estimator with access to the test distribution cannot converge faster than the pretrained Transformer.
Abstract
We study in-context learning problems where a Transformer is pretrained on tasks drawn from a mixture distribution, called the pretraining prior, in which each mixture component is a distribution on tasks of a specific difficulty level. Our goal is to understand the performance of the pretrained Transformer when evaluated on a different test distribution, consisting of tasks of a fixed difficulty level, and with potential distribution shift relative to the pretraining prior, subject to the chi-squared divergence being at most a prescribed bound. In particular, we consider nonparametric regression problems with random smoothness, and multi-index models with both random smoothness and random effective dimension. We prove that a large Transformer pretrained on…
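The chi-squared divergence constraint mentioned in the abstract can be made concrete with a small numerical sketch. The example below is illustrative only and is not taken from the paper: it uses the standard definition of chi-squared divergence for discrete distributions, and the names `pretraining_component`, `test_distribution`, and `divergence_budget` are hypothetical labels introduced here.

```python
import math


def chi_squared_divergence(mu, pi):
    """Chi-squared divergence sum_i (mu_i - pi_i)^2 / pi_i between two
    discrete distributions given as equal-length sequences of probabilities.

    Returns math.inf when mu puts mass where pi has none.
    """
    if len(mu) != len(pi):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for m, p in zip(mu, pi):
        if p == 0:
            if m > 0:
                return math.inf  # mu is not absolutely continuous w.r.t. pi
            continue  # both zero: this outcome contributes nothing
        total += (m - p) ** 2 / p
    return total


# Hypothetical toy distributions over four outcomes (not from the paper).
pretraining_component = [0.25, 0.25, 0.25, 0.25]
test_distribution = [0.40, 0.30, 0.20, 0.10]
divergence_budget = 0.5  # assumed bound on the allowed shift

shift = chi_squared_divergence(test_distribution, pretraining_component)
within_budget = shift <= divergence_budget
print(f"chi-squared divergence: {shift:.3f}, within budget: {within_budget}")
```

In this toy setting, a test distribution counts as an admissible shift when its divergence from the pretraining component does not exceed the assumed budget.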
