Out-of-Distribution Generalization of In-Context Learning: A Low-Dimensional Subspace Perspective
Soo Min Kwon, Alec S. Xu, Can Yaras, Laura Balzano, Qing Qu

TL;DR
This paper provides a theoretical framework for understanding when in-context learning (ICL) can generalize out-of-distribution in transformers, highlighting the role of task distribution structure and subspace geometry.
Contribution
The paper introduces a minimal mathematical model that characterizes when ICL generalizes OOD, states the conditions in terms of how the pre-training task subspaces are arranged, and extends the findings to nonlinear models.
Findings
Transformers can generalize across all subspace angle shifts if pre-training task vectors are drawn from a union of subspaces.
When task vectors are drawn from a single Gaussian distribution, ICL's OOD generalization is limited, and how limited depends on the angle between the training and test subspaces.
Experiments on GPT-2 and on nonlinear models support the theoretical predictions.
Abstract
The transformer's remarkable ability to perform in-context learning (ICL) has sparked a wide range of studies designed to understand its strengths and limitations. However, a theoretical understanding of when ICL can and cannot generalize beyond its pre-training data is still lacking. This paper puts forth a minimal mathematical model that provably identifies when ICL can generalize out-of-distribution (OOD). By studying linear regression tasks parameterized with low-rank covariance matrices, we model distribution shifts as varying angles between subspaces and derive conditions under which a single-layer linear attention model interpolates across all angles. We show that if pre-training task vectors are drawn from a union of subspaces, transformers can generalize to all angle shifts, enabling ICL even in regions with zero probability mass in the training distribution. On the other…
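The data model described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' code. It assumes task vectors of the form w = Uz with z standard Gaussian (so the task covariance U Uᵀ has rank r), standard Gaussian inputs, and noiseless labels. The function names (`subspace_basis`, `sample_task_vectors`, `principal_angles`, `make_prompt`) and the dimensions are hypothetical choices made for illustration. The sketch covers only the task distribution and the angle-based shift, not the linear attention model.

```python
import numpy as np

def subspace_basis(d, r, theta):
    """Orthonormal d x r basis of a subspace at angle theta from span(e_1..e_r).

    Illustrative construction (assumes d >= 2r): each e_i is rotated
    toward e_{r+i} by theta, so every principal angle equals theta.
    """
    U = np.zeros((d, r))
    for i in range(r):
        U[i, i] = np.cos(theta)
        U[r + i, i] = np.sin(theta)
    return U

def sample_task_vectors(U, n, rng):
    """Draw n task vectors w = U z with z ~ N(0, I_r); covariance U U^T has rank r."""
    z = rng.standard_normal((n, U.shape[1]))
    return z @ U.T

def principal_angles(U, V):
    """Principal angles between the column spaces of orthonormal U and V."""
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

def make_prompt(w, n_ctx, rng):
    """In-context examples (x_i, y_i) with y_i = <w, x_i> (noiseless, an assumption)."""
    X = rng.standard_normal((n_ctx, w.shape[0]))
    return X, X @ w

rng = np.random.default_rng(0)
d, r = 8, 2                                   # ambient dimension and subspace rank
U_train = subspace_basis(d, r, 0.0)           # pre-training subspace
U_test = subspace_basis(d, r, np.pi / 4)      # test subspace, shifted by 45 degrees

W_train = sample_task_vectors(U_train, 100, rng)
W_test = sample_task_vectors(U_test, 100, rng)
angles = principal_angles(U_train, U_test)    # measures the distribution shift
X, y = make_prompt(W_test[0], 16, rng)        # one OOD in-context prompt
```

Sweeping `theta` from 0 to π/2 moves the test tasks from the pre-training subspace to one orthogonal to it, which is the sense in which the abstract treats the angle as the measure of distribution shift.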
