TL;DR
This paper analyzes in-context learning (ICL) theoretically in pretrained Transformers with nonlinear MLP heads, showing how data mixing and source quality influence performance. It validates the findings with experiments, including multilingual sentiment analysis.
Contribution
The paper proves a high-dimensional asymptotic equivalence, in terms of ICL error, between Transformers with MLP heads and structured polynomial predictors. This extends the theoretical understanding of ICL beyond simplified models and allows the impact of data source properties to be analyzed.
Findings
Nonlinear MLPs significantly improve ICL on nonlinear tasks.
Data quality and structure critically affect feature learning and ICL performance.
Empirical validation across various models and real-world multilingual data confirms theoretical insights.
Abstract
Pretrained Transformers demonstrate remarkable in-context learning (ICL) capabilities, enabling them to adapt to new tasks from demonstrations without parameter updates. However, theoretical studies often rely on simplified architectures (e.g., omitting MLPs), plain data models (e.g., linear regression with isotropic inputs), and single-source training, limiting their relevance to realistic settings. In this work, we study ICL in pretrained Transformers with nonlinear MLP heads on nonlinear tasks drawn from multiple data sources with heterogeneous input, task, and noise distributions. We analyze a model where the MLP comprises two layers, with the first layer trained via a single gradient step and the second layer fully optimized. Under high-dimensional asymptotics, we prove that such models are equivalent in ICL error to structured polynomial predictors, leveraging results from the…
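The training scheme the abstract describes for the MLP (first layer updated by a single gradient step, second layer fully optimized) can be illustrated with a minimal NumPy sketch. This is not the paper's model: it omits attention and the ICL prompt structure entirely and fits a plain regression problem. The `tanh` activation, the squared loss, the ridge penalty used to "fully optimize" the second layer, and all names and default values (`train_two_layer`, `width`, `lr`, `ridge`) are assumptions made for illustration.

```python
import numpy as np

def train_two_layer(X, y, width=256, lr=1.0, ridge=1e-3, seed=0):
    """Illustrative two-layer MLP f(x) = a . tanh(W x).

    Assumed scheme: one gradient step on W from random init,
    then a closed-form ridge solve for the second layer a.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Random initialization of both layers (hypothetical scaling).
    W0 = rng.standard_normal((width, d)) / np.sqrt(d)
    a0 = rng.choice([-1.0, 1.0], size=width) / np.sqrt(width)

    # Forward pass at initialization.
    h = np.tanh(X @ W0.T)          # hidden features, shape (n, width)
    resid = h @ a0 - y             # residuals, shape (n,)

    # Gradient of (1 / 2n) * sum(resid^2) with respect to W.
    grad = (resid[:, None] * (1.0 - h**2) * a0[None, :]).T @ X / n

    # Single gradient step on the first layer.
    W1 = W0 - lr * grad

    # Second layer: ridge regression on the updated features.
    H = np.tanh(X @ W1.T)
    a = np.linalg.solve(H.T @ H / n + ridge * np.eye(width), H.T @ y / n)
    return W1, a

def predict(X, W, a):
    """Evaluate the trained two-layer network on inputs X."""
    return np.tanh(X @ W.T) @ a
```

Calling `train_two_layer(X, y)` with `X` of shape `(n, d)` and `y` of shape `(n,)` returns the updated first-layer weights and the fitted second-layer vector. The closed-form solve is one way to read "fully optimized"; the paper may use a different objective or regularization.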