On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability
Chenyu Zheng, Wei Huang, Rongzhen Wang, Guoqiang Wu, Jun Zhu, Chongxuan Li

TL;DR
This paper investigates how autoregressively trained transformers may learn mesa-optimizers, analyzes the conditions under which they effectively implement gradient descent for linear models, and explores their capabilities and limitations.
Contribution
The paper provides a theoretical analysis of the non-convex training dynamics of autoregressive transformers, establishing conditions under which a mesa-optimizer emerges and under which it can recover the data distribution.
Findings
Under certain conditions on the data distribution, transformers learn a linear operator via gradient descent.
The learned mesa-optimizer can recover the data distribution under specific moment conditions.
Simulations confirm the theoretical predictions about mesa-optimizer behavior.
Abstract
Autoregressively trained transformers have brought a profound revolution to the world, especially with their in-context learning (ICL) ability to address downstream tasks. Recently, several studies have suggested that transformers learn a mesa-optimizer during autoregressive (AR) pretraining to implement ICL. Namely, the forward pass of the trained transformer is equivalent to optimizing an inner objective function in-context. However, whether the practical non-convex training dynamics will converge to the ideal mesa-optimizer is still unclear. Towards filling this gap, we investigate the non-convex dynamics of a one-layer linear causal self-attention model autoregressively trained by gradient flow, where the sequences are generated by an AR process. First, under a certain condition on the data distribution, we prove that an autoregressively trained transformer learns by…
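The claim that a forward pass can be "equivalent to optimizing an inner objective function in-context" can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: a noiseless first-order linear AR process x_{t+1} = W x_t, a least-squares inner objective, a single gradient step from a zero initialization, and a hypothetical step size `eta`. Under those assumptions, one gradient step on the in-context pairs gives the same next-token prediction as an unnormalized linear causal attention readout.

```python
# Illustrative sketch only: the AR form, the least-squares inner objective,
# the zero initialization, and eta are assumptions, not the paper's setup.

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ar_sequence(W, x1, length):
    """Generate x_1, ..., x_length with x_{t+1} = W x_t (assumed AR form)."""
    xs = [x1]
    for _ in range(length - 1):
        xs.append(matvec(W, xs[-1]))
    return xs

def gd_step_prediction(xs, eta):
    """Take one gradient step from W_hat = 0 on
    0.5 * sum_i ||x_{i+1} - W_hat x_i||^2, then predict W_hat x_t."""
    d = len(xs[0])
    W_hat = [[0.0] * d for _ in range(d)]
    for x_prev, x_next in zip(xs[:-1], xs[1:]):
        # The gradient at zero is -sum_i x_{i+1} x_i^T, so the step adds
        # eta * x_{i+1} x_i^T for every in-context pair.
        for r in range(d):
            for c in range(d):
                W_hat[r][c] += eta * x_next[r] * x_prev[c]
    return matvec(W_hat, xs[-1])

def linear_attention_prediction(xs, eta):
    """Unnormalized linear causal attention: query x_t, keys x_i,
    values x_{i+1}, restricted to pairs already seen in context."""
    query = xs[-1]
    out = [0.0] * len(query)
    for x_prev, x_next in zip(xs[:-1], xs[1:]):
        score = dot(x_prev, query)
        for r in range(len(out)):
            out[r] += eta * score * x_next[r]
    return out

# Hypothetical example values: a 2-D rotation keeps the sequence bounded.
W_true = [[0.0, -1.0], [1.0, 0.0]]
sequence = ar_sequence(W_true, [1.0, 0.5], length=6)
gd_pred = gd_step_prediction(sequence, eta=0.1)
attn_pred = linear_attention_prediction(sequence, eta=0.1)
print(gd_pred, attn_pred)
```

The two functions compute the same quantity, eta * sum_i x_{i+1} (x_i^T x_t), in different orders. The sketch shows only this algebraic equivalence. It says nothing about whether training dynamics converge to such a solution, which is the question the paper studies.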
Taxonomy
Topics: Neural Networks and Applications
