How Do Nonlinear Transformers Learn and Generalize in In-Context Learning?
Hongkang Li, Meng Wang, Songtao Lu, Xiaodong Cui, Pin-Yu Chen

TL;DR
This paper provides the first theoretical analysis of how nonlinear Transformers learn and generalize in in-context learning, examining training dynamics, generalization capacity, component contributions, and pruning effects.
Contribution
It offers novel theoretical insights into the training and generalization mechanisms of nonlinear Transformers in in-context learning, including effects of pruning.
Findings
Transformers trained on a subset of binary classification tasks can generalize to unseen tasks through ICL.
Proper magnitude-based pruning has minimal impact on ICL performance while reducing inference costs.
Theoretical analysis is supported by numerical experiments.
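To make the pruning finding concrete, here is a minimal sketch of one common criterion, magnitude-based pruning of neurons. It is an illustration under stated assumptions, not the paper's procedure: the function name `prune_neurons_by_magnitude`, the row-per-neuron layout, the `prune_ratio` parameter, and the toy weights are all hypothetical.

```python
import math


def prune_neurons_by_magnitude(weights, prune_ratio):
    """Zero out the rows (neurons) with the smallest L2 norm.

    weights: list of rows, one row of floats per neuron (assumed layout).
    prune_ratio: fraction of neurons to remove, in [0, 1].
    Returns (pruned_weights, kept_indices).
    """
    if not 0.0 <= prune_ratio <= 1.0:
        raise ValueError("prune_ratio must be in [0, 1]")

    # Magnitude of each neuron = L2 norm of its weight row.
    norms = [math.sqrt(sum(w * w for w in row)) for row in weights]
    num_pruned = int(len(weights) * prune_ratio)

    # Neuron indices sorted by ascending norm; the first num_pruned are removed.
    order = sorted(range(len(weights)), key=lambda i: norms[i])
    pruned = set(order[:num_pruned])

    pruned_weights = [
        [0.0] * len(row) if i in pruned else list(row)
        for i, row in enumerate(weights)
    ]
    kept_indices = [i for i in range(len(weights)) if i not in pruned]
    return pruned_weights, kept_indices


# Toy example: two neurons have near-zero weights and are removed.
toy_weights = [[0.9, -1.2], [0.01, 0.02], [-0.7, 0.5], [0.0, -0.03]]
pruned_weights, kept = prune_neurons_by_magnitude(toy_weights, prune_ratio=0.5)
```

In this toy case the two small-norm rows are zeroed and the large-norm rows are left unchanged, which is the intuition behind pruning that costs little accuracy: the removed neurons contributed little to the output to begin with.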
Abstract
Transformer-based large language models have displayed impressive in-context learning (ICL) capabilities, where a pre-trained model can handle new tasks without fine-tuning by simply augmenting the query with some input-output examples from that task. Despite the empirical success, the mechanics of how to train a Transformer to achieve ICL and the corresponding ICL capacity are mostly elusive due to the technical challenges of analyzing the nonconvex training problems resulting from the nonlinear self-attention and nonlinear activation in Transformers. To the best of our knowledge, this paper provides the first theoretical analysis of the training dynamics of Transformers with nonlinear self-attention and nonlinear MLP, together with the ICL generalization capability of the resulting model. Focusing on a group of binary classification tasks, we train Transformers using data from a subset of…
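The idea of augmenting a query with input-output examples can be sketched for a binary classification task as follows. This is an assumed prompt layout for illustration only: the function name `build_icl_prompt`, the labels in {+1, -1}, and the zero placeholder in the query's label slot are choices made here, not details taken from the text above.

```python
def build_icl_prompt(examples, query):
    """Stack labeled examples and an unlabeled query into one prompt.

    examples: list of (x, y) pairs, x a list of floats, y in {+1, -1}.
    query: feature vector whose label the model should predict.
    Returns a list of columns; each column is x followed by its label,
    and the final column is the query with label slot 0.0 (assumed layout).
    """
    dim = len(query)
    columns = []
    for x, y in examples:
        if len(x) != dim:
            raise ValueError("every example must match the query dimension")
        if y not in (1, -1):
            raise ValueError("labels must be +1 or -1")
        columns.append(list(x) + [float(y)])

    # The query goes last; its label is unknown, so the slot holds 0.0.
    columns.append(list(query) + [0.0])
    return columns


# Two demonstrations from one task, followed by the query.
prompt = build_icl_prompt(
    examples=[([1.0, 0.0], 1), ([0.0, 1.0], -1)],
    query=[1.0, 0.1],
)
```

A model doing ICL receives the whole prompt at inference time and predicts the query's label from the demonstrations alone, with no weight updates.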
Taxonomy
Topics: Neural Networks and Applications
Methods: Linear Layer · Byte Pair Encoding · Dropout · Pruning · Dense Connections · Label Smoothing · Adam · Attention Is All You Need · Softmax · Layer Normalization
