Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent
Chenyang Zhang, Yuan Cao

TL;DR
This paper demonstrates how transformers can perform in-context logistic regression by executing normalized gradient descent steps, providing theoretical insights into their in-context learning capabilities.
Contribution
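It constructs a class of multi-layer softmax-attention transformers that perform in-context logistic regression, each layer carrying out one step of normalized gradient descent on an in-context loss, and provides training and generalization guarantees.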
Findings
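Multi-layer transformers with softmax attention can perform in-context logistic regression, with each layer exactly implementing one step of normalized gradient descent.
Training a single self-attention layer, supervised by one-step gradient descent, and applying it recurrently as a looped model recovers the constructed transformer.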
The looped model generalizes out-of-distribution.
Abstract
Transformers have demonstrated remarkable in-context learning (ICL) capabilities. The strong ICL performance of transformers is commonly believed to arise from their ability to implicitly execute certain algorithms on the context, thereby enhancing prediction and generation. In this work, we investigate how transformers with softmax attention perform in-context learning on linear classification data. We first construct a class of multi-layer transformers that can perform in-context logistic regression, with each layer exactly performing one step of normalized gradient descent on an in-context loss. Then, we show that our constructed transformer can be obtained through (i) training a single self-attention layer supervised by one-step gradient descent, and (ii) recurrently applying the trained layer to obtain a looped model. Training convergence guarantees of the self-attention layer and…
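To make the algorithm named in the abstract concrete, here is a minimal NumPy sketch of normalized gradient descent on a logistic loss over a set of context examples. It is not the paper's transformer construction. The function names (`logistic_loss`, `logistic_grad`, `ngd_step`, `looped_ngd`), the step size, the zero initialization, the synthetic data, and the choice to normalize by the Euclidean norm of the gradient are all assumptions made for illustration, since the excerpt does not specify them. Repeating the same `ngd_step` is meant only as a loose analogy to recurrently applying one trained layer.

```python
import numpy as np

def logistic_loss(w, X, y):
    """Mean logistic loss over context examples, labels y in {-1, +1}."""
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))  # stable log(1 + exp(-m))

def logistic_grad(w, X, y):
    """Gradient of the mean logistic loss with respect to w."""
    margins = y * (X @ w)
    # sigmoid(-m) = 1 / (1 + exp(m)), written with tanh for numerical stability
    sig_neg = 0.5 * (1.0 - np.tanh(margins / 2.0))
    return (X * (-y * sig_neg)[:, None]).mean(axis=0)

def ngd_step(w, X, y, lr=0.5, eps=1e-12):
    """One normalized GD step (assumed form): move a fixed length lr along -grad."""
    g = logistic_grad(w, X, y)
    return w - lr * g / (np.linalg.norm(g) + eps)

def looped_ngd(X, y, steps=10, lr=0.5):
    """Apply the same step repeatedly, loosely mirroring a looped model."""
    w = np.zeros(X.shape[1])  # assumed initialization
    for _ in range(steps):
        w = ngd_step(w, X, y, lr=lr)
    return w

# Hypothetical in-context task: linearly separable labels from a hidden w_star.
rng = np.random.default_rng(0)
d, n = 5, 40
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = np.sign(X @ w_star)
x_query = rng.standard_normal(d)

w_hat = looped_ngd(X, y)
prediction = np.sign(x_query @ w_hat)  # predicted label for the query point
```

Because each update has fixed length `lr` regardless of the gradient's magnitude, the iterate keeps moving even when the raw gradient becomes small, which is the usual motivation for normalizing on separable classification data.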
