Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?
Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, Sanjiv Kumar

TL;DR
This paper demonstrates that multi-layer, looped Transformers can learn to implement multi-step gradient descent algorithms for in-context linear regression, showing convergence and data-adaptive preconditioning through theoretical analysis and experiments.
Contribution
It provides the first theoretical analysis showing that looped Transformers can learn and converge to multi-step gradient descent algorithms in linear regression tasks.
Findings
Transformers can implement multi-step preconditioned gradient descent.
Gradient flow converges despite non-convex landscape.
Synthetic experiments validate the theoretical results.
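As a rough illustration of what "multi-step preconditioned gradient descent" means for linear regression, the sketch below runs the update w <- w - A * grad on a least-squares objective. The function name `preconditioned_gd`, the step size 0.1, and the choice of the inverse sample covariance as a data-adapted preconditioner are assumptions made for illustration; they are not taken from the paper, which studies the preconditioner that training itself converges to.

```python
import numpy as np

def preconditioned_gd(X, y, A, num_steps):
    """Run num_steps of w <- w - A @ grad on the loss (1/2n) * ||X w - y||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_steps):
        grad = X.T @ (X @ w - y) / n  # gradient of the least-squares loss
        w = w - A @ grad              # preconditioned update
    return w

# Synthetic, noiseless regression task (illustrative sizes, not from the paper).
rng = np.random.default_rng(0)
n, d = 64, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star

cov = X.T @ X / n
A_identity = 0.1 * np.eye(d)    # plain gradient descent with a fixed step size
A_adapted = np.linalg.inv(cov)  # assumed data-adapted preconditioner

w_plain = preconditioned_gd(X, y, A_identity, num_steps=5)
w_adapted = preconditioned_gd(X, y, A_adapted, num_steps=1)

err_plain = np.linalg.norm(w_plain - w_star)
err_adapted = np.linalg.norm(w_adapted - w_star)
print(f"plain GD error after 5 steps: {err_plain:.3e}")
print(f"adapted preconditioner error after 1 step: {err_adapted:.3e}")
```

With noiseless labels, the inverse-covariance preconditioner recovers the least-squares solution in a single step, while plain gradient descent needs many more steps. This is meant only to convey why a preconditioner adapted to the data distribution matters.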
Abstract
The remarkable capability of Transformers to do reasoning and few-shot learning, without any fine-tuning, is widely conjectured to stem from their ability to implicitly simulate multi-step algorithms -- such as gradient descent -- with their weights in a single forward pass. Recently, there has been progress in understanding this complex phenomenon from an expressivity point of view, by demonstrating that Transformers can express such multi-step algorithms. However, our knowledge about the more fundamental aspect of its learnability, beyond single-layer models, is very limited. In particular, can training Transformers enable convergence to algorithmic solutions? In this work we resolve this for in-context linear regression with linear looped Transformers -- a multi-layer model with weight sharing that is conjectured to have an inductive bias to learn fixed-point iterative algorithms.…
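To make the idea of a linear looped Transformer concrete, the sketch below applies one linear self-attention layer repeatedly with the same weights and checks that its prediction matches multi-step preconditioned gradient descent. It uses a construction common in the in-context learning literature, with the update Z <- Z + (1/n) P Z M (Z^T Q Z); the matrices `P`, `Q`, `M`, the token layout, and the symmetric preconditioner `A` are assumptions for illustration and may differ from the paper's exact parameterization.

```python
import numpy as np

def looped_linear_attention(Z, A, num_loops):
    """Apply the same linear attention layer num_loops times (weight sharing)."""
    d = A.shape[0]
    n = Z.shape[1] - 1
    P = np.zeros((d + 1, d + 1)); P[d, d] = 1.0   # value map: only the label row is updated
    Q = np.zeros((d + 1, d + 1)); Q[:d, :d] = -A  # key-query map: -x_i^T A x_j
    M = np.eye(n + 1); M[n, n] = 0.0              # mask: the query token is not attended to
    for _ in range(num_loops):
        Z = Z + (P @ Z @ M @ (Z.T @ Q @ Z)) / n
    return Z

def gd_prediction(X, y, x_query, A, num_steps):
    """Prediction of preconditioned gradient descent started from w = 0."""
    n = X.shape[0]
    w = np.zeros(X.shape[1])
    for _ in range(num_steps):
        w = w - A @ (X.T @ (X @ w - y)) / n
    return x_query @ w

# Synthetic in-context regression prompt (illustrative sizes, not from the paper).
rng = np.random.default_rng(0)
n, d, loops = 32, 4, 3
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
x_query = rng.normal(size=d)
A = 0.2 * np.eye(d)  # assumed symmetric preconditioner

# Token matrix: columns are (x_i, y_i), plus a final query column (x_query, 0).
Z0 = np.zeros((d + 1, n + 1))
Z0[:d, :n] = X.T
Z0[:d, n] = x_query
Z0[d, :n] = y

Z_out = looped_linear_attention(Z0, A, loops)
pred_transformer = -Z_out[d, n]  # the query's label slot holds minus the prediction
pred_gd = gd_prediction(X, y, x_query, A, loops)
print(pred_transformer, pred_gd)
```

In this construction the label row tracks the residuals y_i - x_i^T w after each loop, so each pass through the shared layer corresponds to one gradient descent step. The equivalence relies on `A` being symmetric.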
Taxonomy
Topics: Experimental Learning in Engineering · Intelligent Tutoring Systems and Adaptive Learning · Learning Styles and Cognitive Differences
Methods: Attention Is All You Need · Linear Layer · Label Smoothing · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Dropout · Layer Normalization · Adam · Byte Pair Encoding
