Bypassing the Exponential Dependency: Looped Transformers Efficiently Learn In-context by Multi-step Gradient Descent
Bo Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song

TL;DR
This paper shows that linear looped Transformers can efficiently implement multi-step gradient descent for in-context learning on linear tasks using only O(d) in-context examples, whereas earlier analyses required a number of examples exponential in the number of loops.
Contribution
The paper shows that linear looped Transformers can implement multi-step gradient descent efficiently with a linear number of in-context examples, which improves the theoretical understanding of their in-context learning capabilities.
Findings
Linear looped Transformers achieve small error with O(d) in-context examples.
Theoretical analysis confirms efficient multi-step gradient descent implementation.
Preliminary experiments support the theoretical results.
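
The idea of running several gradient descent steps on the in-context examples can be illustrated with a minimal sketch. This is plain NumPy gradient descent on a least-squares problem, not a Transformer and not the authors' construction. The function names (multi_step_gd, demo), the Gaussian data, the noiseless linear target, and the 1/L step size are all assumptions chosen for illustration, and d is assumed to be the input dimension.

```python
import numpy as np

def multi_step_gd(X, y, num_steps, lr=None):
    """Run num_steps of gradient descent on (1/2n) * ||X w - y||^2 from w = 0."""
    n, d = X.shape
    if lr is None:
        # Assumed step size: 1 / (largest eigenvalue of X^T X / n).
        lr = 1.0 / np.linalg.eigvalsh(X.T @ X / n)[-1]
    w = np.zeros(d)
    for _ in range(num_steps):
        grad = X.T @ (X @ w - y) / n  # gradient of the least-squares loss
        w = w - lr * grad
    return w

def demo(d=8, n_per_dim=8, num_steps=50, seed=0):
    """Compare the error after 1 step and after num_steps, using n = n_per_dim * d examples."""
    rng = np.random.default_rng(seed)
    n = n_per_dim * d                 # number of examples grows linearly with d
    X = rng.standard_normal((n, d))   # hypothetical in-context inputs
    w_true = rng.standard_normal(d)   # hypothetical ground-truth weights
    y = X @ w_true                    # noiseless linear labels
    errors = []
    for t in (1, num_steps):
        w = multi_step_gd(X, y, t)
        errors.append(float(np.linalg.norm(w - w_true)))
    return errors

if __name__ == "__main__":
    one_step, many_steps = demo()
    print(f"error after 1 step: {one_step:.4f}")
    print(f"error after 50 steps: {many_steps:.6f}")
```

In this toy setting the multi-step error is much smaller than the single-step error even though the number of examples is only a constant multiple of d. It says nothing by itself about what a trained looped Transformer does.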
Abstract
In-context learning has been recognized as a key factor in the success of Large Language Models (LLMs). It refers to the model's ability to learn patterns on the fly from provided in-context examples in the prompt during inference. Previous studies have demonstrated that the Transformer architecture used in LLMs can implement a single-step gradient descent update by processing in-context examples in a single forward pass. Recent work has further shown that, during in-context learning, a looped Transformer can implement multi-step gradient descent updates in forward passes. However, their theoretical results require a number of in-context examples that is exponential in the number of loops (or passes) to achieve a reasonably low error. In this paper, we study in-context learning with linear looped Transformers on linear vector generation tasks. We show that linear…
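
For reference, the gradient descent update discussed in the abstract can be written out for a standard least-squares objective. This is a textbook derivation under assumed notation (inputs X with n rows, labels y = X w^*, step size \eta, T steps, H = X^\top X / n positive definite), not the paper's own statement or proof.

```latex
% Least-squares loss over n in-context examples (assumed notation)
L(w) = \frac{1}{2n}\,\lVert Xw - y \rVert_2^2

% One gradient descent step; a single-step method stops at t = 1
w_{t+1} = w_t - \eta\,\nabla L(w_t)
        = w_t - \frac{\eta}{n}\,X^\top (X w_t - y)

% With y = X w^* and H = X^\top X / n, the error after T steps is
w_T - w^* = (I - \eta H)^T\,(w_0 - w^*)

% Choosing \eta = 1/\lambda_{\max}(H) and \kappa = \lambda_{\max}(H)/\lambda_{\min}(H):
\lVert w_T - w^* \rVert_2 \le \Bigl(1 - \frac{1}{\kappa}\Bigr)^{T}\,\lVert w_0 - w^* \rVert_2
```

Under these assumptions the error shrinks geometrically in the number of steps T, at a rate set by the condition number \kappa of the data.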
Taxonomy
Topics: Neural Networks and Applications · Generative Adversarial Networks and Image Synthesis
Methods: Attention Is All You Need · Dense Connections · Residual Connection · Dropout · Layer Normalization · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax · Linear Layer
