Bypassing the Exponential Dependency: Looped Transformers Efficiently Learn In-context by Multi-step Gradient Descent

Bo Chen; Xiaoyu Li; Yingyu Liang; Zhenmei Shi; Zhao Song

arXiv:2410.11268 · cs.LG · March 4, 2025 · 2 citations

Open Access

TL;DR

This paper shows that linear looped Transformers can efficiently implement multi-step gradient descent for in-context learning on linear tasks. They need only a linear number of in-context examples, $n = O(d)$, whereas earlier theoretical results required a number of examples exponential in the number of loops.

Contribution

The paper shows that linear looped Transformers can implement multi-step gradient descent efficiently with a linear number of in-context examples, which improves the theoretical understanding of their in-context learning capabilities.
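
For orientation, the multi-step gradient descent referred to here can be written out under a standard least-squares assumption. The symbols below ($x_i$, $y_i$, $w_t$, $\eta$) are illustrative choices rather than the paper's notation; $n$ is the number of in-context examples and $T$ the number of loops. Starting from $w_0 = 0$, one step per loop would read:

$$
w_{t+1} = w_t - \frac{\eta}{n} \sum_{i=1}^{n} \left( \langle w_t, x_i \rangle - y_i \right) x_i, \qquad t = 0, 1, \dots, T-1.
$$

Under this reading, a looped Transformer that implements $T$ gradient steps corresponds to producing $w_T$ after $T$ passes over the same in-context examples.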

Findings

01. Linear looped Transformers achieve small error with $O(d)$ in-context examples.

02. Theoretical analysis confirms efficient multi-step gradient descent implementation.

03. Preliminary experiments support the theoretical results.

Abstract

In-context learning has been recognized as a key factor in the success of Large Language Models (LLMs). It refers to the model's ability to learn patterns on the fly from in-context examples provided in the prompt during inference. Previous studies have demonstrated that the Transformer architecture used in LLMs can implement a single-step gradient descent update by processing in-context examples in a single forward pass. Recent work has further shown that, during in-context learning, a looped Transformer can implement multi-step gradient descent updates in forward passes. However, those theoretical results require an exponential number of in-context examples, $n = \exp(\Omega(T))$, where $T$ is the number of loops or passes, to achieve a reasonably low error. In this paper, we study the in-context learning of linear looped Transformers on linear vector generation tasks. We show that linear…
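
To make the setting concrete, the following is a minimal NumPy sketch of multi-step gradient descent on a noiseless linear regression task with a number of examples linear in the dimension. It is not the paper's Transformer construction or its experimental setup: the function names (`multi_step_gd`, `relative_error`), the Gaussian data model, the step-size rule, and the default sizes are all assumptions chosen for illustration.

```python
import numpy as np


def multi_step_gd(X, y, num_steps, lr):
    """Run `num_steps` gradient steps on the mean squared error, starting from w = 0."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_steps):
        grad = X.T @ (X @ w - y) / n  # gradient of (1/2n) * ||Xw - y||^2
        w = w - lr * grad
    return w


def relative_error(d=8, examples_per_dim=4, num_steps=500, seed=0):
    """Hypothetical demo: recover w_star from n = examples_per_dim * d examples."""
    rng = np.random.default_rng(seed)
    n = examples_per_dim * d                # number of examples is linear in d
    w_star = rng.standard_normal(d)         # ground-truth weights (assumed Gaussian)
    X = rng.standard_normal((n, d))         # in-context inputs (assumed Gaussian)
    y = X @ w_star                          # noiseless linear targets
    # Step size 1/L, where L is the largest eigenvalue of X^T X / n.
    # eigvalsh returns eigenvalues in ascending order, so [-1] is the largest.
    L = np.linalg.eigvalsh(X.T @ X / n)[-1]
    w = multi_step_gd(X, y, num_steps, 1.0 / L)
    return np.linalg.norm(w - w_star) / np.linalg.norm(w_star)


if __name__ == "__main__":
    for steps in (1, 10, 100, 500):
        print(steps, relative_error(num_steps=steps))
```

In this sketch each iteration of the loop stands in for one pass of a looped model over the same examples. With the step size set to 1/L the error cannot increase from one step to the next, so running more steps on a fixed set of examples only helps.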

Taxonomy

Topics: Neural Networks and Applications · Generative Adversarial Networks and Image Synthesis

Methods: Attention Is All You Need · Dense Connections · Residual Connection · Dropout · Layer Normalization · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax · Linear Layer