On the Role of Depth and Looping for In-Context Learning with Task Diversity
Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, Sanjiv Kumar

TL;DR
This paper investigates how depth and looping mechanisms in Transformers influence their ability to perform in-context learning across diverse tasks, revealing trade-offs between expressivity and robustness, and proposing looping as a robust alternative.
Contribution
It establishes theoretical lower bounds on the depth Transformers need for in-context learning with task diversity, compares multilayer and looped Transformers, and demonstrates looping's robustness and monotonic loss behavior.
Findings
Transformers require more layers to solve more diverse tasks.
Multilayer Transformers lack robustness to small distributional shifts.
Looped Transformers are both expressive and robust, with monotonic loss behavior.
Abstract
The intriguing in-context learning (ICL) abilities of deep Transformer models have lately garnered significant attention. By studying in-context linear regression on unimodal Gaussian data, recent empirical and theoretical works have argued that ICL emerges from Transformers' abilities to simulate learning algorithms like gradient descent. However, these works fail to capture the remarkable ability of Transformers to learn multiple tasks in context. To this end, we study in-context learning for linear regression with diverse tasks, characterized by data covariance matrices with condition numbers in the range [1, κ], and highlight the importance of depth in this setting. More specifically, (a) we show theoretical lower bounds of log(κ) (or √κ) linear attention layers in the unrestricted (or restricted) attention setting and, (b) we show that multilayer…
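The abstract's framing of ICL as simulated gradient descent, with the same step reused at every level of depth, can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's construction: it runs plain gradient descent on a noiseless least-squares problem rather than a linear attention layer, and the names `make_task` and `looped_gd_losses`, the geometric eigenvalue spectrum, and the step size are all hypothetical choices made for this example.

```python
import numpy as np


def make_task(n, d, kappa, rng):
    """Sample a noiseless regression task with covariance condition number kappa.

    Assumption: eigenvalues are spread geometrically over [1, kappa].
    """
    eigvals = np.geomspace(1.0, kappa, d)
    # Scale each feature column so the population covariance is diag(eigvals).
    X = rng.standard_normal((n, d)) * np.sqrt(eigvals)
    w_star = rng.standard_normal(d)
    return X, X @ w_star


def looped_gd_losses(X, y, num_loops):
    """Apply one shared gradient-descent step num_loops times.

    Reusing the same update at every iteration stands in for weight
    sharing across loops. Returns the loss measured before each update.
    """
    n, d = X.shape
    H = X.T @ X / n
    # Step size 1 / (largest eigenvalue of H) keeps the loss non-increasing.
    step = 1.0 / np.linalg.eigvalsh(H)[-1]
    w = np.zeros(d)
    losses = []
    for _ in range(num_loops):
        residual = X @ w - y
        losses.append(float(np.mean(residual ** 2)))
        w = w - step * (X.T @ residual) / n
    return losses
```

Calling `looped_gd_losses` on tasks drawn with a small and a large `kappa` gives a rough feel for why poorly conditioned tasks need more iterations to reach the same relative loss, and the recorded losses do not increase from one loop to the next with this step size.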
Taxonomy
Topics: Education Methods and Practices · Cognitive and developmental aspects of mathematical skills · Spatial Cognition and Navigation
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Layer Normalization · Residual Connection · Byte Pair Encoding · Absolute Position Encodings · Multi-Head Attention · Softmax
