Fine-grained Analysis of In-context Linear Estimation: Data, Architecture, and Beyond
Yingcong Li, Ankit Singh Rawat, Samet Oymak

TL;DR
This paper provides a detailed analysis of in-context learning in Transformers with linear attention, exploring architectures, low-rank parameterization, and correlated designs to understand their optimization landscapes and generalization capabilities.
Contribution
It offers new theoretical insights into the optimization landscape of linear attention models, including risk bounds and the effects of low-rank parameterization, supported by experimental validation.
Findings
Under a suitable correlated design assumption, both 1-layer linear attention and 1-layer H3 implement 1-step preconditioned gradient descent.
Sample complexity benefits from distributional alignment.
LoRA adapts to distribution shifts by capturing covariance changes.
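To make "1-step preconditioned gradient descent" concrete, here is a minimal NumPy sketch of the estimator itself, not of an attention or H3 model. It assumes a squared loss, a zero initialization, and noiseless labels. The function name `one_step_pgd`, the synthetic covariance `Sigma`, and both preconditioner choices are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def one_step_pgd(X, y, P):
    """One preconditioned gradient step on 0.5 * ||y - X b||^2, starting from b = 0.

    The gradient at b = 0 is -X^T y, so a single step with preconditioner P
    gives b_hat = P @ X^T y. (Hypothetical helper for illustration only.)
    """
    return P @ (X.T @ y)

rng = np.random.default_rng(0)
n, d = 20000, 5

# Assumed correlated design: in-context features drawn with covariance Sigma.
A = rng.normal(size=(d, d))
Sigma = A @ A.T / d + np.eye(d)
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

beta = rng.normal(size=d)   # task vector to be estimated in context
y = X @ beta                # noiseless labels, for simplicity

# Plain gradient step (identity preconditioner) vs. a covariance-aware one.
beta_plain = one_step_pgd(X, y, np.eye(d) / n)
beta_precond = one_step_pgd(X, y, np.linalg.inv(Sigma) / n)

err_plain = np.linalg.norm(beta_plain - beta)
err_precond = np.linalg.norm(beta_precond - beta)
print(f"plain step error:          {err_plain:.3f}")
print(f"preconditioned step error: {err_precond:.3f}")

# Prediction on a query point uses the one-step estimate directly.
x_query = rng.multivariate_normal(np.zeros(d), Sigma)
prediction = float(x_query @ beta_precond)
```

In this sketch the plain step recovers roughly `Sigma @ beta` rather than `beta`, which is why a preconditioner matched to the feature covariance helps when the design is correlated. Which preconditioner the trained models actually learn is a result of the paper and is not reproduced here.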
Abstract
Recent research has shown that Transformers with linear attention are capable of in-context learning (ICL) by implementing a linear estimator through gradient descent steps. However, the existing results on the optimization landscape apply under stylized settings where task and feature vectors are assumed to be IID and the attention weights are fully parameterized. In this work, we develop a stronger characterization of the optimization and generalization landscape of ICL through contributions on architectures, low-rank parameterization, and correlated designs: (1) We study the landscape of 1-layer linear attention and 1-layer H3, a state-space model. Under a suitable correlated design assumption, we prove that both implement 1-step preconditioned gradient descent. We show that thanks to its native convolution filters, H3 also has the advantage of implementing sample weighting and…
Taxonomy
Topics: Anomaly Detection Techniques and Applications
Methods: Softmax · Attention Is All You Need · Convolution
