The Initialization Determines Whether In-Context Learning Is Gradient Descent
Shifeng Xie, Rui Yuan, Simone Rossi, Thomas Hannagan

TL;DR
This paper investigates how multi-head self-attention in large language models approximates gradient descent in in-context learning, introduces a trainable initial guess to improve performance, and validates these findings through theoretical analysis and experiments.
Contribution
It extends the understanding of in-context learning by analyzing multi-head self-attention with non-zero priors, introduces yq-LSA with trainable initial guesses, and demonstrates improved performance on regression and semantic tasks.
Findings
Multi-head LSA approximates GD under realistic conditions.
Introducing a trainable initial guess improves ICL performance.
Theoretical bounds are given on the number of attention heads needed.
Abstract
In-context learning (ICL) in large language models (LLMs) is a striking phenomenon, yet its underlying mechanisms remain only partially understood. Previous work connects linear self-attention (LSA) to gradient descent (GD), but this connection has primarily been established under simplified conditions with zero-mean Gaussian priors and zero initialization for GD. However, subsequent studies have challenged this simplified view by highlighting its overly restrictive assumptions, demonstrating instead that under conditions such as multi-layer or nonlinear attention, self-attention performs optimization-like inference, akin to but distinct from GD. We investigate how multi-head LSA approximates GD under more realistic conditions, specifically when incorporating non-zero Gaussian prior means in linear regression formulations of ICL. We first extend the multi-head LSA embedding matrix by introducing…
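The LSA-to-GD connection mentioned in the abstract can be illustrated with a small NumPy sketch. It is a toy illustration under stated assumptions (noiseless linear regression, a single GD step, unit-variance Gaussian inputs) and is not the paper's yq-LSA construction. The function names, learning rate, and dimensions are hypothetical choices made for this example. The sketch shows that one GD step from an initial guess `w0` can be rewritten as a weighted sum of dot products between the query and the context inputs, which is the shape of an unnormalized linear attention readout, and it compares a zero initial guess with one set to the prior mean.

```python
import numpy as np

def one_step_gd_prediction(X, y, x_query, w0, lr):
    """Predict the query label after one GD step on 0.5 * mean squared error, starting from w0."""
    n = X.shape[0]
    residual = y - X @ w0                    # errors of the initial guess on the context
    w1 = w0 + (lr / n) * (X.T @ residual)    # w1 = w0 - lr * gradient
    return w1 @ x_query

def linear_attention_prediction(X, y, x_query, w0, lr):
    """Same prediction, written as a sum over context tokens of value * (key . query)."""
    n = X.shape[0]
    residual = y - X @ w0                    # "values": labels corrected by the initial guess
    scores = X @ x_query                     # "key . query" dot products, no softmax
    return w0 @ x_query + (lr / n) * (residual @ scores)

def mean_squared_error(w0, prior_mean, rng, n_tasks=200, n_context=20, dim=5, lr=0.5):
    """Average squared query error over random tasks with weights drawn around prior_mean."""
    errors = []
    for _ in range(n_tasks):
        w_true = prior_mean + rng.normal(size=dim)   # non-zero-mean Gaussian prior (assumed unit variance)
        X = rng.normal(size=(n_context, dim))
        y = X @ w_true                               # noiseless labels (assumption)
        x_query = rng.normal(size=dim)
        pred = linear_attention_prediction(X, y, x_query, w0, lr)
        errors.append((pred - x_query @ w_true) ** 2)
    return float(np.mean(errors))

prior_mean = np.full(5, 2.0)
# The same seed is reused so both initial guesses are evaluated on identical tasks.
mse_zero = mean_squared_error(np.zeros(5), prior_mean, np.random.default_rng(0))
mse_prior = mean_squared_error(prior_mean, prior_mean, np.random.default_rng(0))
print(f"zero initial guess:       {mse_zero:.3f}")
print(f"prior-mean initial guess: {mse_prior:.3f}")
```

With `w0` set to zero, the second function reduces to the zero-initialization form `(lr / n) * sum_i y_i * (x_i . x_query)`. In this toy setup, the comparison is meant only to convey why an initial guess might matter when the prior mean is non-zero; it does not reproduce the paper's experiments.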
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks
