The Initialization Determines Whether In-Context Learning Is Gradient Descent

Shifeng Xie; Rui Yuan; Simone Rossi; Thomas Hannagan

arXiv:2512.04268·cs.LG·December 5, 2025

Open Access

TL;DR

This paper investigates how multi-head linear self-attention (LSA) approximates gradient descent (GD) during in-context learning (ICL) when the linear regression prior has a non-zero mean, introduces a trainable initial guess to improve performance, and supports these findings with theoretical analysis and experiments.

Contribution

It extends the understanding of in-context learning by analyzing multi-head self-attention with non-zero priors, introduces yq-LSA with trainable initial guesses, and demonstrates improved performance on regression and semantic tasks.

Findings

01 Multi-head LSA approximates GD under more realistic conditions, specifically non-zero Gaussian prior means.

02 Introducing a trainable initial guess improves ICL performance.

03 Theoretical bounds are given on the number of attention heads needed.

Abstract

In-context learning (ICL) in large language models (LLMs) is a striking phenomenon, yet its underlying mechanisms remain only partially understood. Previous work connects linear self-attention (LSA) to gradient descent (GD), but this connection has primarily been established under simplified conditions with zero-mean Gaussian priors and zero initialization for GD. Subsequent studies have challenged this simplified view by highlighting its overly restrictive assumptions, demonstrating instead that under conditions such as multi-layer or nonlinear attention, self-attention performs optimization-like inference, akin to but distinct from GD. We investigate how multi-head LSA approximates GD under more realistic conditions, specifically when incorporating non-zero Gaussian prior means in linear regression formulations of ICL. We first extend the multi-head LSA embedding matrix by introducing…
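To make the role of the initialization concrete, the following is a minimal sketch of the linear regression setting the abstract refers to. It is not the paper's yq-LSA construction. It only checks numerically that one GD step from a zero initialization gives the same query prediction as an un-normalized linear attention readout, and that a non-zero initialization adds extra terms that this readout alone does not contain. The function names, the learning rate, the prior mean of 1.0, and the choice `w0 = np.ones(d)` are all assumptions made for illustration.

```python
import numpy as np

def gd_one_step_prediction(X, y, x_query, w0, lr):
    """Query prediction after one GD step on (1/2n)*||X w - y||^2, starting from w0."""
    n = X.shape[0]
    grad = X.T @ (X @ w0 - y) / n      # gradient of the squared loss at w0
    w1 = w0 - lr * grad
    return w1 @ x_query

def linear_attention_prediction(X, y, x_query, lr):
    """Un-normalized linear attention readout: (lr/n) * sum_i y_i * <x_i, x_query>."""
    n = X.shape[0]
    return lr / n * np.sum(y * (X @ x_query))

rng = np.random.default_rng(0)
n, d, lr = 32, 4, 0.1                  # illustrative sizes and step size (assumed)
w_star = rng.normal(loc=1.0, size=d)   # task weights with a non-zero prior mean (assumed value)
X = rng.normal(size=(n, d))            # in-context inputs
y = X @ w_star                         # noiseless in-context targets
x_query = rng.normal(size=d)

# Zero initialization: the GD prediction equals the linear attention readout.
pred_zero_init = gd_one_step_prediction(X, y, x_query, np.zeros(d), lr)
pred_attention = linear_attention_prediction(X, y, x_query, lr)

# Non-zero initialization (hypothetical guess at the prior mean): extra terms appear.
w0 = np.ones(d)
pred_nonzero_init = gd_one_step_prediction(X, y, x_query, w0, lr)
correction = w0 @ x_query - lr / n * np.sum((X @ w0) * (X @ x_query))

print(np.isclose(pred_zero_init, pred_attention))
print(np.isclose(pred_nonzero_init, pred_attention + correction))
```

The `correction` term depends on `w0` and on the query, which gives some intuition for why a non-zero prior mean calls for an additional initial estimate on the query side.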


Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks