Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition

Haoren Xu; Guanhua Fang

arXiv:2605.10466·cs.LG·May 12, 2026


TL;DR

The paper argues that two behaviours of large language models, in-context learning and repetitive generation, both follow from one property of softmax attention: in the long-context limit, its output becomes a linear readout of the input covariance.

Contribution

It shows that, under stationary, ergodic and elliptical inputs, the softmax attention output converges almost surely to a linear readout of the input covariance, and it derives in-context learning and repetition as consequences of that limit.

Findings

01

In the long-context limit, the softmax attention output converges almost surely to a linear readout of the input covariance.

02

A single softmax head can implement one step of population gradient descent for in-context linear regression.

03

Repetition and mode collapse are explained as asymptotic behaviors of the covariance readout.
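
To make finding 02 concrete, here is a minimal sketch of how a covariance readout of the form $\Theta_V \Sigma \Theta_K^{\top} \Theta_Q x_t$ (the limit stated in the abstract) can reproduce one gradient descent step. The construction is an illustrative assumption, not necessarily the one used in the paper: tokens are stacked as $z = (x, y)$ with $y = w_\star^{\top} x + \text{noise}$, the query token is $(x_q, 0)$, the weights start at $w_0 = 0$, and the particular choices of `theta_q`, `theta_k` and `theta_v` below are hypothetical. All function names are made up for this example.

```python
import numpy as np

def population_covariance(sigma_x, w_star, noise_var):
    """Covariance of z = (x, y), where y = w_star^T x + independent zero-mean noise."""
    d = sigma_x.shape[0]
    sigma_z = np.zeros((d + 1, d + 1))
    sigma_z[:d, :d] = sigma_x
    sigma_z[:d, d] = sigma_x @ w_star          # E[x y]
    sigma_z[d, :d] = sigma_x @ w_star          # E[y x^T] (sigma_x is symmetric)
    sigma_z[d, d] = w_star @ sigma_x @ w_star + noise_var
    return sigma_z

def readout_prediction(sigma_z, x_query, eta):
    """Last coordinate of theta_v @ sigma_z @ theta_k^T @ theta_q @ z_query."""
    d = x_query.shape[0]
    # Hypothetical weights: the query/key product keeps the x-block scaled by eta,
    # and the value matrix keeps only the y-coordinate.
    theta_q = np.eye(d + 1)
    theta_k = eta * np.diag(np.append(np.ones(d), 0.0))
    theta_v = np.zeros((d + 1, d + 1))
    theta_v[d, d] = 1.0
    z_query = np.append(x_query, 0.0)          # label slot of the query is empty
    out = theta_v @ sigma_z @ theta_k.T @ theta_q @ z_query
    return out[d]

def one_step_gd_prediction(sigma_x, w_star, x_query, eta):
    """Prediction after one population GD step from w = 0 on 0.5 * E[(y - w^T x)^2]."""
    grad_at_zero = -(sigma_x @ w_star)         # gradient at w = 0 is -E[x y]
    w_one = -eta * grad_at_zero
    return w_one @ x_query
```

With these choices the readout's label coordinate equals $\eta\, x_q^{\top} \mathbb{E}[x y]$, which is exactly the prediction of the one-step iterate $w_1 = \eta\, \mathbb{E}[x y]$. The sketch works with population quantities only, so it says nothing about finite-context error.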

Abstract

Large language models (LLMs) exhibit two striking and ostensibly unrelated behaviours: in-context learning (ICL) and repetitive generation. In both, the model behaves as though it had summarised the context into a population-level statistic and discarded token-level detail. We ask whether this "summarisation and forgetting" can be derived from the attention mechanism itself, and answer in the affirmative. Under stationary, ergodic and elliptical inputs, the softmax attention output converges almost surely to $\Theta_{V} \Sigma \Theta_{K}^{\top} \Theta_{Q} x_{t}$, where $\Sigma$ is the input covariance; the long-context limit is therefore a linear readout of the input's second-order statistics. Two consequences follow. (i) For in-context linear regression, a single softmax head can implement one step of population gradient descent. Stacking such heads with residual connections iterates this…
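
The stated limit can be checked numerically in a simple special case. The sketch below assumes i.i.d. zero-mean Gaussian context tokens (one instance of stationary, ergodic, elliptical inputs), a fixed query $x_t$ that is not itself part of the context, and no $1/\sqrt{d}$ score scaling, since the formula in the abstract shows none. In that case the limit follows from the Gaussian tilting identity $\mathbb{E}[x\, e^{a^{\top} x}] / \mathbb{E}[e^{a^{\top} x}] = \Sigma a$ with $a = \Theta_K^{\top} \Theta_Q x_t$. The function names and the normalisation of the weight matrices are choices made for this example, not taken from the paper.

```python
import numpy as np

def softmax_attention_output(context, x_t, theta_q, theta_k, theta_v):
    """Single-head softmax attention of the query x_t over the rows of `context`."""
    query = theta_q @ x_t                      # shape (d,)
    keys = context @ theta_k.T                 # shape (n, d)
    values = context @ theta_v.T               # shape (n, d)
    scores = keys @ query                      # shape (n,), unscaled dot products
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                    # shape (d,)

def covariance_readout(sigma, x_t, theta_q, theta_k, theta_v):
    """The claimed long-context limit: Theta_V Sigma Theta_K^T Theta_Q x_t."""
    return theta_v @ sigma @ theta_k.T @ theta_q @ x_t

def unit_spectral_norm(matrix):
    """Rescale so the largest singular value is 1 (keeps softmax weights well behaved)."""
    return matrix / np.linalg.norm(matrix, 2)

def run_check(n=200_000, d=4, seed=0):
    """Return (attention output, covariance readout) for n Gaussian context tokens."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal((d, d))
    sigma = unit_spectral_norm(b @ b.T / d + 0.5 * np.eye(d))   # positive definite
    theta_q, theta_k, theta_v = (
        unit_spectral_norm(rng.standard_normal((d, d))) for _ in range(3)
    )
    x_t = rng.standard_normal(d)
    x_t /= np.linalg.norm(x_t)
    chol = np.linalg.cholesky(sigma)
    context = rng.standard_normal((n, d)) @ chol.T              # rows ~ N(0, sigma)
    attn = softmax_attention_output(context, x_t, theta_q, theta_k, theta_v)
    target = covariance_readout(sigma, x_t, theta_q, theta_k, theta_v)
    return attn, target
```

Calling `run_check()` with increasing `n` should show the gap between the two returned vectors shrinking. The matrices are normalised only to keep the Monte Carlo variance of the softmax weights small, so the comparison is meaningful at moderate context lengths.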
