Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition
Haoren Xu, Guanhua Fang

TL;DR
This paper argues that two behaviours of large language models, in-context learning and repetitive generation, both follow from a covariance readout performed by softmax attention in the long-context limit, which places them under one theoretical framework.
Contribution
It shows that, under stationary, ergodic and elliptical inputs, softmax attention converges in the long-context limit to a linear readout of the input covariance, and that in-context learning and repetition follow as consequences of this limit.
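The covariance readout can be checked numerically in a simple special case. The sketch below is not the paper's construction. It assumes i.i.d. zero-mean Gaussian tokens, identity query, key and value projections, and no temperature scaling. The matrix `Sigma`, the query `q` and the helper `softmax_attention` are hypothetical names chosen for illustration. For Gaussian tokens, reweighting N(0, Sigma) by exp(q·x) shifts its mean to Sigma q, so the softmax-weighted average of a long context should approach `Sigma @ q`.

```python
import numpy as np

def softmax_attention(query, keys, values):
    """Single-query softmax attention with raw dot-product scores."""
    scores = keys @ query
    weights = np.exp(scores - scores.max())   # subtract max for stability
    weights /= weights.sum()
    return weights @ values

rng = np.random.default_rng(0)

# Hypothetical input covariance and query, chosen only for illustration.
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 0.5, 0.1],
                  [0.0, 0.1, 0.8]])
q = np.array([0.5, -0.3, 0.2])

# Long context of i.i.d. zero-mean Gaussian tokens (a stationary, ergodic,
# elliptical special case); keys and values are the tokens themselves.
n = 200_000
X = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=n)

attention_out = softmax_attention(q, X, X)
covariance_readout = Sigma @ q

print("attention output  :", attention_out)
print("covariance readout:", covariance_readout)
```

The two printed vectors should agree up to sampling noise, and the gap should shrink as `n` grows. The output depends on the context only through its covariance, not through any individual token.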
Findings
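Under stationary, ergodic and elliptical inputs, the softmax attention output converges almost surely, in the long-context limit, to a linear readout of the input covariance.
For in-context linear regression, a single softmax head can implement one step of population gradient descent.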
Repetition and mode collapse are explained as asymptotic behaviors of the covariance readout.
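The link between an attention head and a gradient-descent step can be illustrated with a simplified stand-in. The sketch below swaps softmax for unnormalised linear attention, so it does not reproduce the paper's softmax construction. It only shows the standard identity that one gradient-descent step from zero weights on a least-squares loss gives the same prediction as a dot-product readout over (x_i, y_i) pairs. The function names and the learning rate `lr` are assumptions made for illustration.

```python
import numpy as np

def one_step_gd_prediction(X, y, x_query, lr):
    """Prediction after one gradient-descent step from w = 0 on the
    least-squares loss L(w) = (1 / 2n) * sum_i (w @ x_i - y_i)**2."""
    n = X.shape[0]
    grad_at_zero = -(X.T @ y) / n      # gradient of L evaluated at w = 0
    w1 = -lr * grad_at_zero            # w1 = 0 - lr * grad
    return w1 @ x_query

def linear_attention_prediction(X, y, x_query, lr):
    """Unnormalised linear attention: keys x_i, values y_i, query x_query.
    Scores are raw dot products (no softmax), averaged over the context."""
    n = X.shape[0]
    scores = X @ x_query               # one score per context example
    return lr * (scores @ y) / n

rng = np.random.default_rng(0)
n, d = 512, 4
w_true = rng.normal(size=d)            # hypothetical ground-truth weights
X = rng.normal(size=(n, d))
y = X @ w_true
x_query = rng.normal(size=d)

gd_pred = one_step_gd_prediction(X, y, x_query, lr=0.5)
attn_pred = linear_attention_prediction(X, y, x_query, lr=0.5)
cross_moment = X.T @ y / n             # empirical second-order statistic

print("one GD step      :", gd_pred)
print("linear attention :", attn_pred)
```

Both predictions equal `lr * cross_moment @ x_query`, so the readout uses the context only through an empirical second-order statistic of the (x, y) pairs.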
Abstract
Large language models (LLMs) exhibit two striking and ostensibly unrelated behaviours: in-context learning (ICL) and repetitive generation. In both, the model behaves as though it had summarised the context into a population-level statistic and discarded token-level detail. We ask whether this "summarisation and forgetting" can be derived from the attention mechanism itself, and answer in the affirmative. Under stationary, ergodic and elliptical inputs, the softmax attention output converges almost surely to a limit determined by the input covariance; the long-context limit is therefore a linear readout of the input's second-order statistics. Two consequences follow. (i) For in-context linear regression, a single softmax head can implement one step of population gradient descent. Stacking such heads with residual connections iterates this…
