TL;DR
This paper gives a theoretical explanation of in-context learning in large language models, showing that it biases representations toward smooth, low-frequency structure and that the resulting representations are robust to high-frequency noise.
Contribution
It introduces a unified double-convergence framework that explains ICL's low-frequency bias, proving the bias analytically and verifying it empirically.
Findings
Representations converge over context and layers, leading to low-frequency bias.
ICL representations are robust against high-frequency noise.
Total energy of representations decays without vanishing.
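As a rough intuition for the findings above, the following toy sketch applies a lazy random-walk averaging step repeatedly to random node features on a ring graph. This is not the paper's model or its DGP: the ring graph, the lazy-walk operator, and the names `lazy_ring_walk`, `smoothing_trace`, and `demo` are all assumptions chosen for illustration. It only shows how repeated smoothing can damp high-frequency components faster than low-frequency ones, while the total energy shrinks toward a nonzero floor set by the constant component.

```python
import numpy as np

def lazy_ring_walk(n):
    """Lazy random-walk matrix P = (I + D^-1 A) / 2 on an n-node ring (toy graph)."""
    idx = np.arange(n)
    A = np.zeros((n, n))
    A[idx, (idx + 1) % n] = 1.0
    A[idx, (idx - 1) % n] = 1.0
    return 0.5 * (np.eye(n) + A / A.sum(axis=1, keepdims=True))

def smoothing_trace(X, P, steps):
    """Apply P repeatedly; record total energy and mean graph frequency per step."""
    L = np.eye(len(P)) - P  # Laplacian-like operator: large on rough signals
    energies, freqs = [], []
    for _ in range(steps + 1):
        energy = float(np.sum(X * X))                      # squared Frobenius norm
        energies.append(energy)
        freqs.append(float(np.sum(X * (L @ X))) / energy)  # Rayleigh quotient of L
        X = P @ X                                          # one smoothing step
    return energies, freqs

def demo(n=16, d=4, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    # The +1.0 offset is an assumption: it gives the features a nonzero
    # constant (lowest-frequency) component that smoothing cannot remove.
    X = rng.normal(size=(n, d)) + 1.0
    return smoothing_trace(X, lazy_ring_walk(n), steps)

if __name__ == "__main__":
    energies, freqs = demo()
    print(f"total energy:   {energies[0]:.2f} -> {energies[-1]:.2f}")
    print(f"mean frequency: {freqs[0]:.3f} -> {freqs[-1]:.3e}")
```

In this toy setting the energy floor comes entirely from the constant mode, which the averaging step preserves. Whether the paper's non-vanishing energy arises from the same mechanism is not stated in this summary.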
Abstract
In-context learning (ICL) enables large language models (LLMs) to acquire new behaviors from the input sequence alone, without any parameter updates. Recent studies have shown that ICL can surpass the original meaning learned in the pretraining stage by internalizing the structure of the data-generating process (DGP) of the prompt into the hidden representations. However, the mechanisms by which LLMs achieve this ability remain open. In this paper, we present the first rigorous explanation of such phenomena by introducing a unified framework of double convergence, where hidden representations converge both over context and across layers. This double convergence process leads to an implicit bias towards smooth (low-frequency) representations, which we prove analytically and verify empirically. Our theory explains several open empirical observations, including why learned representations…
Peer Reviews
Decision·Submitted to ICLR 2026
1. The research topic is highly relevant to the community. The paper provides a sound theoretical framework that explains the ICL phenomenon through the proposed double convergence process. Within this framework, it replicates the empirical evidence of Park et al. 2024 using a simplified transformer model and adds new, insightful empirical analysis. 2. The detailed energy-decay analysis of hidden representations across layers is novel and supports the low-freq
1. The empirical work in this paper is limited to a single DGP, the same one used in Park et al. 2024. Experiments on additional DGPs would strengthen the evidence for the proposed framework. 2. It would be insightful to see how the model behaves when the input sequence contains more high-frequency noise than the 1% used in the Figure 4/Section 5.5 experiment.
Originality: The paper provides the first rigorous theoretical framework linking ICL dynamics to low-frequency bias and graph-spectral smoothness, a novel perspective that unifies several previously disconnected empirical phenomena (representation alignment, energy decay, noise robustness). The “double convergence” view is both conceptually clean and mathematically tractable. Quality and clarity: The work offers formal proofs (Theorems 1–2, Lemmas 6–8) establishing convergence guarantees under
1. The analysis assumes that attention maps are “balanced” and externally fixed, depending only on token identity rather than learned representations. While this enables tractability, it significantly limits realism—modern Transformer attention depends on contextual interactions. The authors partially justify this with empirical evidence (“covers 70% of connections”), but the assumption still weakens the generality of the claims. 2. The main theorems are proved under a specific DGP—a random wal
1. The paper’s theoretical foundation is both conceptually original and mathematically rigorous. The **double convergence**—one with respect to sequence length and another with respect to layer depth—neatly characterizes the representation learning process observed in the original ICLR paper and integrates it (at least up to some simplifications) into the actual forward computation of the model. The authors astutely identify the natural connection between the stationary distribution and transiti
1. While I acknowledge the authors’ justification in lines 170–178, the settings and assumptions underlying this paper still seem overly restrictive. To obtain context-wise convergence independent of token semantics, the authors assume a single-head attention layer and fixed attention weights—conditions that are far removed from the actual configurations of modern attention modules. Although they empirically validate the factorization of attention weights into four basic forms in Appendix G, whi
