Provable Low-Frequency Bias of In-Context Learning of Representations

Yongyi Yang; Hidenori Tanaka; Wei Hu

arXiv:2507.13540·cs.LG·July 31, 2025

3 Reviews

TL;DR

This paper provides a theoretical explanation for in-context learning in large language models, showing that it biases representations towards smooth, low-frequency structures and that these representations are robust to high-frequency noise.

Contribution

It introduces a unified double convergence framework explaining ICL's low-frequency bias and validates this with analytical proofs and empirical experiments.

Findings

01 Representations converge over context and layers, leading to low-frequency bias.

02 ICL representations are robust against high-frequency noise.

03 Total energy of representations decays without vanishing.

Abstract

In-context learning (ICL) enables large language models (LLMs) to acquire new behaviors from the input sequence alone without any parameter updates. Recent studies have shown that ICL can surpass the original meaning learned in the pretraining stage by internalizing the structure of the data-generating process (DGP) of the prompt into the hidden representations. However, the mechanisms by which LLMs achieve this ability remain open. In this paper, we present the first rigorous explanation of such phenomena by introducing a unified framework of double convergence, where hidden representations converge both over context and across layers. This double convergence process leads to an implicit bias towards smooth (low-frequency) representations, which we prove analytically and verify empirically. Our theory explains several open empirical observations, including why learned representations…
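The notion of a bias towards smooth (low-frequency) representations can be illustrated with a toy sketch. This is not the paper's transformer model or proof: it assumes a ring graph, a lazy random-walk averaging step, and two cosine modes, and all function names (`ring_average`, `fourier_mode`, `roughness`, `demo`) are hypothetical. It only shows the general spectral fact that repeated neighbor averaging on a graph damps high-frequency components much faster than low-frequency ones.

```python
import math

def ring_average(x):
    """One lazy averaging step on a ring graph (assumed toy operator):
    each node keeps half its value and takes a quarter from each neighbor."""
    n = len(x)
    return [0.5 * x[i] + 0.25 * (x[(i - 1) % n] + x[(i + 1) % n]) for i in range(n)]

def fourier_mode(n, k):
    """Cosine signal of integer frequency k on an n-node ring; larger k is less smooth."""
    return [math.cos(2 * math.pi * k * i / n) for i in range(n)]

def norm(x):
    """Euclidean norm of a signal."""
    return math.sqrt(sum(v * v for v in x))

def roughness(x):
    """Dirichlet energy on the ring divided by squared norm (a Rayleigh quotient)."""
    n = len(x)
    energy = sum((x[i] - x[(i + 1) % n]) ** 2 for i in range(n))
    return energy / norm(x) ** 2

def smooth(x, steps):
    """Apply the averaging step repeatedly."""
    for _ in range(steps):
        x = ring_average(x)
    return x

def demo(n=16, steps=10):
    low = fourier_mode(n, 1)            # smooth, low-frequency signal
    high = fourier_mode(n, n // 2 - 1)  # rapidly oscillating signal
    mixed = [a + b for a, b in zip(low, high)]
    out = smooth(mixed, steps)
    return {
        # Fraction of each pure mode's norm that survives the averaging.
        "low_gain": norm(smooth(low, steps)) / norm(low),
        "high_gain": norm(smooth(high, steps)) / norm(high),
        # Normalized Dirichlet energy of the mixed signal before and after.
        "roughness_before": roughness(mixed),
        "roughness_after": roughness(out),
    }

if __name__ == "__main__":
    for name, value in demo().items():
        print(f"{name}: {value:.3e}")
```

In this toy setting the low-frequency mode keeps most of its norm after ten steps, while the high-frequency mode is suppressed by many orders of magnitude, so the mixed signal becomes markedly smoother. How closely this mirrors the paper's double convergence analysis is an assumption of the sketch, not a claim from the source.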

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 6·Confidence 2

Strengths

1. The research topic is quite relevant to the community. The paper provides a sound theoretical framework to explain the ICLR phenomenon by proposing the double convergence process. Following the defined theoretical framework, it replicates empirical evidence from Park et al. 2024 using a simplified transformer model and further provides new, insightful empirical analysis. 2. The detailed energy-decay analysis of hidden representations across layers is novel and supports the low-freq

Weaknesses

1. The empirical work shown in this paper is limited to a single DGP, the same one used in Park et al. 2024. Adding experiments on additional DGPs would strengthen the evidence for the proposed framework. 2. It would be insightful to see how the model behaves under higher levels of high-frequency noise in the input sequence than the 1% used in the Figure 4/Section 5.5 experiment.

Reviewer 02·Rating 6·Confidence 2

Strengths

Originality: The paper provides the first rigorous theoretical framework linking ICL dynamics to low-frequency bias and graph-spectral smoothness, a novel perspective that unifies several previously disconnected empirical phenomena (representation alignment, energy decay, noise robustness). The “double convergence” view is both conceptually clean and mathematically tractable. Quality and clarity: The work offers formal proofs (Theorems 1–2, Lemmas 6–8) establishing convergence guarantees under

Weaknesses

1. The analysis assumes that attention maps are “balanced” and externally fixed, depending only on token identity rather than learned representations. While this enables tractability, it significantly limits realism—modern Transformer attention depends on contextual interactions. The authors partially justify this with empirical evidence (“covers 70% of connections”), but the assumption still weakens the generality of the claims. 2. The main theorems are proved under a specific DGP—a random wal

Reviewer 03·Rating 4·Confidence 4

Strengths

1. The paper’s theoretical foundation is both conceptually original and mathematically rigorous. The **double convergence**—one with respect to sequence length and another with respect to layer depth—neatly characterizes the representation learning process observed in the original ICLR paper and integrates it (at least up to some simplifications) into the actual forward computation of the model. The authors astutely identify the natural connection between the stationary distribution and transiti

Weaknesses

1. While I acknowledge the authors’ justification in lines 170–178, the settings and assumptions underlying this paper still seem overly restrictive. To obtain context-wise convergence independent of token semantics, the authors assume a single-head attention layer and fixed attention weights—conditions that are far removed from the actual configurations of modern attention modules. Although they empirically validate the factorization of attention weights into four basic forms in Appendix G, whi
