Understanding In-context Learning of Addition via Activation Subspaces
Xinyan Hu, Kayo Yin, Michael I. Jordan, Jacob Steinhardt, Lijie Chen

TL;DR
This paper investigates how transformer language models perform in-context learning of addition. It finds that a few attention heads with low-dimensional subspaces encode the addition process, and it introduces methods to analyze these mechanisms.
Contribution
The paper introduces a novel optimization and analysis framework that localizes and interprets the in-context learning mechanism in transformer models, with a focus on attention head subspaces.
Findings
A few attention heads encode addition in low-dimensional subspaces.
The model's in-context learning exhibits a self-correction mechanism.
On Llama-3-8B-instruct, the mechanism reduces to three attention heads with interpretable six-dimensional subspaces.
Abstract
To perform few-shot learning, language models extract signals from a few input-label pairs, aggregate these into a learned prediction rule, and apply this rule to new inputs. How is this implemented in the forward pass of modern transformer models? To explore this question, we study a structured family of few-shot learning tasks for which the true prediction rule is to add an integer to the input. We introduce a novel optimization method that localizes the model's few-shot ability to only a few attention heads. We then perform an in-depth analysis of individual heads, via dimensionality reduction and decomposition. As an example, on Llama-3-8B-instruct, we reduce its mechanism on our tasks to just three attention heads with six-dimensional subspaces, where four dimensions track the unit digit with trigonometric functions at periods 2, 5, and 10, and two dimensions track…
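The task family and the periodic encoding described in the abstract can be illustrated with a small sketch. This is not the authors' code: the `x -> y` prompt format, the input range, the function names `make_add_k_prompt` and `trig_features`, and the default periods `(2, 5, 10)` are all assumptions chosen for illustration, and the features below are generic cosine/sine pairs rather than the specific basis the paper recovers from the heads.

```python
import math
import random


def make_add_k_prompt(k, n_shots, query, seed=0):
    """Build a few-shot prompt whose hidden rule is y = x + k.

    The "x -> y" line format and the 1..100 input range are assumptions.
    """
    rng = random.Random(seed)
    xs = [rng.randint(1, 100) for _ in range(n_shots)]
    shots = [f"{x} -> {x + k}" for x in xs]
    # The final line leaves the answer blank for the model to complete.
    return "\n".join(shots + [f"{query} ->"])


def trig_features(k, periods=(2, 5, 10)):
    """Return one (cos, sin) pair per period, evaluated at the integer k.

    The default periods are an assumption for this sketch.
    """
    feats = []
    for period in periods:
        angle = 2 * math.pi * k / period
        feats.append((math.cos(angle), math.sin(angle)))
    return feats


if __name__ == "__main__":
    print(make_add_k_prompt(k=3, n_shots=4, query=17))
    print(trig_features(3))
```

Because every period in the default tuple divides 10, `trig_features(k)` and `trig_features(k + 10)` agree up to floating-point error, which is why features of this kind can track a unit digit but need separate components to track magnitude.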
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper provides a detailed analysis of how ICL emerges in transformers, moving beyond descriptive observations to a mechanistic understanding of specific heads and subspaces.
2. The authors identify a very small subset of attention heads responsible for ICL, demonstrating that task-specific behavior can be localized within a large network.
3. Intervention experiments strengthen the causal claims about which heads and subspaces are responsible for ICL.
4. The paper connects abstract mechani
1. The analysis is restricted to synthetic add-k tasks, which may not generalize to more complex or natural ICL tasks such as language understanding or reasoning.
2. The paper localizes ICL to attention heads but largely ignores contributions from feed-forward networks (FFNs) [1] or other layers, leaving a partial picture of the mechanism.
3. The projection of head outputs into low-dimensional trigonometric subspaces assumes well-behaved linear relationships, which may not hold in more complex o
* The authors present a well motivated study of how LLMs learn to perform a single task in-context.
* The study is extremely in-depth and thorough.
* The authors present clear evidence for their model, according to which a few heads represent the parity, unit digit and magnitude of $k$.
* The authors also present a useful method for finding the heads that are used by a model to solve a task in-context.
* The paper is pretty dense to read; some of the explanations in the text could have accompanying figures. This holds especially for Sections 3, 4 and 5, which do not contain much in terms of figures.
* Not really a major weakness, but the paper only covers one task. While the analysis of how the model solves this task is very detailed, it's not obvious how these insights will generalize to how ICL may work in more general setups. For instance, do the circuits analyzed here also cover $k$-subtrac
The methodology is precise and reproducible (seemingly), combining causal interventions with low-dimensional analysis rather than relying on correlations. The discovery that only three heads encode nearly all ICL function is striking and empirically well supported. The identification of a structured six-dimensional subspace gives a clear, interpretable geometry to addition in LLMs. The extractor-aggregator relation and observed self-correction behavior offer new insight into how contextual infor
- I disagree with the discussion in 132-138. In my understanding, "likely output" means two words that belong to a similar topic and thus have a closer semantic relationship. Since $x_q$ and $k$ are both numbers, they would also be semantically closer than $x_q$ and singer.
- Your activation patching is similar to the treatment of task vector arithmetic in the factual recall task in Merullo et al. (2024), which leverages task vectors; please add a comparison.
- Your localization optimization me
Taxonomy
Topics: Machine Learning and Algorithms
Methods: Softmax · Attention Is All You Need
