In-Context Learning Without Copying
Kerem Sahin, Sheridan Feucht, Adam Belfki, Jannik Brinkmann, Aaron Mueller, David Bau, Chris Wendler

TL;DR
This paper introduces Hapax, a training regime that omits the loss on tokens predictable by induction heads. Models trained this way copy far less from context yet still learn abstractive in-context tasks, which challenges the assumption that induction heads are a necessary building block for in-context learning.
Contribution
Hapax shows that abstractive in-context learning can emerge even when inductive copying is strongly suppressed during training, suggesting that induction heads are not a prerequisite for it.
Findings
Abstractive ICL capabilities are preserved despite reduced induction head activity.
Models trained with Hapax achieve higher accuracy than the vanilla model on 13 of 21 tasks.
Induction heads are not necessary for learning abstractive in-context tasks.
Abstract
Induction heads are attention heads that perform inductive copying by matching patterns from earlier context and copying their continuations verbatim. As models develop induction heads, they experience a sharp drop in training loss, a phenomenon cited as evidence that induction heads may underlie a wide range of in-context learning (ICL) capabilities. In this work, we investigate whether induction heads are a necessary building block for learning abstractive ICL capabilities (i.e., tasks where the answer is not contained in the input context), or whether such capabilities can emerge independently. We propose Hapax, a training regime that omits the loss contribution of tokens predictable by induction heads. Despite a significant reduction in inductive copying, abstractive ICL capabilities are preserved, with the model achieving higher accuracy than the vanilla model on 13 out of 21…
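The loss-masking idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the simplest possible criterion, namely that a token counts as "predictable by an induction head" when the bigram (previous token, current token) has already appeared earlier in the same context. The function names `induction_predictable_mask` and `masked_mean_loss` are hypothetical, and the paper's actual matching rule (n-gram length, soft matching) may differ.

```python
from typing import Dict, List, Sequence, Set


def induction_predictable_mask(tokens: Sequence[int]) -> List[bool]:
    """Return mask[t] = True when tokens[t] could be copied from context.

    Assumed criterion (illustrative only): the previous token tokens[t-1]
    occurred earlier and was followed by the same token as tokens[t].
    """
    mask = [False] * len(tokens)
    # Maps a token to the set of tokens that have followed it so far.
    continuations: Dict[int, Set[int]] = {}
    for t in range(1, len(tokens)):
        prev, cur = tokens[t - 1], tokens[t]
        if cur in continuations.get(prev, set()):
            mask[t] = True
        # Record the bigram only after checking, so a bigram never matches itself.
        continuations.setdefault(prev, set()).add(cur)
    return mask


def masked_mean_loss(per_token_loss: Sequence[float], mask: Sequence[bool]) -> float:
    """Average the loss over positions that are NOT masked out."""
    kept = [loss for loss, masked in zip(per_token_loss, mask) if not masked]
    return sum(kept) / len(kept) if kept else 0.0


# Toy example: the second "7 3 9" repeats the first one.
tokens = [7, 3, 9, 7, 3, 9]
mask = induction_predictable_mask(tokens)
# Hypothetical per-token losses; copyable tokens are typically cheap to predict.
losses = [2.0, 2.0, 2.0, 2.0, 0.5, 0.5]
print(mask)
print(masked_mean_loss(losses, mask))
```

In this toy sequence only the last two tokens are dropped from the loss, so the model receives no gradient signal for completing the repeated pattern, while all other positions are trained as usual.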
Peer Reviews
Decision·Submitted to ICLR 2026
- **Interesting empirical perspective.** The paper introduces a creative intervention (masking losses on repeated n-grams) to examine how removing copy-related learning signals affects in-context learning. This approach provides a valuable window into how specific training signals shape internal circuits, offering an original empirical angle rather than a purely conceptual contribution.
- **Clear experimental setup.** The masking mechanism and evaluation procedure are defined precisely

### 1. Conceptual framing may mislead rather than innovate
The central claim, that *inductive copying is not essential for ICL*, could be interpreted as overturning a previously dominant view that induction heads are the sole foundation of ICL. In practice, the field already recognizes that induction heads support certain ICL behaviors but are not the only mechanism. Thus, the framing could **unintentionally give the impression** that the paper refutes a consensus that did not exist. The gen
The paper’s originality lies in directly challenging one of the most established causal hypotheses about ICL. Rather than building another interpretability tool, the authors use a simple yet powerful experimental manipulation to test whether transformers can still learn ICL without explicit copying. This approach transforms a long-standing correlational observation into a causal experiment. The outcome is both surprising and illuminating: models deprived of copying signals still learn and someti
The scope of empirical validation is somewhat limited. Experiments use 1B-parameter GPT-NeoX models trained on The Pile, which is appropriate for controlled analysis but smaller than models where ICL phenomena are most pronounced. It remains unclear whether the same findings generalize to multi-billion-parameter transformers or to multi-layer SAEs and circuit configurations used in interpretability research. The masking strategy targets repeated n-grams, which suppresses literal copying but not
- The authors introduce a novel framework to investigate ICL in a setting where induction head formation is discouraged
- I believe the question of what transformers can learn when they are disallowed from forming induction heads is interesting and may lead to the discovery of important transformer circuits
- The Hapax scheme the authors propose is an interesting lens into the formation of induction heads and what signals are required for such circuits to form, I encourage the authors to contin

- It is not clear to me that input embedding similarity is the right way to resolve repetitions in text that are tokenized differently.
- Tokens may have similar embeddings despite representing distinct strings
- It is not clear to me (even after reading appendix B) that the threshold chosen by the authors ($\tau = 0.3$) effectively suppresses repetitions in natural text.
- I cannot find experiments testing the effect of $\tau \neq 0.3$, or experiments that show that embedding c
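The reviewer's concern about embedding-based matching can be made concrete with a short sketch. It assumes, purely for illustration, that two tokens are treated as a repetition when the cosine similarity of their input embeddings is at least $\tau$. The direction of the comparison, the helper names `cosine_similarity` and `is_soft_match`, and the toy vectors are all assumptions and are not taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_soft_match(u: Sequence[float], v: Sequence[float], tau: float = 0.3) -> bool:
    """Assumed rule: tokens match when their embedding similarity is >= tau."""
    return cosine_similarity(u, v) >= tau


# Toy 2-d embeddings for three distinct (hypothetical) tokens.
emb_a = [1.0, 0.0]
emb_b = [0.6, 0.8]   # similarity with emb_a is about 0.6
emb_c = [0.0, 1.0]   # orthogonal to emb_a

print(is_soft_match(emb_a, emb_b))  # distinct tokens, still counted as a match
print(is_soft_match(emb_a, emb_c))  # not a match
```

Under this assumed rule, `emb_a` and `emb_b` belong to different tokens yet pass the threshold, which is the failure mode the review describes: a low threshold can mask tokens that are not true repetitions, while a high one can miss differently tokenized repeats.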
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Ferroelectric and Negative Capacitance Devices
