Single-Position Intervention Fails: Distributed Output Templates Drive In-Context Learning
Bryan Cheng, Jasper Zhang

TL;DR
This paper demonstrates that in-context learning task representations are distributed across multiple tokens. Single-position interventions fail to transfer task identity even where linear probes are fully accurate, so probing accuracy does not predict causal importance.
Contribution
It introduces the distributed template hypothesis, showing that task identity is encoded jointly across the demonstration output tokens and locating the causal locus of in-context learning at those tokens, with the strongest effect at layer 8.
Findings
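Single-position activation interventions achieve 0% task transfer across all 28 layers of Llama-3.2-3B, despite 100% probing accuracy at the same positions.
Multi-position interventions, which replace activations at all demonstration output tokens at once, achieve up to 96% transfer at layer 8.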
Task encoding is distributed across tokens, not localized.
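The contrast between single-position and multi-position intervention can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the array shapes, the position indices, and especially `toy_readout`, a hypothetical stand-in that reads the task from the mean projection over output positions. It is not the paper's model or code; it only shows why overwriting one position can leave a distributed signal intact while overwriting all of them flips it.

```python
import numpy as np

def patch_positions(base, donor, positions):
    """Return a copy of `base` with the rows at `positions` taken from `donor`."""
    patched = base.copy()
    patched[positions] = donor[positions]
    return patched

def toy_readout(hidden, output_positions, task_direction):
    """Hypothetical readout: task is the sign of the mean projection
    of the output-position activations onto `task_direction`."""
    scores = hidden[output_positions] @ task_direction
    return "A" if scores.mean() > 0 else "B"

# Assumed toy dimensions and demonstration output-token positions.
seq_len, d_model = 12, 4
output_positions = [2, 5, 8, 11]
task_direction = np.array([1.0, 0.0, 0.0, 0.0])

# Toy activations for a task-A prompt and a task-B prompt.
run_a = np.zeros((seq_len, d_model))
run_b = np.zeros((seq_len, d_model))
run_a[output_positions, 0] = 1.0
run_b[output_positions, 0] = -1.0

# Single-position intervention: overwrite only the last output token.
single = patch_positions(run_a, run_b, [output_positions[-1]])
# Multi-position intervention: overwrite every output token at once.
multi = patch_positions(run_a, run_b, output_positions)

single_task = toy_readout(single, output_positions, task_direction)
multi_task = toy_readout(multi, output_positions, task_direction)
print(single_task, multi_task)
```

In this toy setup the single-position patch leaves the readout at task A, while the multi-position patch moves it to task B. In a real model the same replacement would typically be applied to the hidden states at one layer during the forward pass, and the effect measured on the model's actual output.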
Abstract
Understanding how large language models encode task identity from few-shot demonstrations is a central open problem in mechanistic interpretability. Prior work uses linear probing to localize task representations, reporting high classification accuracy at specific layers. We reveal a striking dissociation: probing accuracy completely fails to predict causal importance. Single-position activation intervention achieves 0% task transfer across all 28 layers of Llama-3.2-3B, despite 100% probing accuracy at those same positions. This null result is itself a key finding, demonstrating that task encoding is fundamentally distributed. Multi-position intervention, which replaces activations at all demonstration output tokens simultaneously, achieves up to 96% transfer (N=50, 95% CI: [87%, 99%]) at layer 8, pinpointing for the first time the causal locus of ICL task identity. We establish the generality…
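As a sanity check on the reported interval, the sketch below computes a 95% confidence interval for a binomial proportion. Two things are assumptions not stated in the text above: that 96% at N=50 corresponds to 48 successful transfers out of 50, and that the interval is a Wilson score interval. The function name `wilson_interval` is chosen here for illustration.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z defaults to 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Assumed counts: 48 of 50 trials transferred the task (96%).
low, high = wilson_interval(48, 50)
print(f"{low:.1%} to {high:.1%}")
```

Under these assumptions the bounds round to 87% and 99%, consistent with the interval quoted in the abstract. Other interval methods, such as the exact Clopper-Pearson interval, would give slightly different bounds.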
