Signal in the Noise: Polysemantic Interference Transfers and Predicts Cross-Model Influence
Bofan Gong, Shiyang Lai, James Evans, Dawn Song

TL;DR
This paper uncovers systematic interference patterns in language models caused by polysemantic representations, demonstrating that these patterns transfer across models and can be used for model control and understanding.
Contribution
It introduces a method using sparse autoencoders to map polysemantic structures and shows that interference patterns are transferable across different models and scales.
Findings
Interference patterns are shared between small and large models.
Interventions based on interference patterns can predict behavioral shifts.
Polysemantic interference structures are systematic and not purely stochastic.
Abstract
Polysemanticity is pervasive in language models and remains a major challenge for interpretation and model behavioral control. Leveraging sparse autoencoders (SAEs), we map the polysemantic topology of two small models (Pythia-70M and GPT-2-Small) to identify SAE feature pairs that are semantically unrelated yet exhibit interference within models. We intervene at four foci (prompt, token, feature, neuron) and measure induced shifts in the next-token prediction distribution, uncovering polysemantic structures that expose a systematic vulnerability in these models. Critically, interventions distilled from counterintuitive interference patterns shared by two small models transfer reliably to larger instruction-tuned models (Llama-3.1-8B/70B-Instruct and Gemma-2-9B-Instruct), yielding predictable behavioral shifts without access to model internals. These findings challenge the view that…
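To make the abstract's two measurements concrete, here is a minimal sketch under stated assumptions; it is not the authors' method. It assumes that "interference" between two SAE features can be approximated by the absolute cosine overlap of their decoder directions, that "semantically unrelated" can be approximated by low cosine similarity between embeddings of the features' text descriptions, and that a behavioral shift can be scored as the KL divergence between next-token distributions before and after an intervention. The function names, the thresholds, and the toy arrays are all hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def candidate_interfering_pairs(directions, description_embeddings,
                                overlap_min=0.5, semantic_max=0.1):
    """Return pairs (i, j) with i < j whose feature directions overlap
    strongly while their description embeddings are nearly unrelated.

    Both thresholds are hypothetical placeholders, not values from the paper.
    """
    overlap = cosine_similarity_matrix(np.asarray(directions, dtype=float))
    semantic = cosine_similarity_matrix(
        np.asarray(description_embeddings, dtype=float))
    rows, cols = np.triu_indices(overlap.shape[0], k=1)
    keep = (np.abs(overlap[rows, cols]) >= overlap_min) & (
        semantic[rows, cols] <= semantic_max)
    return list(zip(rows[keep].tolist(), cols[keep].tolist()))


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D logit vector."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) between two next-token distributions."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))


# Toy data (made up): three features in a 2-D activation space.
toy_directions = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
toy_descriptions = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.05]])
pairs = candidate_interfering_pairs(toy_directions, toy_descriptions)

# Toy logits (made up) standing in for model outputs before/after intervening.
baseline = softmax(np.array([2.0, 1.0, 0.0]))
intervened = softmax(np.array([1.0, 2.0, 0.0]))
shift = kl_divergence(baseline, intervened)
print(pairs, shift)
```

In practice the direction vectors would come from a trained SAE and the logits from the model under study; the sketch only shows the shape of the computation.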
Peer Reviews
Decision: ICLR 2026 Poster
I really like this paper. It needs some improvement on presentation / writing to ensure results are easily legible, but conditioning on that minimal rewrite, the empirical characterization of interference pathologies (such as the super-neurons), the transfer of these pathologies across model scales, and the exhaustiveness of the experiments is awesome. To be clear, I believe these results were relatively expected, but getting a detailed account of them in a single paper is very helpful for the c
- Missing descriptions of relevant concepts: I tried to find the precise mathematical definition for how interference is defined or measured, but didn't see it anywhere in the paper. This really impacted my ability to onboard with the paper on a first reading pass, but eventually digging through the appendix and looking at results, assuming there's a reasonable mathematical definition for the notion of interference, I like this paper's goals and results. If the authors can fix these definitions
- Well-motivated problem setup, grounded in the real-world impact of polysemantic features and the vulnerability of black-box models to adversarial attacks.
- Experiments and methods are comprehensive, thorough, and well-documented.
- Interesting empirical results with clear impact, especially the transfer between models that are tiny by modern standards and full-size 70B models. The super-neuron results are also interesting.
- Generally well-written.
- One of the greatest strengths of this paper is the result on feature transfer across model scale. However, the effect size seems very small, at only a few percentage points in most cases (Table 1), which the authors don't seem to note. If this is right, it seems like an important limitation of this work that should be made clear to the reader, although it may also diminish the significance of this work somewhat.
- Relatedly, some of the interventions in Figure 3 seem to have very modes
- The paper attempts to connect feature-level correlations to behavioral effects and cross-model transfer, which is an ambitious angle.
- The idea of probing “interference” across different loci (feature, gradient, prompt, neuron) is original and could, in principle, yield insights into distributed representations.
Unclear necessity of SAEs: It is never justified why SAEs are needed for identifying interference. The same cosine-overlap analysis could be performed directly in the model’s activation or embedding space, or between token clusters in the vocabulary embedding. By relying on pre-computed SAE features (and their textual glosses), the analysis may inherit annotation noise and input-centric biases (see Arad et al., SAEs Are Good for Steering - If You Select the Right Features, 2025). Definition: Th
Taxonomy
Topics: Opinion Dynamics and Social Influence
