Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors

Max McGuinness; Alex Serrano; Luke Bailey; Scott Emmons

arXiv:2512.11949·cs.LG·December 16, 2025

Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors

Max McGuinness, Alex Serrano, Luke Bailey, Scott Emmons

PDF

Open Access 4 Models 5 Reviews

TL;DR

This paper demonstrates that fine-tuned language models can learn to evade unseen activation monitors by manipulating internal states, highlighting a potential vulnerability in AI safety measures.

Contribution

It introduces Neural Chameleons, models trained to hide their internal states from monitors, and provides a stress-test for the robustness of activation monitoring techniques.

Findings

01

Models can evade unseen activation monitors zero-shot.

02

Evasion is concept-specific and minimally impacts standard capabilities.

03

Defenses like ensemble monitors are more resilient but not foolproof.

Abstract

Activation monitoring, which probes a model's internal states using lightweight classifiers, is an emerging tool for AI safety. However, its worst-case robustness under a misalignment threat model--where a model might learn to actively conceal its internal states--remains untested. Focusing on this threat model, we ask: could a model learn to evade previously unseen activation monitors? Our core contribution is to stress-test the learnability of this behavior. We demonstrate that finetuning can create Neural Chameleons: models capable of zero-shot evading activation monitors. Specifically, we fine-tune an LLM to evade monitors for a set of benign concepts (e.g., languages, HTML) when conditioned on a trigger of the form: "You are being probed for {concept}". We show that this learned mechanism generalizes zero-shot: by substituting {concept} with a safety-relevant term like 'deception',…

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01Rating 4Confidence 3

Strengths

The notion that an LLM could strategically modify internal activations to avoid detection by unseen monitors is a genuinely new and important safety finding. The use of the model organism framework, clear separation between training and evaluation probes, and post hoc probe training ensure that the observed obfuscation cannot be dismissed as data leakage or trivial overfitting.

Weaknesses

The experiments center primarily on Gemma-2-9b-it. While this choice is justified for controlled testing, evaluating whether the same phenomenon arises in larger-scale or architecturally different models (e.g., LLaMA-3, Mistral, or GPT-J) Although the paper references PCA and probe alignment, the connection to existing mechanistic frameworks such as superposition theory or representation engineering (Zou et al., 2023) could be deepened The paper uses explicit trigger phrases (“You are being pr

Reviewer 02Rating 4Confidence 4

Strengths

The mechanistic analysis provides valuable insights into how models can implement activation manipulation (low-rank perturbations, semantic steering via embedding similarity). The combination of backdoor-style training with systematic evaluation against multiple probe architectures and post-hoc training represents a useful methodological contribution. Paper is well-written with clear motivation and methodology Excellent figures that effectively communicate the core findings

Weaknesses

**The core finding is less surprising than framed** The paper demonstrates that a model with an embedded backdoor can evade probes trained on clean data. This is essentially: - Training phase: Model learns "when trigger present, perturb activations to evade 11 benign probes" - Test phase: New probes trained on data WITHOUT trigger present - Result: Model with active backdoor evades these new probes This is expected behavior for any backdoor attack with semantic generalization capabilities. The

Reviewer 03Rating 4Confidence 3

Strengths

1) The paper is the first to show that LLMs can perform self-obfuscation, which is a timely contribution towards stress-testing AI safety through probes. 2) The experimental setup is thorough and the paper is well written.

Weaknesses

1) The self-obfuscation capability is deliberately engineered into the model using an explicit trigger and fine-tuning, but the paper does not mention whether this would show up naturally during post training or safety alignment. 2) The experiments are confined to a single model, Gemma-2-9b-it. Thus, the experimental evidence for self-obfuscation extending to models beyond this class is limited. Additionally, the internal activations of models with substantially different architectures, such as

Reviewer 04Rating 4Confidence 5

Strengths

1. Empirically shows that language models can learn the general ability to obfuscate their activations, which is quite an important finding. 2. The paper is, for the most part, well-written and easy to follow. 3. The paper shows how the capability of the model changes under the trigger and with different languages, which makes their findings and model organism more concrete.

Weaknesses

1. The work does not have any ablation studies on train time topics. It seems important (and relatively straightforward) to characterise how many and how diverse the train time topics should be to get generalisation to unseen topics. 2. The paper does not contain other model scales or model families. Given that the authors are testing the capabilities of language models in general, this is an important ablation to include. I would highly recommend including results for at least one other model

Reviewer 05Rating 6Confidence 3

Strengths

1. The paper identifies a previously unexplored AI safety risk: that a model could learn to hide or obfuscate its internal activations to evade interpretability-based monitoring tools. 2. The authors clearly define a worst-case misalignment scenario, where a model actively conceals its internal states. 3.The findings have direct implications for safety auditing, interpretability research, and model governance. 4. The threat model is well-defined, the methodology clean and replicable, and th

Weaknesses

1. The mechanistic analysis (low-rank activation manipulation) is intriguing but relatively shallow. The paper does not identify which layers or circuits perform self-obfuscation, nor does it analyze whether the phenomenon corresponds to intentional “deception-like” representations. More detailed causal or circuit-level tracing would strengthen the mechanistic claims. 2. Experiments appear to use mid-sized LLMs (not frontier-scale models). It remains uncertain whether the same evasion dynamics

Code & Models

Models

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI