TL;DR
This paper investigates whether large language models can learn to evade latent-space monitors designed to detect undesirable behaviors. It finds that some monitor types are vulnerable to such evasion and that learned evasion can generalize to unseen monitors.
Contribution
Introduces RL-Obfuscation, a reinforcement learning approach enabling LLMs to evade latent-space monitors while preserving output quality, and evaluates its effectiveness across different monitor types.
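The trade-off described above (evade the monitor while preserving output quality) can be pictured as a reward with two terms. The sketch below is only an illustration under stated assumptions: the function name `obfuscation_reward`, the `quality` and `monitor_score` inputs, the threshold, and the additive form are all hypothetical and are not taken from the paper.

```python
def obfuscation_reward(
    quality: float,
    monitor_score: float,
    threshold: float = 0.5,
    evasion_weight: float = 1.0,
) -> float:
    """Hypothetical reward for one generated completion.

    quality        -- assumed score in [0, 1] for how well the output keeps
                      the model's original black-box behaviour
    monitor_score  -- assumed monitor output in [0, 1]; higher means the
                      monitor is more confident the behaviour is undesirable
    threshold      -- assumed decision threshold of the monitor
    evasion_weight -- assumed weight on the evasion bonus
    """
    # The completion "evades" the monitor if its score stays under the threshold.
    evaded = monitor_score < threshold
    # Additive combination (an assumption): quality plus a bonus for evasion.
    return quality + (evasion_weight if evaded else 0.0)


# Same output quality, but only the first completion slips past the monitor.
reward_evaded = obfuscation_reward(quality=0.8, monitor_score=0.2)
reward_caught = obfuscation_reward(quality=0.8, monitor_score=0.9)
```

With a reward of this shape, a policy gains nothing by evading the monitor if it degrades output quality by more than the evasion bonus, which is one simple way to express the "evade while preserving behaviour" constraint.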
Findings
Token-level monitors are highly vulnerable to evasion.
Holistic monitors like max-pooling are more robust.
Models can generalize evasion to unseen monitors.
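To make the distinction between aggregation strategies concrete, here is a minimal sketch of a linear probe that scores each token and then pools the scores with mean, median, or max. The probe weights, activations, and threshold are made-up toy values, and the helper names are hypothetical. It illustrates one plausible intuition only, not the paper's analysis: averaged scores can be diluted by many benign-looking tokens, whereas max-pooling reacts to a single high-scoring token.

```python
import math
from statistics import mean, median


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def token_scores(activations, weights, bias):
    """Per-token linear-probe scores: sigmoid(w . h_t + b) for each token h_t."""
    return [
        sigmoid(sum(w * h for w, h in zip(weights, act)) + bias)
        for act in activations
    ]


def aggregate(scores, how):
    """Pool per-token scores into one sequence-level score."""
    if how == "mean":
        return mean(scores)
    if how == "median":
        return median(scores)
    if how == "max":
        return max(scores)
    raise ValueError(f"unknown aggregation: {how}")


def flagged(activations, weights, bias, how, threshold=0.5):
    """True if the pooled score reaches the (assumed) decision threshold."""
    return aggregate(token_scores(activations, weights, bias), how) >= threshold


# Toy 2-dimensional activations for a 4-token sequence (hypothetical values).
# Only the third token has a strong component along the probe direction.
weights, bias = [1.0, -1.0], -2.0
activations = [[0.0, 0.0], [0.0, 0.0], [6.0, 0.0], [0.0, 0.0]]

results = {how: flagged(activations, weights, bias, how)
           for how in ("mean", "median", "max")}
```

In this toy sequence the mean and median stay under the threshold because three of the four tokens score low, while the max is driven by the single high-scoring token and triggers the monitor.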
Abstract
Latent-space monitors aim to detect undesirable behaviours in Large Language Models by leveraging their internal representations rather than relying solely on black-box outputs. These methods have shown promise in identifying behaviours such as deception and unsafe completions. However, these monitors may themselves become training signals, for example, by using problematic samples found in deployment to retrain models. This raises an important question: can models learn to evade such monitors? To evaluate this capability, we introduce RL-Obfuscation, in which LLMs are finetuned via reinforcement learning to evade latent-space monitors while maintaining their blackbox behaviour. We apply RL-Obfuscation to Language Models ranging from 7B to 14B parameters and evaluate their Evasion Success Rate against a suite of monitors. We find that token-level monitors are highly vulnerable to this…
Peer Reviews
Decision: Submitted to ICLR 2026
1. The problem is novel and interesting. This is an interesting red-teaming method showing that LLMs can be adversarially trained to bypass latent-space monitoring systems. 2. The paper is well written and well presented. 3. The findings are insightful and the analysis is in-depth, demonstrating that different types of monitors show different levels of robustness to this attack method.
1. The paper would benefit from a broader scope of tasks. Current experiments focus on synthetic tasks; adding full instruction-following or open-domain generation tasks would strengthen the paper. 2. More interpretability analysis explaining how internal representations shift, or which features the model manipulates, would further strengthen the paper. 3. More discussion of the key components of more robust latent-space monitors would further strengthen the paper.
The article introduces a novel technique to evade latent-space monitoring, clearly states the conditions of the attack, and shows its effectiveness on a large set of probes and models, at various intensities and model sizes. The structure is clear, with key takeaways summarized regularly throughout the paper. The very thorough appendices are appreciated for their detail and for reproducibility.
As with any adaptive attack, a proposal for an adaptive defense following the same principles would have been appreciated, though this may be better suited to future work. The content of the article, while detailed, is quite dense and was hard to read.
+ Novel and Important Research Direction: The paper addresses a critical gap in AI safety research by investigating whether models can autonomously learn to evade safety monitors through RL, complementing existing adversarial attack methodologies. + Comprehensive Experimental Design: The study evaluates multiple model sizes (7B-14B parameters), various probe architectures (linear, MLP, attention), and different aggregation strategies (mean, median, max), providing broad empirical coverage. + P
+ Limited Scope of Latent-Space Monitors: The study focuses exclusively on probe-based monitors (linear, MLP, attention). Other monitoring approaches like SAE-based probes, latent OOD detectors, or ensemble methods are mentioned but not evaluated, limiting generalizability of conclusions. + Unclear Threat Model: The paper conflates different attacker capabilities—sometimes assuming black-box access (no gradients), sometimes white-box (full weights access for fine-tuning). The practical scenario
