RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?

Rohan Gupta; Erik Jenner

arXiv:2506.14261·cs.LG·February 27, 2026

TL;DR

This paper investigates whether large language models can learn to evade latent-space monitors designed to detect undesirable behaviors. It finds that vulnerability depends on the monitor type and that learned evasion can generalize to unseen monitors.

Contribution

Introduces RL-Obfuscation, a reinforcement learning approach enabling LLMs to evade latent-space monitors while preserving output quality, and evaluates its effectiveness across different monitor types.

Findings

01

Token-level monitors are highly vulnerable to evasion.

02

Holistic monitors like max-pooling are more robust.

03

Models can generalize evasion to unseen monitors.
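The contrast between token-level and max-pooling monitors can be illustrated with a minimal sketch. It assumes a probe that emits one score per token and a sequence-level decision made by aggregating those scores against a threshold; the function names, the 0.5 threshold, and the example scores are hypothetical and not taken from the paper. One plausible intuition for finding 02 is that under max-pooling a single high-scoring token is enough to flag the whole sequence, so a policy has to suppress every token rather than just lower the average.

```python
from statistics import mean, median


def aggregate(token_scores, how):
    """Collapse per-token probe scores into one sequence-level score."""
    if how == "mean":
        return mean(token_scores)
    if how == "median":
        return median(token_scores)
    if how == "max":
        return max(token_scores)
    raise ValueError(f"unknown aggregation: {how}")


def is_flagged(token_scores, how, threshold=0.5):
    """Return True if the aggregated score reaches the (assumed) threshold."""
    return aggregate(token_scores, how) >= threshold


# Hypothetical scores: most tokens look benign, one token trips the probe.
scores = [0.1, 0.1, 0.2, 0.9, 0.1]
flags = {how: is_flagged(scores, how) for how in ("mean", "median", "max")}
print(flags)
```

In this toy case only the max-pooled monitor flags the sequence, while the mean and median aggregates stay below the threshold.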

Abstract

Latent-space monitors aim to detect undesirable behaviours in Large Language Models by leveraging their internal representations rather than relying solely on black-box outputs. These methods have shown promise in identifying behaviours such as deception and unsafe completions. However, these monitors may themselves become training signals, for example, by using problematic samples found in deployment to retrain models. This raises an important question: can models learn to evade such monitors? To evaluate this capability, we introduce RL-Obfuscation, in which LLMs are finetuned via reinforcement learning to evade latent-space monitors while maintaining their blackbox behaviour. We apply RL-Obfuscation to Language Models ranging from 7B to 14B parameters and evaluate their Evasion Success Rate against a suite of monitors. We find that token-level monitors are highly vulnerable to this…
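The training objective described in the abstract (evade the monitor while keeping black-box behaviour intact) can be sketched as a reward with two terms. This is an illustrative assumption, not the paper's actual reward: `quality` stands in for some black-box behaviour score, `monitor_score` for the monitor's output on the model's activations, and the threshold, the penalty weight, and the definition of Evasion Success Rate as the fraction of samples scoring below the threshold are all hypothetical.

```python
def obfuscation_reward(quality, monitor_score, threshold=0.5, penalty=1.0):
    """Reward black-box quality, minus a penalty if the monitor still fires.

    All arguments are assumed for illustration; the paper's reward may differ.
    """
    evaded = monitor_score < threshold
    return quality - (0.0 if evaded else penalty)


def evasion_success_rate(monitor_scores, threshold=0.5):
    """Fraction of samples the monitor fails to flag (assumed definition)."""
    if not monitor_scores:
        raise ValueError("need at least one monitor score")
    evaded = sum(1 for s in monitor_scores if s < threshold)
    return evaded / len(monitor_scores)


# Hypothetical monitor scores for four completions.
esr = evasion_success_rate([0.1, 0.6, 0.3, 0.9])
print(esr)
```

Under these assumptions, a completion that keeps its quality but is still flagged earns less reward than one that keeps its quality and slips under the threshold, which is the pressure an RL fine-tuning loop would exploit.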

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 6·Confidence 2

Strengths

1. The problem is novel and interesting: a red-teaming method showing that LLMs can be adversarially trained to bypass latent-space monitoring systems.
2. The paper is well written and well presented.
3. The findings are insightful and the analysis is in-depth, demonstrating that different types of monitors show different levels of robustness to this attack.

Weaknesses

1. The paper would benefit from a broader scope of tasks. Current experiments focus on synthetic tasks; adding full instruction-following or open-domain generation tasks would strengthen the paper.
2. More interpretability analysis explaining how internal representations shift, or which features the model manipulates, would further strengthen the paper.
3. More discussion of the key components of more robust latent-space monitors would further strengthen the paper.

Reviewer 02·Rating 6·Confidence 3

Strengths

The article introduces a novel technique for evading latent-space monitoring, clearly states the conditions of the attack, and shows its efficiency across a large set of probes and models, with various intensities and model sizes. The structure is clear, with key takeaways summarized regularly throughout the publication. Very thorough appendices complete the article, which is appreciated for the detail and the reproducibility.

Weaknesses

As with any adaptive attack, a proposed adaptive defense following the same principles would have been appreciated, though this may be better suited to future work. The article is detailed but quite dense, and was hard to read.

Reviewer 03·Rating 4·Confidence 4

Strengths

+ Novel and Important Research Direction: The paper addresses a critical gap in AI safety research by investigating whether models can autonomously learn to evade safety monitors through RL, complementing existing adversarial attack methodologies.
+ Comprehensive Experimental Design: The study evaluates multiple model sizes (7B-14B parameters), various probe architectures (linear, MLP, attention), and different aggregation strategies (mean, median, max), providing broad empirical coverage.
+ P

Weaknesses

+ Limited Scope of Latent-Space Monitors: The study focuses exclusively on probe-based monitors (linear, MLP, attention). Other monitoring approaches like SAE-based probes, latent OOD detectors, or ensemble methods are mentioned but not evaluated, limiting generalizability of conclusions.
+ Unclear Threat Model: The paper conflates different attacker capabilities, sometimes assuming black-box access (no gradients), sometimes white-box (full weights access for fine-tuning). The practical scenario

Code & Models

Repositories

cybershiptrooper/obfuscated_backdoors
pytorch·Official
