Obfuscated Activations Bypass LLM Latent-Space Defenses

Luke Bailey; Alex Serrano; Abhay Sheshadri; Mikhail Seleznyov; Jordan Taylor; Erik Jenner; Jacob Hilton; Stephen Casper; Carlos Guestrin; Scott Emmons

arXiv:2412.09565 · cs.LG · February 11, 2025

Open Access · 3 Reviews

TL;DR

This paper demonstrates that current latent-space defenses against LLM attacks can be bypassed through obfuscated activations, revealing vulnerabilities and highlighting the malleability of neural activation patterns.

Contribution

It introduces methods to obfuscate activations that defeat state-of-the-art defenses, exposing fundamental weaknesses in latent-space monitoring techniques.

Findings

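01 · Latent-space defenses, including sparse autoencoders, representation probing, and latent OOD detection, are vulnerable to obfuscated activations.

02 · Against probes trained to classify harmfulness, obfuscation can often reduce recall from 100% to 0% while retaining a 90% jailbreaking rate.

03 · Obfuscation has limits: on a complex task (writing SQL code), it reduces model performance.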
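
The recall figure in the findings measures the fraction of harmful inputs that a monitor flags. As a rough illustration only, the sketch below computes recall for a score-based monitor whose threshold is set from benign data. The function name `recall_at_fpr`, the 1% false-positive budget, and the synthetic scores are all assumptions made for this example; none of them come from the paper.

```python
import numpy as np

def recall_at_fpr(benign_scores, harmful_scores, fpr=0.01):
    """Fraction of harmful inputs flagged when the threshold is the
    (1 - fpr) quantile of the benign scores (hypothetical helper)."""
    threshold = np.quantile(benign_scores, 1.0 - fpr)
    return float(np.mean(harmful_scores > threshold))

# Synthetic monitor scores (assumed, not measured): higher means "more harmful".
rng = np.random.default_rng(0)
benign = rng.normal(0.0, 1.0, size=1000)
harmful_plain = rng.normal(8.0, 1.0, size=200)        # clearly separated from benign
harmful_obfuscated = rng.normal(-1.0, 1.0, size=200)  # pushed into the benign range

recall_plain = recall_at_fpr(benign, harmful_plain)
recall_obfuscated = recall_at_fpr(benign, harmful_obfuscated)
print(f"recall without obfuscation: {recall_plain:.2f}")
print(f"recall with obfuscation:    {recall_obfuscated:.2f}")
```

In this toy setting, an attack that moves harmful scores into the benign range drives recall toward zero without changing the threshold, which is the failure mode the findings describe.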

Abstract

Recent latent-space monitoring techniques have shown promise as defenses against LLM attacks. These defenses act as scanners that seek to detect harmful activations before they lead to undesirable actions. This prompts the question: Can models execute harmful behavior via inconspicuous latent states? Here, we study such obfuscated activations. We show that state-of-the-art latent-space defenses -- including sparse autoencoders, representation probing, and latent OOD detection -- are all vulnerable to obfuscated activations. For example, against probes trained to classify harmfulness, our attacks can often reduce recall from 100% to 0% while retaining a 90% jailbreaking rate. However, obfuscation has limits: we find that on a complex task (writing SQL code), obfuscation reduces model performance. Together, our results demonstrate that neural activations are highly malleable: we can…
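
One way to picture an obfuscated activation is as the result of optimizing an input against two objectives at once: produce the target behavior, and keep a monitor's score low. The toy sketch below does this for a two-dimensional linear "model" with a logistic probe. It is a minimal illustration under stated assumptions, not the authors' implementation: the function `attack`, the weight `lam`, the linear model `W`, the readout `v`, the probe `w_probe`, and all numeric values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attack(W, v, w_probe, target, lam, steps=2000, lr=0.05):
    """Gradient descent on an input embedding x for a toy linear model.

    Hidden activation: h = W @ x.  Model output: y = v @ h.
    Behavior loss:     (y - target) ** 2
    Obfuscation loss:  softplus(w_probe @ h), the cross-entropy toward
                       the probe's "benign" label (assumed form).
    Returns the final output y and the final probe score in [0, 1].
    """
    x = np.zeros(W.shape[1])
    for _ in range(steps):
        h = W @ x
        y = v @ h
        logit = w_probe @ h
        # d/dh of (y - target)^2 + lam * softplus(logit); softplus' = sigmoid
        grad_h = 2.0 * (y - target) * v + lam * sigmoid(logit) * w_probe
        x -= lr * (W.T @ grad_h)
    h = W @ x
    return float(v @ h), float(sigmoid(w_probe @ h))

# Hypothetical toy setup: the probe direction overlaps the behavior direction
# but also reads a second coordinate that the behavior does not depend on.
W = np.eye(2)
v = np.array([1.0, 0.0])
w_probe = np.array([2.0, 1.0])

y_plain, score_plain = attack(W, v, w_probe, target=3.0, lam=0.0)
y_obf, score_obf = attack(W, v, w_probe, target=3.0, lam=1.0)
print(f"behavior only:      output={y_plain:.3f}, probe score={score_plain:.3f}")
print(f"with obfuscation:   output={y_obf:.3f}, probe score={score_obf:.3f}")
```

With `lam=0.0` the optimizer reaches the target output and the probe fires. With `lam=1.0` it reaches almost the same output while moving the activation along a direction the behavior ignores, so the probe score falls. Real models and monitors are nonlinear, so this only conveys the shape of the trade-off.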

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- The authors show that existing latent-space detection methods for jailbreaks and backdoored samples can be evaded with obfuscation attacks. The results suggest that models have multiple independent representations of a given harmful concept, posing difficulties for future latent-space detection methods.
- The paper is well written and the obfuscation attack formulation is clean and natural.
- The discussion and the additional experiments presented are helpful, in particular the analysis of adv

Weaknesses

- The strongest results are obtained by optimizing raw embeddings and do not translate to real-world attacks; realistic attacks (the "hard prompt" setting) were significantly less effective, in particular against MLP generation probes. Hence, it is unclear whether these defenses are indeed robust on distribution or whether the limited effectiveness of obfuscation attacks observed in this setting is due to the difficulty of discrete optimization.
- The obfuscation attack formulati

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- The motivation is clear, and the paper clarifies concepts well.
- The paper is technically solid, with experiments across tasks, attacks, and defenses.
- The attack is novel and generalizes to current systems.

Weaknesses

- The attack is limited by the strong assumption of white-box access.
- Some hyperparameters are not reported: how many iterations does the attack need for different model sizes and sequence lengths?
- Benign performance is not reported, so it is unclear how stealthy this type of attack is in the output space. Also, for such jailbreaking attacks, not all successful attacks are equally harmful or misaligned. It would be useful to see whether the harmful output stays close to the original semantics.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- Comprehensive experimental evaluation across multiple defense mechanisms (supervised probes, OOD detectors, adversarially trained monitors).
- Implications of the discoveries are carefully explored and discussed.
- Important study of the effectiveness of different monitors, giving the community insight into when and which monitors are effective.

Weaknesses

- Many of these discoveries seem unsurprising or not novel given the existing literature in adversarial and robust ML, especially papers written in other domains.
- While the paper shows an obfuscation tax on SQL code generation, the implications it draws from this limited experiment seem too big a stretch to count as a contribution.
- Details on the additional cost of obfuscation attacks are lacking: how much harder does obfuscation make the problem in terms of attack runtime, convergence time, etc.?
- The paper ha

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing