Latent-space Attacks for Refusal Evasion in Language Models
Giorgio Piras, Raffaele Mura, Fabio Brau, Maura Pintor, Luca Oneto, Fabio Roli, Battista Biggio

TL;DR
This paper recasts refusal suppression in language models as a latent-space evasion attack and derives from that view a new attack method that outperforms existing baselines.
Contribution
The paper recasts refusal suppression as a latent-space evasion problem and proposes a controlled attack that is more effective than prior methods.
Findings
Achieves state-of-the-art attack success rate across 15 models.
Outperforms existing refusal-ablation baselines.
Effectively pushes representations into the model's answer region.
Abstract
Safety-aligned language models are trained to refuse harmful requests, yet refusal behavior can be suppressed by steering their internal representations. Existing methods do so by ablating a refusal direction from model activations, aiming to remove refusal from the model's residual stream. Despite their empirical success, these methods lack a principled account of the latent-space transformation they induce and why it suppresses refusal. In this work, we recast refusal suppression as a latent-space evasion attack against linear probes trained to separate refused from answered prompts. Under this view, prior work's difference-in-means direction naturally defines such a probe, and its ablation is exactly a projection onto its decision boundary, i.e., a minimum-confidence evasion attack. This perspective not only explains the empirical success of prior work but also admits a key…
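
The geometric claim in the abstract, that ablating a difference-in-means direction is a projection onto a linear probe's decision boundary, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes synthetic Gaussian vectors in place of real residual-stream activations, a probe of the form `w @ x + b` with `b = 0` by default, and hypothetical helper names (`diff_in_means_direction`, `probe_score`, `project_onto_boundary`) chosen only for illustration.

```python
import numpy as np


def diff_in_means_direction(refused, answered):
    """Unit vector pointing from the answered-class mean to the refused-class mean."""
    r = refused.mean(axis=0) - answered.mean(axis=0)
    return r / np.linalg.norm(r)


def probe_score(x, w, b=0.0):
    """Linear probe score; positive means 'refused' under this sketch's convention."""
    return x @ w + b


def project_onto_boundary(x, w, b=0.0):
    """Orthogonal projection of each row of x onto the hyperplane w @ x + b = 0.

    With unit-norm w and b = 0 this reduces to directional ablation,
    x - (w @ x) * w, i.e. removing the component of x along w.
    """
    score = probe_score(x, w, b)
    return x - np.outer(score, w) / (w @ w)


if __name__ == "__main__":
    # Synthetic stand-ins for activations (assumption: not real model data).
    rng = np.random.default_rng(0)
    d = 16
    shift = np.zeros(d)
    shift[0] = 2.0
    refused = rng.normal(size=(200, d)) + shift
    answered = rng.normal(size=(200, d)) - shift

    w = diff_in_means_direction(refused, answered)
    ablated = project_onto_boundary(refused, w)

    print("mean probe score before ablation:", probe_score(refused, w).mean())
    print("max |probe score| after ablation:", np.abs(probe_score(ablated, w)).max())
```

After the projection every point has a probe score of zero up to floating-point error, so it sits exactly on the decision boundary rather than inside the region the probe labels as answered. This is the sense in which the abstract calls ablation a minimum-confidence evasion attack.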
