Latent-space Attacks for Refusal Evasion in Language Models

Giorgio Piras; Raffaele Mura; Fabio Brau; Maura Pintor; Luca Oneto; Fabio Roli; Battista Biggio

arXiv:2605.21706·cs.AI·May 22, 2026

TL;DR

This paper reframes refusal suppression in language models as a latent-space evasion attack against linear probes, and derives from that view an attack that outperforms existing refusal-ablation baselines.

Contribution

The paper recasts refusal suppression as a latent-space evasion attack against linear probes that separate refused from answered prompts, and proposes a controlled attack that outperforms prior refusal-ablation methods.

Findings

01. Achieves state-of-the-art attack success rate across 15 models.

02. Outperforms existing refusal-ablation baselines.

03. Effectively pushes representations into the model's answer region.

Abstract

Safety-aligned language models are trained to refuse harmful requests, yet refusal behavior can be suppressed by steering their internal representations. Existing methods do so by ablating a refusal direction from model activations, aiming to remove refusal from the model's residual stream. Despite their empirical success, these methods lack a principled account of the latent-space transformation they induce and why it suppresses refusal. In this work, we recast refusal suppression as a latent-space evasion attack against linear probes trained to separate refused from answered prompts. Under this view, prior work's difference-in-means direction naturally defines such a probe, and its ablation is exactly a projection onto its decision boundary, i.e., a minimum-confidence evasion attack. This perspective not only explains the empirical success of prior work but also admits a key…
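
The abstract's claim that ablating the difference-in-means direction is a projection onto a probe's decision boundary can be illustrated with a small sketch. This is not the authors' code: the synthetic activations, the helper names (`diff_in_means_direction`, `ablate`, `probe_score`), and the zero-bias probe are all assumptions made for illustration. Under those assumptions, the probe scores an activation by its dot product with the unit refusal direction, and ablation drives that score to zero, which is the point where the probe is least confident.

```python
import numpy as np

def diff_in_means_direction(refused, answered):
    """Unit vector from the answered-prompt mean to the refused-prompt mean."""
    r = refused.mean(axis=0) - answered.mean(axis=0)
    return r / np.linalg.norm(r)

def ablate(x, r_hat):
    """Remove the component of each row of x along the unit vector r_hat."""
    return x - np.outer(x @ r_hat, r_hat)

def probe_score(x, r_hat, bias=0.0):
    """Linear probe score; zero bias is an assumption of this sketch."""
    return x @ r_hat + bias

# Synthetic stand-ins for residual-stream activations (hypothetical data).
rng = np.random.default_rng(0)
d = 16
shift = np.zeros(d)
shift[0] = 3.0
refused = rng.normal(size=(200, d)) + shift    # "refused" cluster
answered = rng.normal(size=(200, d)) - shift   # "answered" cluster

r_hat = diff_in_means_direction(refused, answered)
ablated = ablate(refused, r_hat)

scores_before = probe_score(refused, r_hat)   # positive on average: refusal side
scores_after = probe_score(ablated, r_hat)    # zero up to rounding: on the boundary
```

In this sketch the ablated activations land on the hyperplane where the probe score is zero rather than on the answered side of it, which matches the abstract's description of ablation as a minimum-confidence evasion.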
