Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

Xulin Hu; Che Wang; Wei Yang Bryan Lim; Jianbo Gao; Zhong Chen

arXiv:2605.02958·cs.CR·May 6, 2026

TL;DR

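This paper introduces SALO (Sparse Activation Localization Operator), an inference-time detector that exploits latent refusal trajectories for jailbreak detection. Against forced-decoding attacks it raises detection rates from ~0% to over 90%, where methods relying on static terminal representations perform poorly.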

Contribution

The paper shows that refusal is a dynamic and sparse process rather than a localized outcome, and proposes SALO to detect adversarial attacks by analyzing persistent upstream signatures.

Findings

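01  SALO improves detection rates against forced-decoding attacks from ~0% to over 90%.

02  Refusal trajectories are persistent upstream signatures that remain intact even when adversarial attacks (e.g., GCG) suppress terminal signals.

03  Detection based on the dynamic refusal trajectory outperforms detection based on static terminal representations.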

Abstract

Representation Engineering typically relies on static refusal vectors derived from terminal representations. We move beyond this paradigm, demonstrating that refusal is a dynamic and sparse process rather than a localized outcome. Using Causal Tracing, we uncover the Refusal Trajectory, a persistent upstream signature that remains intact even when adversarial attacks (e.g., GCG) suppress terminal signals. Leveraging this, we propose SALO (Sparse Activation Localization Operator), an inference-time detector designed to capture these latent patterns. SALO effectively recovers defense capabilities against forced-decoding attacks, improving detection rates from ~0% to >90% where methods relying on terminal states perform poorly.
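The contrast between a static terminal-state detector and a trajectory-based one can be illustrated with a toy sketch. This is not the authors' SALO: the abstract does not specify its operator, so everything below is an assumption made for illustration. The activations are synthetic, the per-layer directions use the common difference-of-means construction, the "attack" is simulated by projecting the refusal direction out of the last layer only, and the names `refusal_directions`, `static_score`, and `trajectory_score` are hypothetical.

```python
import numpy as np

def refusal_directions(harmful, benign):
    """Per-layer difference-of-means direction, unit-normalised.
    harmful, benign: arrays of shape (n_prompts, n_layers, d_model)."""
    diff = harmful.mean(axis=0) - benign.mean(axis=0)  # (n_layers, d_model)
    return diff / np.linalg.norm(diff, axis=-1, keepdims=True)

def layer_scores(acts, directions):
    """Project one prompt's activations (n_layers, d_model) onto each layer's direction."""
    return np.einsum("ld,ld->l", acts, directions)

def static_score(acts, directions):
    """Static detector: looks only at the terminal (last) layer."""
    return layer_scores(acts, directions)[-1]

def trajectory_score(acts, directions, upstream):
    """Trajectory detector (toy): averages the projections over upstream layers."""
    return layer_scores(acts, directions)[upstream].mean()

# Synthetic activations (assumption): harmful prompts carry a refusal
# component of strength 3 along a random unit direction in every layer.
rng = np.random.default_rng(0)
n_prompts, n_layers, d_model = 200, 8, 32
true_dirs = rng.normal(size=(n_layers, d_model))
true_dirs /= np.linalg.norm(true_dirs, axis=-1, keepdims=True)
benign = 0.5 * rng.normal(size=(n_prompts, n_layers, d_model))
harmful = 0.5 * rng.normal(size=(n_prompts, n_layers, d_model)) + 3.0 * true_dirs

directions = refusal_directions(harmful, benign)

# Toy stand-in for an attack that suppresses the terminal signal:
# remove the refusal component from the last layer only.
attacked = harmful[0].copy()
attacked[-1] -= (attacked[-1] @ directions[-1]) * directions[-1]

upstream = slice(0, n_layers - 1)  # every layer except the terminal one
print("static score, attacked:    ", static_score(attacked, directions))
print("trajectory score, attacked:", trajectory_score(attacked, directions, upstream))
print("trajectory score, benign:  ", trajectory_score(benign[0], directions, upstream))
```

In this toy setting the terminal-layer score of the attacked prompt collapses to zero by construction, while the upstream average is untouched. That mirrors the qualitative claim of the abstract, but it says nothing about how SALO itself localizes sparse activations or how real attacks such as GCG alter hidden states.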
