Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
Xulin Hu, Che Wang, Wei Yang Bryan Lim, Jianbo Gao, Zhong Chen

TL;DR
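This paper introduces SALO (Sparse Activation Localization Operator), an inference-time detector that exploits latent refusal trajectories for jailbreak detection. Against forced-decoding attacks it raises detection rates from ~0% to >90%, where static methods relying on terminal states perform poorly.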
Contribution
It uncovers the dynamic nature of refusal signals and proposes SALO to detect adversarial attacks by analyzing persistent upstream signatures.
Findings
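SALO improves detection rates against forced-decoding attacks from ~0% to over 90%.
The Refusal Trajectory is a persistent upstream signature that remains intact even when adversarial attacks (e.g., GCG) suppress terminal signals.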
Dynamic analysis outperforms static terminal representations.
Abstract
Representation Engineering typically relies on static refusal vectors derived from terminal representations. We move beyond this paradigm, demonstrating that refusal is a dynamic and sparse process rather than a localized outcome. Using Causal Tracing, we uncover the Refusal Trajectory, a persistent upstream signature that remains intact even when adversarial attacks (e.g., GCG) suppress terminal signals. Leveraging this, we propose SALO (Sparse Activation Localization Operator), an inference-time detector designed to capture these latent patterns. SALO effectively recovers defense capabilities against forced-decoding attacks, improving detection rates from ~0% to >90% where methods relying on terminal states perform poorly.
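The contrast the abstract draws between a static terminal refusal vector and an upstream signal can be illustrated with a toy sketch. This is not SALO, whose architecture is not described in this summary. It assumes a common difference-of-means refusal direction, fully synthetic activations, and a hypothetical attack that erases the refusal component at the final layer only; all names (`refusal_directions`, `static_score`, `trajectory_score`) are illustrative.

```python
import numpy as np


def refusal_directions(harmful, harmless):
    """Per-layer difference-of-means directions, unit-normalized.

    Both inputs have shape (n_prompts, n_layers, d_model).
    """
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)  # (n_layers, d_model)
    return diff / np.linalg.norm(diff, axis=-1, keepdims=True)


def static_score(acts, directions):
    """Projection onto the refusal direction at the final layer only."""
    return acts[:, -1, :] @ directions[-1]


def trajectory_score(acts, directions):
    """Mean projection onto the per-layer directions across all layers."""
    per_layer = np.einsum("nld,ld->nl", acts, directions)
    return per_layer.mean(axis=1)


def midpoint(pos_scores, neg_scores):
    """Threshold halfway between the two class means (an assumed, simple rule)."""
    return 0.5 * (pos_scores.mean() + neg_scores.mean())


# Synthetic activations (assumption: no real model is involved here).
rng = np.random.default_rng(0)
n, n_layers, d_model = 64, 8, 32
true_dirs = rng.normal(size=(n_layers, d_model))
true_dirs /= np.linalg.norm(true_dirs, axis=-1, keepdims=True)
harmless = rng.normal(scale=0.1, size=(n, n_layers, d_model))
harmful = rng.normal(scale=0.1, size=(n, n_layers, d_model)) + 3.0 * true_dirs

dirs = refusal_directions(harmful, harmless)
static_thr = midpoint(static_score(harmful, dirs), static_score(harmless, dirs))
traj_thr = midpoint(trajectory_score(harmful, dirs), trajectory_score(harmless, dirs))

# Hypothetical attack: remove the refusal component at the final layer only,
# leaving the upstream layers untouched.
attacked = harmful.copy()
last = attacked[:, -1, :]
attacked[:, -1, :] = last - np.outer(last @ dirs[-1], dirs[-1])

static_rate = float((static_score(attacked, dirs) > static_thr).mean())
traj_rate = float((trajectory_score(attacked, dirs) > traj_thr).mean())
print(f"static detection rate:     {static_rate:.2f}")
print(f"trajectory detection rate: {traj_rate:.2f}")
```

In this toy setup the terminal-only score loses the attacked prompts while the score that also reads upstream layers still separates them. That mirrors the abstract's claim qualitatively and says nothing about the paper's actual numbers or method.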
