Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
Kemal Derya, Berk Sunar

TL;DR
This paper revisits JBShield, a jailbreak defense for LLMs. It proposes an adaptive attack (JB-GCG) that breaks the defense and a more robust detection method (RTV), showing that multi-layer representation analysis holds up better than single-layer concept similarity.
Contribution
The paper introduces JB-GCG, an adaptive attack on JBShield, and RTV, a detection method based on multi-layer representations, and uses both to expose the limitations of defenses that rely on single-layer concept similarity.
Findings
JB-GCG achieves an attack success rate (ASR) of up to 53.4% against JBShield on Llama-3-8B, averaging 46.2% across five configurations.
RTV detects jailbreak prompts with 0.99 AUROC, significantly improving robustness.
Adaptive attacks remain a challenge, but RTV reduces their success rate to 7%.
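For readers unfamiliar with the AUROC metric quoted above, the following is a minimal sketch of how it can be computed from detector scores. It is not the paper's evaluation code: the function name `auroc`, the toy scores, and the convention that a higher score means "more likely a jailbreak" are assumptions made for illustration.

```python
import numpy as np

def auroc(scores, labels):
    """Area under the ROC curve via pairwise ranking.

    scores: detector outputs (assumed: higher = more likely jailbreak).
    labels: 1 for jailbreak prompts, 0 for benign prompts.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]    # scores of jailbreak prompts
    neg = scores[~labels]   # scores of benign prompts
    # Count (jailbreak, benign) pairs where the jailbreak scores higher;
    # ties count as half a correct ranking.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy data, invented for illustration only.
perfect = auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

An AUROC of 1.0 means every jailbreak prompt scores above every benign prompt, and 0.5 corresponds to chance, so the reported 0.99 indicates near-perfect separation on the evaluated data.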
Abstract
Defending large language models (LLMs) against jailbreak attacks, such as Greedy Coordinate Gradient (GCG), remains a challenge, particularly under adaptive threat models where an attacker directly targets the defense mechanism. JBShield, a recent jailbreak defense with a 0% attack success rate in some settings, detects malicious prompts via two concept signals, a toxic concept and a jailbreak concept. We design JB-GCG, which modifies GCG's objective to combine two terms: refusal-direction suppression via cosine similarity between the refusal direction and hidden-state representations, and toxic-concept regularization via JBShield's own toxic concept score. Across five configurations on Llama-3-8B, JB-GCG achieves an average ASR of 46.2%, reaching up to 53.4% in the strongest setting. We further show that our attack remains effective against JBShield-M, achieving ASR up to 30.7% across…
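The two-term objective described in the abstract can be illustrated with a minimal numeric sketch. This is not the authors' implementation: the function names, the weighting parameter `lam`, the use of a single hidden-state vector, and the sign convention (lower is better for the attacker) are all assumptions, and the toxic concept score is passed in as a plain number rather than computed by JBShield.

```python
import numpy as np

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two vectors (eps avoids division by zero)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def combined_objective(hidden, refusal_dir, toxic_score, lam=1.0):
    """Hypothetical two-term loss, assumed to be minimized.

    Term 1: cosine similarity between a hidden-state representation and the
            refusal direction (refusal-direction suppression).
    Term 2: a toxic concept score, weighted by lam (toxic-concept
            regularization). How the paper weights or signs this term is
            not stated in the abstract; lam is an illustrative assumption.
    """
    return cosine(hidden, refusal_dir) + lam * toxic_score

# Toy vectors, invented for illustration only.
refusal_dir = np.array([0.0, 1.0])
aligned = combined_objective(np.array([0.0, 2.0]), refusal_dir, toxic_score=0.0)
orthogonal = combined_objective(np.array([1.0, 0.0]), refusal_dir, toxic_score=0.5, lam=2.0)
```

In this sketch, a hidden state aligned with the refusal direction contributes a value near 1 to the first term, while an orthogonal one contributes 0, so only the weighted toxic score remains.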
