Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

Kemal Derya; Berk Sunar

arXiv:2605.03095·cs.CR·May 6, 2026

TL;DR

This paper revisits JBShield, a representation-level jailbreak defense for LLMs. It proposes an adaptive attack (JB-GCG) that breaks the defense and a more robust detection method (RTV), and it argues that multi-layer representation analysis matters for robust jailbreak detection.

Contribution

The paper introduces JB-GCG, an adaptive attack against JBShield, and RTV, a detection method based on multi-layer representations. Together they highlight the limitations of defenses that rely on single-layer concept similarity.
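To make "concept similarity" concrete, the sketch below scores hidden states against a concept vector with cosine similarity, once per layer. It is a toy illustration only: the function names, the toy vectors, and the choice of cosine similarity as the score are assumptions made here, not the scoring rule of JBShield or RTV, which this summary does not specify.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def per_layer_similarity(hidden_states, concept_vectors):
    """One similarity score per layer: hidden_states[i] vs concept_vectors[i]."""
    return [cosine_similarity(h, c) for h, c in zip(hidden_states, concept_vectors)]

# Hypothetical 2-dimensional hidden states for three layers (toy values).
hidden_states = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
# Hypothetical concept vector, reused for every layer in this toy example.
concept_vectors = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

layer_scores = per_layer_similarity(hidden_states, concept_vectors)
single_layer_score = layer_scores[0]  # a single-layer view reads one entry only
```

In this toy example the first layer looks perfectly aligned with the concept while the last layer is orthogonal to it, so a detector that reads only one layer sees a different picture from one that considers the whole list of per-layer scores.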

Findings

01

JB-GCG achieves an attack success rate (ASR) of up to 53.4% against JBShield on Llama-3-8B, averaging 46.2% across five configurations.

02

RTV detects jailbreak prompts with 0.99 AUROC, significantly improving robustness.

03

Adaptive attacks remain a challenge: RTV reduces their success rate to 7% but does not eliminate them.
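For readers unfamiliar with the AUROC figure quoted above, here is a minimal sketch of how the metric can be computed from detector scores. The function name, the scores, and the labels are made up for illustration and are not data from the paper. AUROC is the probability that a randomly chosen positive (here, a jailbreak prompt) receives a higher score than a randomly chosen negative (a benign prompt), with ties counted as one half.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison of positive and negative scores.

    labels use 1 for the positive class (jailbreak) and 0 for the negative
    class (benign). Runs in O(P * N) time, which is fine for small examples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical detector scores (higher means "more likely a jailbreak").
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)  # 8 of 9 pairs ranked correctly
```

An AUROC of 1.0 means every jailbreak prompt outscores every benign prompt, and 0.5 corresponds to random ranking. The metric is threshold-free, so it says nothing by itself about the false-positive rate at any particular operating point.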

Abstract

Defending large language models (LLMs) against jailbreak attacks, such as Greedy Coordinate Gradient (GCG), remains a challenge, particularly under adaptive threat models where an attacker directly targets the defense mechanism. JBShield, a recent jailbreak defense with a 0% attack success rate in some settings, detects malicious prompts via two concept signals, a toxic concept and a jailbreak concept. We design JB-GCG, which modifies GCG's objective to combine two terms: refusal-direction suppression via cosine similarity between the refusal direction and hidden-state representations, and toxic-concept regularization via JBShield's own toxic concept score. Across five configurations on Llama-3-8B, JB-GCG achieves an average ASR of 46.2%, reaching up to 53.4% in the strongest setting. We further show that our attack remains effective against JBShield-M, achieving ASR up to 30.7% across…
