JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Wenhui Zhang, Qinglong Wang, and Rui Zheng

TL;DR
JailbreakLens offers a comprehensive interpretation framework that analyzes how jailbreak prompts manipulate model representations and circuits, revealing the underlying mechanisms that cause LLMs to produce unsafe responses.
Contribution
This paper introduces JailbreakLens, a novel framework that jointly analyzes representation shifts and circuit activations to understand jailbreak mechanisms in LLMs.
Findings
Jailbreak prompts shift the model's representations of harmful prompts toward the safe cluster.
Jailbreak prompts amplify components that reinforce affirmative responses and suppress those that produce refusal.
Representation deception correlates strongly with shifts in circuit activation.
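To make the first finding concrete, the sketch below shows one simple way to quantify a shift toward a safe cluster: project each representation onto the direction running from the harmful-prompt centroid to the safe-prompt centroid. This is an illustrative toy and not the procedure used in the paper. The function name `safety_projection`, the synthetic Gaussian "representations", and the 0.7 shift factor are all assumptions made for the example.

```python
import numpy as np


def safety_projection(safe, harmful, query):
    """Score query vectors along the harmful-to-safe centroid direction.

    A score near 0 means the vector sits at the harmful centroid and a score
    near 1 means it sits at the safe centroid. (Hypothetical helper, not
    taken from the paper.)
    """
    mu_safe = safe.mean(axis=0)
    mu_harm = harmful.mean(axis=0)
    direction = mu_safe - mu_harm
    # Normalise by the squared centroid distance so the safe centroid maps to 1.
    return (query - mu_harm) @ direction / (direction @ direction)


# Synthetic stand-ins for hidden states; real ones would come from an LLM layer.
rng = np.random.default_rng(0)
dim = 16
safe = rng.normal(loc=1.0, scale=0.1, size=(50, dim))
harmful = rng.normal(loc=-1.0, scale=0.1, size=(50, dim))

# Toy "jailbreak": move each harmful vector 70% of the way toward the safe centroid.
jailbreak = harmful + 0.7 * (safe.mean(axis=0) - harmful.mean(axis=0))

score_harmful = safety_projection(safe, harmful, harmful).mean()
score_jailbreak = safety_projection(safe, harmful, jailbreak).mean()
score_safe = safety_projection(safe, harmful, safe).mean()

print(f"harmful:   {score_harmful:.2f}")
print(f"jailbreak: {score_jailbreak:.2f}")
print(f"safe:      {score_safe:.2f}")
```

In this toy setup the jailbreak score lands between the harmful and safe scores by construction. With real hidden states, the same projection would only be a rough proxy for how far a jailbreak prompt moves a harmful prompt toward the safe cluster.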
Abstract
Despite the outstanding performance of Large Language Models (LLMs) in diverse tasks, they are vulnerable to jailbreak attacks, wherein adversarial prompts are crafted to bypass their security mechanisms and elicit unexpected responses. Although jailbreak attacks are prevalent, the understanding of their underlying mechanisms remains limited. Recent studies have explained typical jailbreaking behavior of LLMs (e.g., the degree to which the model refuses to respond) by analyzing the representation shifts that jailbreak prompts cause in the latent space, or by identifying key neurons that contribute to the success of jailbreak attacks. However, these studies neither explore diverse jailbreak patterns nor provide a fine-grained explanation that runs from circuit failures to representational changes, leaving significant gaps in uncovering the jailbreak mechanism. In this paper, we propose…
Taxonomy
Topics: Law in Society and Culture · Digital Media Forensic Detection
