JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit

Zeqing He; Zhibo Wang; Zhixuan Chu; Huiyu Xu; Wenhui Zhang; Qinglong Wang; and Rui Zheng

arXiv:2411.11114·cs.CR·April 25, 2025

Open Access

TL;DR

JailbreakLens offers a comprehensive interpretation framework that analyzes how jailbreak prompts manipulate model representations and circuits, revealing the underlying mechanisms that cause LLMs to produce unsafe responses.

Contribution

This paper introduces JailbreakLens, a novel framework that jointly analyzes representation shifts and circuit activations to understand jailbreak mechanisms in LLMs.

Findings

01

Jailbreak prompts shift model representations toward safe clusters.

02

Manipulation amplifies affirmative response components and suppresses refusal signals.

03

Representation deception correlates strongly with circuit activation shifts.
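
As a rough illustration of finding 01, the sketch below scores how far a prompt's hidden representation sits along the line from a "harmful" cluster centroid to a "safe" one. It is a minimal sketch under stated assumptions, not the paper's method: the function names (`cluster_centroid`, `safe_shift_score`), the centroid-projection metric, and the synthetic 8-dimensional vectors are all hypothetical stand-ins for real model activations.

```python
import numpy as np


def cluster_centroid(reps):
    """Mean representation of a cluster (rows = prompts, columns = hidden dims)."""
    return np.asarray(reps, dtype=float).mean(axis=0)


def safe_shift_score(rep, safe_centroid, harmful_centroid):
    """Position of `rep` along the harmful-to-safe axis.

    0.0 means the projection lands on the harmful centroid,
    1.0 means it lands on the safe centroid.
    This metric is an illustrative assumption, not taken from the paper.
    """
    direction = safe_centroid - harmful_centroid
    denom = float(direction @ direction)
    if denom == 0.0:
        raise ValueError("centroids coincide; the direction is undefined")
    offset = np.asarray(rep, dtype=float) - harmful_centroid
    return float(offset @ direction / denom)


# Synthetic stand-ins for hidden states; real use would extract these from a model.
rng = np.random.default_rng(0)
safe_reps = rng.normal(loc=1.0, scale=0.1, size=(50, 8))
harmful_reps = rng.normal(loc=-1.0, scale=0.1, size=(50, 8))

safe_c = cluster_centroid(safe_reps)
harm_c = cluster_centroid(harmful_reps)

# A plain harmful prompt, and a hypothetical jailbreak representation that has
# been moved 70% of the way from the harmful centroid toward the safe one.
plain_harmful = harmful_reps[0]
jailbreak_like = harm_c + 0.7 * (safe_c - harm_c)

score_plain = safe_shift_score(plain_harmful, safe_c, harm_c)
score_jailbreak = safe_shift_score(jailbreak_like, safe_c, harm_c)
print(f"plain harmful: {score_plain:.3f}, jailbreak-like: {score_jailbreak:.3f}")
```

Under these assumptions, a higher score for the jailbreak-like vector than for the plain harmful one is what a shift "toward safe clusters" would look like numerically.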

Abstract

Despite the outstanding performance of Large Language Models (LLMs) in diverse tasks, they are vulnerable to jailbreak attacks, wherein adversarial prompts are crafted to bypass their security mechanisms and elicit unexpected responses. Although jailbreak attacks are prevalent, the understanding of their underlying mechanisms remains limited. Recent studies have explained typical jailbreaking behavior of LLMs (e.g., the degree to which the model refuses to respond) by analyzing representation shifts in their latent space caused by jailbreak prompts or by identifying key neurons that contribute to the success of jailbreak attacks. However, these studies neither explore diverse jailbreak patterns nor provide a fine-grained explanation that links circuit failures to changes in representations, leaving significant gaps in uncovering the jailbreak mechanism. In this paper, we propose…

Taxonomy

Topics: Law in Society and Culture · Digital Media Forensic Detection