A Causal Perspective for Enhancing Jailbreak Attack and Defense

Licheng Pan; Yunsheng Lu; Jiexi Liu; Jialing Tao; Haozhe Feng; Hui Xue; Zhixuan Chu; Kui Ren

arXiv:2602.04893 · cs.LG · February 6, 2026

TL;DR

This paper introduces Causal Analyst, a framework that uncovers causal relationships between prompt features and jailbreaks in LLMs, enabling improved attack and defense strategies through causal discovery and analysis.

Contribution

It presents a novel causal discovery framework integrating LLMs and GNNs, along with a large dataset and practical applications for enhancing jailbreak attack and defense.

Findings

1. Identified key causal features like 'Positive Character' and 'Number of Task Steps'.
2. Demonstrated improved attack success rates using causal insights.
3. Validated the robustness of causal analysis over non-causal methods.

Abstract

Uncovering the mechanisms behind "jailbreaks" in large language models (LLMs) is crucial for enhancing their safety and reliability, yet these mechanisms remain poorly understood. Existing studies predominantly analyze jailbreak prompts by probing latent representations, often overlooking the causal relationships between interpretable prompt features and jailbreak occurrences. In this work, we propose Causal Analyst, a framework that integrates LLMs into data-driven causal discovery to identify the direct causes of jailbreaks and leverage them for both attack and defense. We introduce a comprehensive dataset comprising 35k jailbreak attempts across seven LLMs, systematically constructed from 100 attack templates and 50 harmful queries, annotated with 37 meticulously designed human-readable prompt features. By jointly training LLM-based prompt encoding and GNN-based causal graph…
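The dataset size quoted in the abstract can be checked with a short calculation. The sketch below assumes that every attack template is paired with every harmful query and run against every model (a full cross product); the abstract does not state this explicitly, but it is consistent with the 35k figure. The integer IDs and the function name `build_attempt_grid` are hypothetical placeholders, not the authors' code.

```python
from itertools import product

# Counts taken from the abstract.
N_TEMPLATES = 100  # attack templates
N_QUERIES = 50     # harmful queries
N_MODELS = 7       # target LLMs


def build_attempt_grid(n_templates=N_TEMPLATES, n_queries=N_QUERIES, n_models=N_MODELS):
    """Enumerate (template_id, query_id, model_id) triples.

    Assumption: a full cross product of templates, queries, and models.
    The IDs are placeholder integers, not real prompts or model names.
    """
    return list(product(range(n_templates), range(n_queries), range(n_models)))


attempts = build_attempt_grid()
print(len(attempts))  # 100 * 50 * 7 = 35000
```

Under this assumption, each of the 35,000 triples would correspond to one jailbreak attempt, annotated with the 37 prompt features the abstract describes.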

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Misinformation and Its Impacts