A Causal Perspective for Enhancing Jailbreak Attack and Defense
Licheng Pan, Yunsheng Lu, Jiexi Liu, Jialing Tao, Haozhe Feng, Hui Xue, Zhixuan Chu, Kui Ren

TL;DR
This paper introduces Causal Analyst, a framework that uncovers causal relationships between prompt features and jailbreaks in LLMs, enabling improved attack and defense strategies through causal discovery and analysis.
Contribution
It presents a novel causal discovery framework integrating LLMs and GNNs, along with a large dataset and practical applications for enhancing jailbreak attack and defense.
Findings
Identified key causal features like 'Positive Character' and 'Number of Task Steps'.
Demonstrated improved attack success rates using causal insights.
Validated that causal analysis is more robust than non-causal methods.
Abstract
Uncovering the mechanisms behind "jailbreaks" in large language models (LLMs) is crucial for enhancing their safety and reliability, yet these mechanisms remain poorly understood. Existing studies predominantly analyze jailbreak prompts by probing latent representations, often overlooking the causal relationships between interpretable prompt features and jailbreak occurrences. In this work, we propose Causal Analyst, a framework that integrates LLMs into data-driven causal discovery to identify the direct causes of jailbreaks and leverage them for both attack and defense. We introduce a comprehensive dataset comprising 35k jailbreak attempts across seven LLMs, systematically constructed from 100 attack templates and 50 harmful queries, annotated with 37 meticulously designed human-readable prompt features. By jointly training LLM-based prompt encoding and GNN-based causal graph…
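The dataset size quoted in the abstract is consistent with a full cross product of templates, queries, and models. The sketch below checks that arithmetic. The abstract does not state that every combination was actually run, so the cross-product construction is an assumption, and the integer identifiers are hypothetical placeholders rather than the paper's data.

```python
from itertools import product

# Counts taken from the abstract.
N_TEMPLATES = 100  # attack templates
N_QUERIES = 50     # harmful queries
N_MODELS = 7       # target LLMs

# Hypothetical integer IDs standing in for the real templates, queries, and models.
templates = range(N_TEMPLATES)
queries = range(N_QUERIES)
models = range(N_MODELS)

# Assumption: one jailbreak attempt per (template, query, model) combination.
attempts = list(product(templates, queries, models))
n_attempts = len(attempts)

print(n_attempts)
```

Under this assumption the count is 100 × 50 × 7 = 35,000, matching the "35k jailbreak attempts" figure.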
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Misinformation and Its Impacts
