Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models
Shubham Kumar, Narendra Ahuja

TL;DR
This paper introduces LOCA, a method that gives local, causal explanations of why a specific jailbreak prompt bypasses the safety training of a large language model. It does so by identifying a minimal set of changes that make the model refuse.
Contribution
LOCA offers the first local, causal explanation approach for individual jailbreak successes, and it needs fewer changes to induce refusal than prior global explanation methods.
Findings
LOCA induces model refusal with six changes on average.
Prior methods often fail to induce refusal even after 20 changes.
LOCA advances understanding of jailbreak mechanisms in LLMs.
Abstract
Safety-trained large language models (LLMs) can often be induced to answer harmful requests through jailbreak prompts. Because we lack a robust understanding of why LLMs are susceptible to jailbreaks, future frontier models operating more autonomously in higher-stakes settings may be similarly vulnerable to such attacks. Prior work has studied jailbreak success by examining the model's intermediate representations and identifying directions in this space that causally encode concepts like harmfulness and refusal. These works then explain all jailbreak attacks globally, as attempts to reduce or strengthen these concepts (e.g., to reduce harmfulness). However, different jailbreak strategies may succeed by strengthening or suppressing different intermediate concepts, and the same jailbreak strategy may not work for different harmful request categories (e.g., violence vs. cyberattack); thus, we seek…
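To make the idea of a small set of representation changes that induces refusal concrete, here is a toy sketch. It is not the LOCA algorithm, whose details are not given in this summary. It assumes a hypothetical linear `refusal_score` over a hidden vector and hypothetical unit concept directions, and it greedily picks one signed shift per direction until the score crosses a threshold. All names, values, and the greedy strategy itself are illustrative assumptions.

```python
import numpy as np

def greedy_minimal_edits(h, directions, refusal_score,
                         step=1.0, threshold=0.0, max_edits=20):
    """Greedily shift h along concept directions until refusal_score > threshold.

    Each direction is used at most once. Returns (edits, edited_h), where
    edits is a list of (direction_index, sign) in the order they were applied.
    This is a toy stand-in, not the method from the paper.
    """
    h = h.copy()
    edits, used = [], set()
    for _ in range(max_edits):
        current = refusal_score(h)
        if current > threshold:
            break  # the (toy) model now refuses
        best = None
        for i, d in enumerate(directions):
            if i in used:
                continue
            for sign in (1.0, -1.0):
                cand = h + sign * step * d
                s = refusal_score(cand)
                if best is None or s > best[0]:
                    best = (s, i, sign, cand)
        if best is None or best[0] <= current:
            break  # no remaining edit raises the score
        _, i, sign, h = best
        used.add(i)
        edits.append((i, sign))
    return edits, h

# Hypothetical setup: 4 axis-aligned "concept" directions and a linear score.
directions = np.eye(4)
w = np.array([2.0, -1.0, 0.5, 0.0])
refusal_score = lambda h: float(w @ h) - 3.0  # positive means "refuse"

h0 = np.zeros(4)  # stands in for the representation of a successful jailbreak
edits, h_edit = greedy_minimal_edits(h0, directions, refusal_score)
print(edits, refusal_score(h_edit))
```

In this toy setting, the returned list plays the role of a local explanation: it names which concepts had to be strengthened or suppressed for this one input before the score flipped. A real model would replace the linear score with the model's actual refusal behavior, and greedy selection does not guarantee a truly minimal set.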
