What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
Nathalie Kirch, Constantin Weisser, Severin Field, Helen Yannakoudakis, Stephen Casper

TL;DR
This paper investigates the internal features and mechanisms that enable successful jailbreak attacks on large language models, revealing that non-linear features play a significant role and proposing a probe-guided intervention approach.
Contribution
It introduces a novel dataset of jailbreak attempts, analyzes linear and non-linear features in prompts, and develops a probe-guided intervention method to understand and influence jailbreak success.
Findings
Non-linear features are crucial for jailbreak success.
Different jailbreaks rely on distinct internal mechanisms.
Non-linear probes produce more effective interventions.
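This summary does not say how the probe-guided intervention is carried out. One plausible reading, offered only as an assumption, is that a hidden state is nudged against the gradient of a probe's output so that the predicted jailbreak success drops. The minimal sketch below illustrates that idea on a two-dimensional toy state. The probe weights `W`, `C`, `B` and the helper names `probe_logit`, `logit_gradient`, and `intervene` are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical probe parameters, hand-set for illustration (not from the paper).
W = [1.5, -0.5]  # weights on the two hidden-state coordinates
C = 2.0          # weight on the interaction term h[0] * h[1] (the non-linear part)
B = -0.2         # bias

def probe_logit(h):
    """Toy non-linear probe: a linear part plus one interaction term."""
    return W[0] * h[0] + W[1] * h[1] + C * h[0] * h[1] + B

def probe_prob(h):
    """Predicted probability of jailbreak success under the toy probe."""
    return 1.0 / (1.0 + math.exp(-probe_logit(h)))

def logit_gradient(h):
    """Analytic gradient of the logit; it depends on h because of the interaction term."""
    return [W[0] + C * h[1], W[1] + C * h[0]]

def intervene(h, step_size=0.1, steps=10):
    """Take fixed-length steps against the probe gradient (assumed intervention rule)."""
    h = list(h)
    for _ in range(steps):
        g = logit_gradient(h)
        norm = math.sqrt(g[0] ** 2 + g[1] ** 2) or 1.0  # guard against a zero gradient
        h = [hi - step_size * gi / norm for hi, gi in zip(h, g)]
    return h

h_before = [0.8, 0.6]
h_after = intervene(h_before)
p_before, p_after = probe_prob(h_before), probe_prob(h_after)
print(f"predicted success before: {p_before:.2f}, after: {p_after:.2f}")
```

With a purely linear probe (`C = 0`) the gradient would be the same fixed direction for every input. The interaction term makes the steering direction depend on the current state, which is one way a non-linear probe could guide a more targeted intervention.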
Abstract
Jailbreaks have been a central focus of research regarding the safety and reliability of large language models (LLMs), yet the mechanisms underlying these attacks remain poorly understood. While previous studies have predominantly relied on linear methods to detect jailbreak attempts and model refusals, we take a different approach by examining both linear and non-linear features in prompts that lead to successful jailbreaks. First, we introduce a novel dataset comprising 10,800 jailbreak attempts spanning 35 diverse attack methods. Leveraging this dataset, we train linear and non-linear probes on hidden states of open-weight LLMs to predict jailbreak success. Probes achieve strong in-distribution accuracy but transfer is attack-family-specific, revealing that different jailbreaks are supported by distinct internal mechanisms rather than a single universal direction. To establish causal…
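The difference between linear and non-linear probes described in the abstract can be illustrated with a toy sketch. It assumes synthetic two-dimensional "hidden states" with an XOR-style label in place of real LLM activations. It also swaps in logistic regression on a quadratic feature expansion as a simple stand-in for a non-linear probe, since the probe architecture is not specified here. All function names are hypothetical.

```python
import math
import random

def make_xor_states(n, seed=0):
    """Synthetic 2-D 'hidden states'; label is 1 when the coordinate signs differ."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        label = 1 if (x1 > 0) != (x2 > 0) else 0  # not linearly separable
        data.append(((x1, x2), label))
    return data

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(data, featurize, epochs=300, lr=0.5):
    """Logistic regression on featurize(x), fit by full-batch gradient descent."""
    dim = len(featurize(data[0][0]))
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in data:
            f = featurize(x)
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            for i, fi in enumerate(f):
                gw[i] += err * fi
            gb += err
        w = [wi - lr * gi / len(data) for wi, gi in zip(w, gw)]
        b -= lr * gb / len(data)
    return w, b

def accuracy(probe, data, featurize):
    """Fraction of examples whose thresholded logit matches the label."""
    w, b = probe
    hits = 0
    for x, y in data:
        z = sum(wi * fi for wi, fi in zip(w, featurize(x))) + b
        hits += int((z > 0) == (y == 1))
    return hits / len(data)

def linear_features(x):
    return [x[0], x[1]]

def quadratic_features(x):
    # Adds the interaction term, a stand-in for a non-linear probe.
    return [x[0], x[1], x[0] * x[1]]

train, test = make_xor_states(400, seed=0), make_xor_states(200, seed=1)
linear_acc = accuracy(train_probe(train, linear_features), test, linear_features)
nonlinear_acc = accuracy(train_probe(train, quadratic_features), test, quadratic_features)
print(f"linear probe accuracy: {linear_acc:.2f}")
print(f"non-linear probe accuracy: {nonlinear_acc:.2f}")
```

The XOR labels are chosen deliberately. No single direction separates them, so the linear probe stays close to chance while the probe with the interaction term fits the data well. This mirrors, in miniature, the case where the relevant information is encoded non-linearly.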
Taxonomy
Topics: Digital and Cyber Forensics · Cybercrime and Law Enforcement Studies · Law, AI, and Intellectual Property
