What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks

Nathalie Kirch; Constantin Weisser; Severin Field; Helen Yannakoudakis; Stephen Casper

arXiv:2411.03343 · cs.CR · November 4, 2025

TL;DR

This paper investigates the internal features and mechanisms that enable successful jailbreak attacks on large language models, revealing that non-linear features play a significant role and proposing a probe-guided intervention approach.

Contribution

The paper introduces a dataset of 10,800 jailbreak attempts spanning 35 attack methods, trains linear and non-linear probes on hidden states to analyze which prompt features predict jailbreak success, and develops a probe-guided intervention method for testing and influencing that success.

Findings

01. Non-linear features are crucial for jailbreak success.
02. Different jailbreaks rely on distinct internal mechanisms.
03. Non-linear probes produce more effective interventions.
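
This page does not spell out how the probe-guided intervention works. One plausible reading, offered only as a sketch, is that a hidden state is nudged along the gradient of a probe's output so that the probe's predicted jailbreak success rises or falls. Everything below is an assumption made for illustration: the toy dimensions, the randomly initialised (untrained) MLP probe, and the names `probe_logit`, `probe_grad`, and `intervene`.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, HIDDEN = 16, 8  # toy sizes, chosen only for illustration

# Hypothetical, randomly initialised MLP probe. A real probe would be
# trained to predict jailbreak success from a model's hidden states.
W1 = rng.standard_normal((HIDDEN, DIM)) / np.sqrt(DIM)
b1 = np.zeros(HIDDEN)
w2 = rng.standard_normal(HIDDEN)


def probe_logit(h):
    """Probe score for hidden state h (higher = 'jailbreak succeeds')."""
    return float(w2 @ np.tanh(W1 @ h + b1))


def probe_grad(h):
    """Analytic gradient of probe_logit with respect to h."""
    a = np.tanh(W1 @ h + b1)
    return W1.T @ (w2 * (1.0 - a ** 2))


def intervene(h, step=0.01, towards_success=False):
    """Move h a small distance along the normalised probe gradient."""
    g = probe_grad(h)
    direction = g / (np.linalg.norm(g) + 1e-12)
    sign = 1.0 if towards_success else -1.0
    return h + sign * step * direction


h = rng.standard_normal(DIM)   # stand-in for one hidden state
h_new = intervene(h)           # push towards 'jailbreak fails'
print(probe_logit(h), "->", probe_logit(h_new))
```

For a linear probe the gradient is the same weight vector at every input, so the intervention reduces to adding or subtracting one fixed direction. For a non-linear probe such as this MLP, the gradient depends on the hidden state itself, so the direction of the nudge can differ from prompt to prompt.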

Abstract

Jailbreaks have been a central focus of research regarding the safety and reliability of large language models (LLMs), yet the mechanisms underlying these attacks remain poorly understood. While previous studies have predominantly relied on linear methods to detect jailbreak attempts and model refusals, we take a different approach by examining both linear and non-linear features in prompts that lead to successful jailbreaks. First, we introduce a novel dataset comprising 10,800 jailbreak attempts spanning 35 diverse attack methods. Leveraging this dataset, we train linear and non-linear probes on hidden states of open-weight LLMs to predict jailbreak success. Probes achieve strong in-distribution accuracy but transfer is attack-family-specific, revealing that different jailbreaks are supported by distinct internal mechanisms rather than a single universal direction. To establish causal…
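
The evaluation pattern described in the abstract (strong in-distribution accuracy, weak transfer across attack families) can be illustrated with a leave-one-family-out split. The sketch below is a toy under stated assumptions, not the paper's pipeline: the "hidden states" are synthetic, each attack family is constructed to encode success along its own axis, and the probe is a plain logistic regression. Because the data is built that way, the gap it prints follows from the construction and is not evidence about real models. The names `make_family`, `train_linear_probe`, and `accuracy` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy hidden-state width; real models use thousands of dimensions


def make_family(direction_index, n):
    """Synthetic 'hidden states' for one attack family.

    Assumption: success (y=1) vs failure (y=0) shifts the state along a
    family-specific axis. This is a toy construction, not the paper's data.
    """
    y = rng.integers(0, 2, size=n)
    h = 0.3 * rng.standard_normal((n, DIM))
    h[:, direction_index] += 2.0 * y - 1.0  # +1 for success, -1 for failure
    return h, y


def train_linear_probe(h, y, lr=0.5, steps=200):
    """Logistic-regression probe fitted with plain gradient descent."""
    w, b = np.zeros(h.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(h @ w + b)))
        w -= lr * (h.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b


def accuracy(w, b, h, y):
    """Fraction of states whose predicted label matches y."""
    return float(np.mean(((h @ w + b) > 0) == (y == 1)))


# Train on families 0 and 1; family 2 is held out entirely.
train = [make_family(k, 500) for k in (0, 1)]
h_train = np.vstack([h for h, _ in train])
y_train = np.concatenate([y for _, y in train])
w, b = train_linear_probe(h_train, y_train)

h_seen, y_seen = make_family(0, 500)      # fresh samples from a seen family
h_unseen, y_unseen = make_family(2, 500)  # family never seen in training

in_dist_acc = accuracy(w, b, h_seen, y_seen)
held_out_acc = accuracy(w, b, h_unseen, y_unseen)
print(f"in-distribution: {in_dist_acc:.2f}, held-out family: {held_out_acc:.2f}")
```

The same split applies to a non-linear probe: only `train_linear_probe` and `accuracy` would change, while the held-out family stays excluded from training.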

Code & Models

Repositories

NLie2/what_features_jailbreak_LLMs
PyTorch · Official

Videos

What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks · underline

Taxonomy

Topics: Digital and Cyber Forensics · Cybercrime and Law Enforcement Studies · Law, AI, and Intellectual Property