Break Me If You Can: Self-Jailbreaking of Aligned LLMs via Lexical Insertion Prompting
Devang Kulshreshtha, Hang Su, Haibo Jin, Chinmay Hegde, Haohan Wang

TL;DR
This paper introduces self-jailbreaking, a threat model in which an aligned language model is compromised using only its own internal knowledge, and demonstrates it with SLIP, a lexical insertion prompting method that reaches 90-100% attack success on most of the tested models.
Contribution
The paper presents SLIP (Self-Jailbreaking via Lexical Insertion Prompting), a black-box algorithm for self-jailbreaking LLMs that achieves high attack success rates with fewer model calls than prior methods. It also analyzes defenses such as regex-based and embedding-based detection.
Findings
SLIP achieves a 90-100% attack success rate (average 94.7%) across most of the eleven tested models on AdvBench and HarmBench.
SLIP requires approximately 7.9 LLM calls on average, fewer than prior methods.
The Semantic Drift Monitor detects 76% of attacks at a 5% false positive rate (FPR), but remains vulnerable to adaptive strategies.
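For readers unfamiliar with the "detection rate at a fixed FPR" metric used in the last finding, the following is a minimal sketch of how such a number is typically computed from detector scores. It is not the paper's Semantic Drift Monitor: the function name `detection_rate_at_fpr`, the scoring convention (higher score means more suspicious), and all scores are assumptions made purely for illustration.

```python
import math


def detection_rate_at_fpr(benign_scores, attack_scores, target_fpr=0.05):
    """Pick a threshold from benign scores, then measure detection on attacks.

    Hypothetical helper: assumes a higher score means "more suspicious" and
    that an input is flagged when its score is strictly above the threshold.
    Returns (threshold, true_positive_rate, false_positive_rate).
    """
    if not benign_scores or not attack_scores:
        raise ValueError("both score lists must be non-empty")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")

    # Number of benign inputs we are allowed to flag by mistake.
    allowed = math.floor(target_fpr * len(benign_scores))

    # Use the (allowed + 1)-th highest benign score as the threshold, so at
    # most `allowed` benign scores lie strictly above it.
    ranked = sorted(benign_scores, reverse=True)
    threshold = ranked[allowed]

    fpr = sum(s > threshold for s in benign_scores) / len(benign_scores)
    tpr = sum(s > threshold for s in attack_scores) / len(attack_scores)
    return threshold, tpr, fpr


if __name__ == "__main__":
    # Synthetic scores, not data from the paper.
    benign = [i / 100 for i in range(20)]
    attacks = [0.10, 0.15, 0.25, 0.30, 0.40]
    threshold, tpr, fpr = detection_rate_at_fpr(benign, attacks)
    print(f"threshold={threshold:.2f} tpr={tpr:.2f} fpr={fpr:.2f}")
```

Fixing the threshold on benign traffic first and only then measuring detection on attacks is what makes a figure such as "76% at 5% FPR" comparable across detectors. An adaptive attacker who can lower their score below that threshold reduces the detection rate without changing the FPR.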
Abstract
We introduce self-jailbreaking, a threat model in which an aligned LLM guides its own compromise. Unlike most jailbreak techniques, which often rely on handcrafted prompts or separate attacker models, self-jailbreaking requires no external red-team LLM: the target model's own internal knowledge suffices. We operationalize this via Self-Jailbreaking via Lexical Insertion Prompting (SLIP), a black-box algorithm that casts jailbreaking as breadth-first tree search over multi-turn dialogues, incrementally inserting missing content words from the attack goal into benign prompts using the target model as its own guide. Evaluations on AdvBench and HarmBench show SLIP achieves 90-100% Attack Success Rate (ASR) (avg. 94.7%) across most of the eleven tested models (including GPT-5.1, Claude-Sonnet-4.5, Gemini-2.5-Pro, and DeepSeek-V3), with only …
