Break Me If You Can: Self-Jailbreaking of Aligned LLMs via Lexical Insertion Prompting

Devang Kulshreshtha; Hang Su; Haibo Jin; Chinmay Hegde; Haohan Wang

arXiv:2601.02670·cs.CL·April 10, 2026

TL;DR

This paper introduces self-jailbreaking, a threat model in which an aligned language model is compromised using its own internal knowledge, and demonstrates it with SLIP, a lexical insertion prompting method that reaches a 90-100% attack success rate on most tested models.

Contribution

The paper presents SLIP, a black-box algorithm for self-jailbreaking LLMs that achieves high attack success rates with fewer model calls than prior methods, and it analyzes defenses such as regex and embedding-based detection.

Findings

01

SLIP achieves 90-100% attack success rate across multiple models.

02

SLIP requires approximately 7.9 LLM calls on average, fewer than prior methods.

03

The Semantic Drift Monitor detects 76% of attacks at a 5% false positive rate (FPR), but remains vulnerable to adaptive strategies.
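
To make the "detection rate at 5% FPR" figure concrete, the sketch below shows how such an operating point is typically computed for any score-based detector. It is not the paper's Semantic Drift Monitor: the per-dialogue "drift score", the function names `threshold_at_fpr` and `rate_above`, and the synthetic score distributions are all assumptions made for illustration. The idea is to calibrate a threshold on benign dialogues so that at most 5% of them are flagged, then measure the fraction of attack dialogues that exceed it.

```python
import random


def threshold_at_fpr(benign_scores, target_fpr):
    """Pick a threshold so that at most `target_fpr` of benign scores exceed it.

    Hypothetical helper: scores are assumed to be per-dialogue drift scores
    where higher means more suspicious.
    """
    ranked = sorted(benign_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # benign dialogues we may flag
    if allowed >= len(ranked):
        return float("-inf")  # every benign dialogue may be flagged
    # Only scores strictly above ranked[allowed] are flagged, so at most
    # `allowed` benign dialogues are false positives.
    return ranked[allowed]


def rate_above(scores, threshold):
    """Fraction of scores strictly greater than the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


if __name__ == "__main__":
    # Synthetic scores for illustration only; not data from the paper.
    random.seed(0)
    benign = [random.gauss(0.0, 1.0) for _ in range(1000)]
    attack = [random.gauss(2.0, 1.0) for _ in range(1000)]

    t = threshold_at_fpr(benign, 0.05)
    print("threshold:", round(t, 3))
    print("false positive rate:", rate_above(benign, t))
    print("detection rate:", rate_above(attack, t))
```

A fixed-FPR threshold says nothing about adaptive attackers: an attack that keeps its score below the calibrated threshold is missed by construction, which is consistent with the caveat in the finding above.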

Abstract

We introduce self-jailbreaking, a threat model in which an aligned LLM guides its own compromise. Unlike most jailbreak techniques, which often rely on handcrafted prompts or separate attacker models, self-jailbreaking requires no external red-team LLM: the target model's own internal knowledge suffices. We operationalize this via Self-Jailbreaking via Lexical Insertion Prompting (SLIP), a black-box algorithm that casts jailbreaking as breadth-first tree search over multi-turn dialogues, incrementally inserting missing content words from the attack goal into benign prompts using the target model as its own guide. Evaluations on AdvBench and HarmBench show SLIP achieves 90-100% Attack Success Rate (ASR) (avg. 94.7%) across most of the eleven tested models (including GPT-5.1, Claude-Sonnet-4.5, Gemini-2.5-Pro, and DeepSeek-V3), with only ~7.9 …