Involuntary Jailbreak: On Self-Prompting Attacks

Yangyang Guo; Yangyan Li; Mohan Kankanhalli

arXiv:2508.13246 · cs.CR · December 30, 2025

TL;DR

This paper discloses a new vulnerability in large language models, termed involuntary jailbreak, in which a single universal prompt leads a model to bypass its own guardrails and produce harmful content, showing that current safety measures are fragile.

Contribution

The study introduces involuntary jailbreak, an attack with no specific objective that may compromise the entire LLM guardrail structure with a single universal prompt, rather than targeting localized components of it as prior attacks do.

Findings

01 Most leading LLMs can be jailbroken with a simple prompt

02 Involuntary jailbreak can bypass existing guardrails effectively

03 The vulnerability is widespread across multiple models

Abstract

In this study, we disclose a worrying new vulnerability in Large Language Models (LLMs), which we term involuntary jailbreak. Unlike existing jailbreak attacks, this weakness is distinct in that it does not involve a specific attack objective, such as generating instructions for building a bomb. Prior attack methods predominantly target localized components of the LLM guardrail. In contrast, involuntary jailbreaks may potentially compromise the entire guardrail structure, which our method reveals to be surprisingly fragile. We merely employ a single universal prompt to achieve this goal. In particular, we instruct LLMs to generate several questions that would typically be rejected, along with their corresponding in-depth responses (rather than a refusal). Remarkably, this simple prompt strategy consistently jailbreaks the majority of leading LLMs, including Claude Opus…

Taxonomy

Topics: Criminal Law and Policy · Criminal Law and Evidence