Involuntary Jailbreak: On Self-Prompting Attacks
Yangyang Guo, Yangyan Li, Mohan Kankanhalli

TL;DR
This paper discloses a new vulnerability in large language models, termed involuntary jailbreak, in which a single universal prompt bypasses guardrails and elicits harmful content, revealing how fragile current safety measures are.
Contribution
The study introduces involuntary jailbreak, a novel attack with no specific attack objective that may compromise an LLM's entire guardrail structure using a single prompt, highlighting the need for more robust safety measures.
Findings
A single universal prompt consistently jailbreaks the majority of leading LLMs
Involuntary jailbreak can bypass existing guardrails effectively
The vulnerability is widespread across multiple models
Abstract
In this study, we disclose a worrying new vulnerability in Large Language Models (LLMs), which we term involuntary jailbreak. Unlike existing jailbreak attacks, this weakness is distinct in that it does not involve a specific attack objective, such as generating instructions for building a bomb. Prior attack methods predominantly target localized components of the LLM guardrail. In contrast, involuntary jailbreaks may potentially compromise the entire guardrail structure, which our method reveals to be surprisingly fragile. We merely employ a single universal prompt to achieve this goal. In particular, we instruct LLMs to generate several questions that would typically be rejected, along with their corresponding in-depth responses (rather than a refusal). Remarkably, this simple prompt strategy consistently jailbreaks the majority of leading LLMs, including Claude Opus…
Taxonomy
Topics: Criminal Law and Policy · Criminal Law and Evidence
