TL;DR
This paper reveals that large language models' intent-aware safety guardrails are vulnerable to malicious intent manipulation, demonstrating a new attack framework that significantly outperforms existing jailbreak methods and challenges current defenses.
Contribution
The authors introduce IntentPrompt, a two-stage prompt-refinement framework that manipulates LLMs' intent detection to bypass safety guardrails, exposing weaknesses in intent-aware defenses.
Findings
IntentPrompt achieves attack success rates of up to 97% against the defenses evaluated.
The framework outperforms existing jailbreak methods across multiple benchmarks.
Vulnerabilities persist even with advanced intent analysis and chain-of-thought defenses.
Abstract
Intent detection, a core component of natural language understanding, has evolved into a crucial mechanism for safeguarding large language models (LLMs). While prior work has applied intent detection to enhance LLMs' moderation guardrails, showing significant success against content-level jailbreaks, the robustness of these intent-aware guardrails under malicious manipulation remains under-explored. In this work, we investigate the vulnerability of intent-aware guardrails and demonstrate that LLMs exhibit implicit intent detection capabilities. We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and then reframes them into declarative-style narratives, iteratively optimizing prompts via feedback loops to enhance jailbreak success for red-teaming purposes. Extensive experiments…