Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation

Jun Zhuang; Haibo Jin; Ye Zhang; Zhengjian Kang; Wenbin Zhang; Gaby G. Dagher; Haohan Wang

arXiv:2505.18556·cs.CL·August 26, 2025

TL;DR

This paper shows that the intent-aware safety guardrails of large language models are vulnerable to malicious intent manipulation. It presents an attack framework, IntentPrompt, that significantly outperforms existing jailbreak methods and remains effective against current defenses.

Contribution

The authors introduce IntentPrompt, a two-stage prompt-refinement framework that manipulates how LLMs detect intent in order to bypass safety guardrails, exposing weaknesses in intent-aware moderation.

Findings

1. IntentPrompt achieves attack success rates of up to 97% against defenses.

2. The framework outperforms existing jailbreak methods across multiple benchmarks.

3. Vulnerabilities persist even with advanced intent analysis and chain-of-thought defenses.

Abstract

Intent detection, a core component of natural language understanding, has evolved considerably into a crucial mechanism for safeguarding large language models (LLMs). While prior work has applied intent detection to enhance LLMs' moderation guardrails, showing significant success against content-level jailbreaks, the robustness of these intent-aware guardrails under malicious manipulations remains under-explored. In this work, we investigate the vulnerability of intent-aware guardrails and demonstrate that LLMs exhibit implicit intent detection capabilities. We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and further reframes them into declarative-style narratives by iteratively optimizing prompts via feedback loops to enhance jailbreak success for red-teaming purposes. Extensive experiments…
