Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent

Shang Shang; Xinqiang Zhao; Zhongjiang Yao; Yepeng Yao; Liya Su; Zijing Fan; Xiaodan Zhang; Zhengwei Jiang

arXiv:2405.03654 · cs.CR · May 8, 2024 · 1 citation

TL;DR

This paper introduces IntentObfuscator, a black-box attack method that obfuscates user intent to bypass LLM content filters, demonstrating high success rates across multiple models and sensitive content types.

Contribution

The paper presents a novel framework and attack methodology for evading LLM content security by obfuscating malicious intent, with empirical validation on several popular models.

Findings

01. Achieved an average jailbreak success rate of 69.21% across ChatGPT-3.5, ChatGPT-4, Qwen, and Baichuan.

02. Reached a success rate of 83.65% on ChatGPT-3.5.

03. Remained effective across various types of sensitive content.

Abstract

To demonstrate and address the underlying maliciousness, we propose a theoretical hypothesis and analytical approach, and introduce a new black-box jailbreak attack methodology named IntentObfuscator, exploiting this identified flaw by obfuscating the true intentions behind user prompts. This approach compels LLMs to inadvertently generate restricted content, bypassing their built-in content security measures. We detail two implementations under this framework: "Obscure Intention" and "Create Ambiguity", which manipulate query complexity and ambiguity to evade malicious intent detection effectively. We empirically validate the effectiveness of the IntentObfuscator method across several models, including ChatGPT-3.5, ChatGPT-4, Qwen and Baichuan, achieving an average jailbreak success rate of 69.21%. Notably, our tests on ChatGPT-3.5, which claims 100 million weekly active users,…

Taxonomy

Topics: Digital and Cyber Forensics · Cryptography and Data Security · Advanced Malware Detection Techniques