Self-Deception: Reverse Penetrating the Semantic Firewall of Large Language Models
Zhenhua Wang, Wei Xie, Kai Chen, Baosheng Wang, Zhiwen Gui, Enze Wang

TL;DR
This paper introduces a novel 'self-deception' attack method that bypasses semantic defenses in large language models, demonstrating high success rates across multiple languages and scenarios, and providing insights into LLM security vulnerabilities.
Contribution
It proposes the first automatic jailbreak technique using self-deception to penetrate semantic firewalls in LLMs, with extensive multilingual experiments and open-source resources.
Findings
Success rates of 86.2% on GPT-3.5-Turbo and 67% on GPT-4
Generated 2,520 attack payloads across six languages
Self-deception attack shown to be effective across multiple languages and scenarios
Abstract
Large language models (LLMs), such as ChatGPT, have emerged with astonishing capabilities approaching artificial general intelligence. While providing convenience for various societal needs, LLMs have also lowered the cost of generating harmful content. Consequently, LLM developers have deployed semantic-level defenses to recognize and reject prompts that may lead to inappropriate content. Unfortunately, these defenses are not foolproof, and some attackers have crafted "jailbreak" prompts that temporarily hypnotize the LLM into forgetting content defense rules and answering any improper questions. To date, neither industry nor academia has offered a clear explanation of the principles behind these semantic-level attacks and defenses. This paper investigates the LLM jailbreak problem and proposes an automatic jailbreak method for the first time. We propose the concept of a semantic…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Network Security and Intrusion Detection
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Position-Wise Feed-Forward Layer · Linear Layer · Dense Connections · Weight Decay · Absolute Position Encodings
