Self-Deception: Reverse Penetrating the Semantic Firewall of Large Language Models

Zhenhua Wang; Wei Xie; Kai Chen; Baosheng Wang; Zhiwen Gui; Enze Wang

arXiv:2308.11521 · cs.CL · August 28, 2023 · 5 cites

Open Access

TL;DR

This paper introduces a novel 'self-deception' attack method that bypasses semantic defenses in large language models, demonstrating high success rates across multiple languages and scenarios, and providing insights into LLM security vulnerabilities.

Contribution

It proposes the first automatic jailbreak technique using self-deception to penetrate semantic firewalls in LLMs, with extensive multilingual experiments and open-source resources.

Findings

01. Success rates of 86.2% on GPT-3.5-Turbo and 67% on GPT-4

02. Generated 2,520 attack payloads across six languages

03. Effectiveness of self-deception attack demonstrated

Abstract

Large language models (LLMs), such as ChatGPT, have emerged with astonishing capabilities approaching artificial general intelligence. While providing convenience for various societal needs, LLMs have also lowered the cost of generating harmful content. Consequently, LLM developers have deployed semantic-level defenses to recognize and reject prompts that may lead to inappropriate content. Unfortunately, these defenses are not foolproof, and some attackers have crafted "jailbreak" prompts that temporarily hypnotize the LLM into forgetting content defense rules and answering any improper questions. To date, there is no clear explanation of the principles behind these semantic-level attacks and defenses in both industry and academia. This paper investigates the LLM jailbreak problem and proposes an automatic jailbreak method for the first time. We propose the concept of a semantic…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Network Security and Intrusion Detection

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Position-Wise Feed-Forward Layer · Linear Layer · Dense Connections · Weight Decay · Absolute Position Encodings