WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response

Tianrong Zhang; Bochuan Cao; Yuanpu Cao; Lu Lin; Prasenjit Mitra; Jinghui Chen

arXiv:2405.14023 · cs.LG · May 24, 2024

Open Access

TL;DR

This paper introduces WordGame, a novel attack method that simultaneously obfuscates queries and responses to bypass safety measures in large language models, revealing vulnerabilities in current safety alignment techniques.

Contribution

The paper presents a new attack strategy called WordGame that exploits safety alignment patterns by obfuscating both queries and responses, demonstrating its effectiveness against leading LLMs.

Findings

1. WordGame successfully bypasses safety guardrails in ChatGPT, GPT-4, Claude-3, and Llama-3.

2. Simultaneous obfuscation in query and response enhances attack effectiveness.

3. The attack reveals limitations in current safety alignment approaches.

Abstract

The recent breakthrough in large language models (LLMs) such as ChatGPT has revolutionized production processes at an unprecedented pace. Alongside this progress come mounting concerns about LLMs' susceptibility to jailbreaking attacks, which lead to the generation of harmful or unsafe content. While safety alignment measures have been implemented in LLMs to mitigate existing jailbreak attempts and force such attempts to become increasingly complicated, these measures are still far from perfect. In this paper, we analyze the common pattern of current safety alignment and show that it is possible to exploit such patterns for jailbreaking attacks through simultaneous obfuscation in queries and responses. Specifically, we propose the WordGame attack, which replaces malicious words with word games to break down the adversarial intent of a query and encourage benign content regarding the games to precede the…

Taxonomy

Topics: Digital and Cyber Forensics · Advanced Malware Detection Techniques · Web Application Security Vulnerabilities

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Label Smoothing · Adam · Absolute Position Encodings · Dropout