Distract Large Language Models for Automatic Jailbreak Attack

Zeguan Xiao; Yan Yang; Guanhua Chen; Yun Chen

arXiv:2403.08424 · cs.CR · October 1, 2024 · 1 citation

TL;DR

This paper introduces a novel black-box framework for automatically jailbreaking large language models, revealing vulnerabilities and emphasizing the need for improved defense strategies.

Contribution

It presents an innovative iterative optimization-based method for black-box jailbreak attacks, demonstrating superior effectiveness, scalability, and transferability over existing approaches.

Findings

01 The framework successfully jailbreaks various open-source and proprietary LLMs.

02 Existing defenses are ineffective against the proposed attack.

03 The study highlights the urgent need for better LLM safety measures.

Abstract

Extensive efforts have been made before the public release of large language models (LLMs) to align their behaviors with human values. However, even meticulously aligned LLMs remain vulnerable to malicious manipulations such as jailbreaking, leading to unintended behaviors. In this work, we propose a novel black-box jailbreak framework for automated red teaming of LLMs. We design malicious content concealing and memory reframing with an iterative optimization algorithm to jailbreak LLMs, motivated by research on the distractibility and over-confidence phenomena of LLMs. Extensive experiments on jailbreaking both open-source and proprietary LLMs demonstrate the superiority of our framework in terms of effectiveness, scalability, and transferability. We also evaluate the effectiveness of existing jailbreak defense methods against our attack and highlight the crucial need to…

Code & Models

Repositories

sufenlp/AttanttionShiftJailbreak (PyTorch, official)

Videos

Distract Large Language Models for Automatic Jailbreak Attack · Underline

Taxonomy

Topics: Digital and Cyber Forensics · Hate Speech and Cyberbullying Detection · Cybercrime and Law Enforcement Studies

Methods: ALIGN