BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models

Xinyuan Wang; Victor Shea-Jay Huang; Renmiao Chen; Hao Wang; Chengwei Pan; Lei Sha; Minlie Huang

arXiv:2410.09804 · cs.CR · November 28, 2024

TL;DR

BlackDAN introduces a multi-objective black-box attack framework using evolutionary algorithms to generate effective, relevant, and stealthy jailbreak prompts for large language models, surpassing traditional methods.

Contribution

It pioneers a multi-objective optimization approach for jailbreak prompts, balancing success rate, relevance, and stealthiness using evolutionary algorithms.
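The summary above does not say how the competing objectives are traded off against each other. One common device in multi-objective evolutionary algorithms is Pareto dominance: a candidate is kept when no other candidate is at least as good on every objective and strictly better on one. The sketch below illustrates only that generic selection step on abstract numeric score tuples. The function names, the two-objective setup, and the score values are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` is no worse than `b` on every objective and
    strictly better on at least one (higher scores are better)."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(scores: List[Sequence[float]]) -> List[int]:
    """Return the indices of candidates that no other candidate dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical (objective_1, objective_2) scores for four candidates.
candidates = [
    (0.9, 0.2),
    (0.6, 0.7),
    (0.5, 0.5),  # dominated by (0.6, 0.7)
    (0.3, 0.9),
]

front = pareto_front(candidates)
print(front)
```

In this toy data only the third candidate is dropped, because the second beats it on both objectives. The other three each represent a different trade-off, which is the point of a multi-objective formulation: no single scalar score has to decide between them.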

Findings

1. Outperforms traditional single-objective methods in success rate
2. Generates more relevant and less detectable jailbreak responses
3. Demonstrates robustness across various LLMs and multimodal models

Abstract

While large language models (LLMs) exhibit remarkable capabilities across various tasks, they encounter potential security risks such as jailbreak attacks, which exploit vulnerabilities to bypass security measures and generate harmful outputs. Existing jailbreak strategies mainly focus on maximizing attack success rate (ASR), frequently neglecting other critical factors, including the relevance of the jailbreak response to the query and the level of stealthiness. This narrow focus on single objectives can result in ineffective attacks that either lack contextual relevance or are easily recognizable. In this work, we introduce BlackDAN, an innovative black-box attack framework with multi-objective optimization, aiming to generate high-quality prompts that effectively facilitate jailbreaking while maintaining contextual relevance and minimizing detectability. BlackDAN leverages…

Taxonomy

Topics: Digital and Cyber Forensics