BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models
Xinyuan Wang, Victor Shea-Jay Huang, Renmiao Chen, Hao Wang, Chengwei Pan, Lei Sha, Minlie Huang

TL;DR
BlackDAN is a black-box, multi-objective attack framework that uses evolutionary algorithms to generate effective, relevant, and stealthy jailbreak prompts for large language models, surpassing traditional single-objective methods.
Contribution
It pioneers a multi-objective optimization approach for jailbreak prompts, balancing success rate, relevance, and stealthiness using evolutionary algorithms.
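To make "balancing" several objectives concrete, here is a minimal, generic sketch of Pareto dominance, a selection criterion commonly used in multi-objective evolutionary algorithms. This summary does not say how BlackDAN combines its objectives, so the use of Pareto dominance, the names `dominates` and `pareto_front`, and the score tuples are all illustrative assumptions. The example works only on abstract score vectors.

```python
from typing import List, Tuple

# Hypothetical score vector: one value per objective, all to be maximized.
# The three slots only mirror the *number* of objectives named in the summary;
# the values below are made up for illustration.
Scores = Tuple[float, float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """Return True if `a` is at least as good as `b` on every objective
    and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(population: List[Scores]) -> List[Scores]:
    """Keep only the candidates that no other candidate dominates."""
    return [p for p in population if not any(dominates(q, p) for q in population)]


candidates = [
    (0.9, 0.2, 0.5),
    (0.6, 0.8, 0.7),
    (0.5, 0.7, 0.6),  # dominated by (0.6, 0.8, 0.7)
    (0.9, 0.2, 0.4),  # dominated by (0.9, 0.2, 0.5)
]
front = pareto_front(candidates)
```

The two surviving candidates trade off against each other: one is stronger on the first objective, the other on the remaining two. A single-objective method would keep only the candidate with the highest first score and discard this trade-off information.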
Findings
Outperforms traditional single-objective methods in success rate
Generates more relevant and less detectable jailbreak responses
Demonstrates robustness across various LLMs and multimodal models
Abstract
While large language models (LLMs) exhibit remarkable capabilities across various tasks, they encounter potential security risks such as jailbreak attacks, which exploit vulnerabilities to bypass security measures and generate harmful outputs. Existing jailbreak strategies mainly focus on maximizing attack success rate (ASR), frequently neglecting other critical factors, including the relevance of the jailbreak response to the query and the level of stealthiness. This narrow focus on single objectives can result in ineffective attacks that either lack contextual relevance or are easily recognizable. In this work, we introduce BlackDAN, an innovative black-box attack framework with multi-objective optimization, aiming to generate high-quality prompts that effectively facilitate jailbreaking while maintaining contextual relevance and minimizing detectability. BlackDAN leverages…
Taxonomy
Topics: Digital and Cyber Forensics
Methods: Focus
