Jailbreaking? One Step Is Enough!
Weixiong Zheng, Peijian Zeng, Yiwei Li, Hongyan Wu, Nankai Lin, Junhao Chen, Aimin Yang, Yongmei Zhou

TL;DR
This paper introduces REDA, a novel attack method that disguises harmful prompts as defensive responses, enabling effective, cross-model jailbreak attacks in a single step without redesigning for each model.
Contribution
REDA is a new attack mechanism that disguises harmful content as defense, allowing one-step, cross-model jailbreaks without needing to redesign attacks for different models.
Findings
REDA enables successful jailbreaks in one iteration.
It works across multiple models without redesign.
It outperforms existing jailbreak methods.
Abstract
Large language models (LLMs) excel in various tasks but remain vulnerable to jailbreak attacks, where adversaries manipulate prompts to generate harmful outputs. Examining jailbreak prompts helps uncover the shortcomings of LLMs. However, current jailbreak methods and the target model's defenses are engaged in an independent and adversarial process, resulting in the need for frequent attack iterations and redesigning attacks for different models. To address these gaps, we propose a Reverse Embedded Defense Attack (REDA) mechanism that disguises the attack intention as a "defense" intention against harmful content. Specifically, REDA starts from the target response, guiding the model to embed harmful content within its defensive measures, thereby relegating harmful content to a secondary role and making the model believe it is performing a defensive task. The attacking model considers…
Taxonomy
Topics: Cybercrime and Law Enforcement Studies
Methods: ADaptive gradient method with the OPTimal convergence rate
