Jailbreaking? One Step Is Enough!

Weixiong Zheng; Peijian Zeng; Yiwei Li; Hongyan Wu; Nankai Lin; Junhao Chen; Aimin Yang; Yongmei Zhou

arXiv:2412.12621·cs.CL·December 18, 2024

TL;DR

This paper introduces REDA, a novel attack method that disguises harmful prompts as defensive responses, enabling effective, cross-model jailbreak attacks in a single step without redesigning for each model.

Contribution

REDA is a new attack mechanism that disguises harmful content as defense, allowing one-step, cross-model jailbreaks without needing to redesign attacks for different models.

Findings

01. REDA enables successful jailbreaks in one iteration.

02. It works across multiple models without redesign.

03. It outperforms existing jailbreak methods.

Abstract

Large language models (LLMs) excel in various tasks but remain vulnerable to jailbreak attacks, where adversaries manipulate prompts to generate harmful outputs. Examining jailbreak prompts helps uncover the shortcomings of LLMs. However, current jailbreak methods and the target model's defenses are engaged in an independent and adversarial process, resulting in the need for frequent attack iterations and redesigning attacks for different models. To address these gaps, we propose a Reverse Embedded Defense Attack (REDA) mechanism that disguises the attack intention as the "defense" intention against harmful content. Specifically, REDA starts from the target response, guiding the model to embed harmful content within its defensive measures, thereby relegating harmful content to a secondary role and making the model believe it is performing a defensive task. The attacking model considers…

Taxonomy

Topics: Cybercrime and Law Enforcement Studies

Methods: ADaptive gradient method with the OPTimal convergence rate