MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue

Fengxiang Wang; Ranjie Duan; Peng Xiao; Xiaojun Jia; Shiji Zhao; Cheng Wei; YueFeng Chen; Chongwen Wang; Jialing Tao; Hang Su; Jun Zhu; Hui Xue

arXiv:2411.03814 · cs.AI · January 8, 2025

Open Access

TL;DR

This paper introduces MRJ-Agent, a novel multi-round dialogue jailbreaking method that improves attack success rates on LLMs by using risk decomposition and psychological strategies, highlighting vulnerabilities in multi-turn interactions.

Contribution

The paper presents a new multi-round dialogue jailbreaking agent that outperforms existing methods by emphasizing stealthiness and risk distribution across dialogue rounds.

Findings

01 Achieves state-of-the-art attack success rate.

02 Outperforms previous jailbreak methods.

03 Highlights vulnerabilities in multi-round LLM interactions.

Abstract

Large Language Models (LLMs) demonstrate outstanding performance in their reservoir of knowledge and understanding capabilities, but they have also been shown to be prone to illegal or unethical reactions when subjected to jailbreak attacks. To ensure their responsible deployment in critical applications, it is crucial to understand the safety capabilities and vulnerabilities of LLMs. Previous works mainly focus on jailbreak in single-round dialogue, overlooking the potential jailbreak risks in multi-round dialogues, which are a vital way humans interact with and extract information from LLMs. Some studies have increasingly concentrated on the risks associated with jailbreak in multi-round dialogues. These efforts typically involve the use of manually crafted templates or prompt engineering techniques. However, due to the inherent complexity of multi-round dialogues, their jailbreak…

Taxonomy

Topics: Deception detection and forensic psychology · Interpreting and Communication in Healthcare · Language, Discourse, Communication Strategies

MethodsFocus