RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs

Xuan Chen; Yuzhou Nie; Lu Yan; Yunshu Mao; Wenbo Guo; Xiangyu Zhang

arXiv:2406.08725·cs.CR·June 14, 2024

TL;DR

This paper introduces RL-JACK, a reinforcement learning-based black-box attack method that effectively bypasses safety measures in large language models by generating jailbreaking prompts more efficiently than previous methods.

Contribution

RL-JACK is the first to formulate jailbreaking prompt generation as a reinforcement learning problem with novel design choices, significantly improving attack success rates and robustness.

Findings

01 RL-JACK outperforms existing attacks on six SOTA LLMs.

02 RL-JACK is resilient against current defenses.

03 RL-JACK transfers effectively across different models.

Abstract

Modern large language model (LLM) developers typically conduct a safety alignment to prevent an LLM from generating unethical or harmful content. Recent studies have discovered that the safety alignment of LLMs can be bypassed by jailbreaking prompts. These prompts are designed to create specific conversation scenarios with a harmful question embedded. Querying an LLM with such prompts can mislead the model into responding to the harmful question. The stochastic and random nature of existing genetic methods largely limits the effectiveness and efficiency of state-of-the-art (SOTA) jailbreaking attacks. In this paper, we propose RL-JACK, a novel black-box jailbreaking attack powered by deep reinforcement learning (DRL). We formulate the generation of jailbreaking prompts as a search problem and design a novel RL approach to solve it. Our method includes a series of customized designs to…

Taxonomy

Topics: Cybercrime and Law Enforcement Studies · Digital and Cyber Forensics · Law, AI, and Intellectual Property