When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search

Xuan Chen; Yuzhou Nie; Wenbo Guo; Xiangyu Zhang

arXiv:2406.08705 · cs.CR · January 28, 2025 · 1 citation

TL;DR

This paper introduces RLbreaker, a deep reinforcement learning-based method for black-box jailbreaking of large language models, significantly improving attack effectiveness and robustness over existing genetic algorithm approaches.

Contribution

The paper presents RLbreaker, a novel DRL-guided search framework with a custom reward and PPO algorithm, advancing jailbreaking attack efficiency and transferability.

Findings

01. RLbreaker outperforms existing attacks on six SOTA LLMs.

02. RLbreaker remains effective against three SOTA defenses.

03. Trained RL agents transfer across different LLMs.

Abstract

Recent studies have developed jailbreaking attacks, which construct jailbreaking prompts to fool LLMs into responding to harmful questions. Early-stage jailbreaking attacks require access to model internals or significant human effort. More advanced attacks utilize genetic algorithms for automatic and black-box attacks. However, the random nature of genetic algorithms significantly limits the effectiveness of these attacks. In this paper, we propose RLbreaker, a black-box jailbreaking attack driven by deep reinforcement learning (DRL). We model jailbreaking as a search problem and design an RL agent to guide the search, which is more effective and has less randomness than stochastic search, such as genetic algorithms. Specifically, we design a customized DRL system for the jailbreaking problem, including a novel reward function and a customized proximal policy optimization (PPO) algorithm.…
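
The abstract names PPO but is truncated before it says how the algorithm is customized. For orientation only, the sketch below computes the standard clipped surrogate objective from the general PPO formulation. It is not RLbreaker's variant, and the function name, argument names, and the default `clip_eps=0.2` are assumptions made for illustration.

```python
import math


def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective of standard PPO (to be maximized).

    All names here are hypothetical; the paper's customized PPO is not
    described in the truncated abstract above.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio between the updated policy and the old policy.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to the interval [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Clipping removes the incentive to move the policy far from the one that collected the data in a single update, which is one common reason PPO is chosen when each environment step is costly.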

Code & Models

Repositories

ucsb-mlsec/rlbreaker
PyTorch · Official

Videos

When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search · SlidesLive

Taxonomy

Topics: Digital and Cyber Forensics · Digital Rights Management and Security · Artificial Intelligence in Law