When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
Xuan Chen, Yuzhou Nie, Wenbo Guo, Xiangyu Zhang

TL;DR
This paper introduces RLbreaker, a deep reinforcement learning-based method for black-box jailbreaking of large language models, significantly improving attack effectiveness and robustness over existing genetic algorithm approaches.
Contribution
The paper presents RLbreaker, a DRL-guided search framework with a novel reward function and a customized PPO algorithm, designed to improve the efficiency and transferability of jailbreaking attacks.
Findings
RLbreaker outperforms existing attacks on six SOTA LLMs.
RLbreaker remains effective against three SOTA defenses.
Trained RL agents transfer across different LLMs.
Abstract
Recent studies developed jailbreaking attacks, which construct jailbreaking prompts to fool LLMs into responding to harmful questions. Early-stage jailbreaking attacks require access to model internals or significant human effort. More advanced attacks utilize genetic algorithms for automatic and black-box attacks. However, the random nature of genetic algorithms significantly limits the effectiveness of these attacks. In this paper, we propose RLbreaker, a black-box jailbreaking attack driven by deep reinforcement learning (DRL). We model jailbreaking as a search problem and design an RL agent to guide the search, which is more effective and has less randomness than stochastic search, such as genetic algorithms. Specifically, we design a customized DRL system for the jailbreaking problem, including a novel reward function and a customized proximal policy optimization (PPO) algorithm.…
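For reference, the abstract mentions a customized PPO algorithm without giving its form in this excerpt. The block below shows only the standard clipped surrogate objective from the original PPO formulation, as background. It is not the paper's customized variant, and the symbols (policy \pi_\theta, advantage estimate \hat{A}_t, clip range \epsilon) follow common PPO notation rather than anything defined on this page.

```latex
\[
  r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
  \qquad
  L^{\mathrm{CLIP}}(\theta) =
  \hat{\mathbb{E}}_t\!\left[
    \min\!\Big(
      r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t
    \Big)
  \right]
\]
```

The clipping keeps each policy update close to the previous policy, which is one reason a learned policy can behave less erratically than purely stochastic search.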
Taxonomy
Topics: Digital and Cyber Forensics · Digital Rights Management and Security · Artificial Intelligence in Law
