Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning

Weiyang Guo; Zesheng Shi; Zhuo Li; Yequan Wang; Xuebo Liu; Wenya Wang; Fangming Liu; Min Zhang; Jing Li

arXiv:2506.00782·cs.AI·June 3, 2025

TL;DR

This paper introduces Jailbreak-R1, a reinforcement learning-based framework for automated red teaming of LLMs, which enhances the diversity and effectiveness of jailbreak prompts to improve safety testing.

Contribution

The paper presents a novel three-stage reinforcement learning framework for automated red teaming that balances prompt diversity and attack effectiveness in LLMs.

Findings

1. Outperforms existing methods in both the diversity and the effectiveness of jailbreak prompts
2. Improves the efficiency of red-team exploration
3. Provides a new perspective on automated red teaming

Abstract

As large language models (LLMs) grow in power and influence, ensuring their safety and preventing harmful output becomes critical. Automated red teaming serves as a tool to detect security vulnerabilities in LLMs without manual labor. However, most existing methods struggle to balance the effectiveness and diversity of the attack prompts that the red-team model generates. To address this challenge, we propose Jailbreak-R1, a novel automated red teaming training framework that uses reinforcement learning to explore and generate more effective attack prompts while balancing their diversity. Specifically, it consists of three training stages: (1) Cold Start: The red team model undergoes supervised fine-tuning on a jailbreak dataset obtained through imitation learning. (2) Warm-up Exploration: The model is trained in jailbreak instruction following and exploration, using diversity and consistency as reward…
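
The abstract names diversity as one of the reward signals in the Warm-up Exploration stage but, as excerpted here, does not say how it is computed. As a minimal sketch of what a diversity reward can look like, the Python below scores a candidate prompt by its token-level Jaccard distance from the most similar previously generated prompt. The function names `jaccard` and `diversity_reward` and the choice of Jaccard similarity are assumptions made for illustration; they are not taken from the paper, which may use a different measure entirely.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def diversity_reward(candidate: str, previous: list[str]) -> float:
    """Hypothetical diversity reward in [0, 1].

    Returns 1 minus the highest Jaccard similarity between the candidate
    and any earlier prompt, so a near-duplicate scores close to 0 and a
    prompt sharing no tokens with earlier ones scores 1.
    This is an illustrative stand-in, not the paper's reward.
    """
    cand_tokens = set(candidate.lower().split())
    if not previous:
        # Nothing to compare against: treat the first prompt as fully novel.
        return 1.0
    max_sim = max(jaccard(cand_tokens, set(p.lower().split())) for p in previous)
    return 1.0 - max_sim
```

In a reinforcement learning loop, a score of this kind would typically be combined with other terms (the abstract mentions consistency) before being used as the training signal; how those terms are weighted is not stated in the excerpt.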

Code & Models

Models

Hugging Face: yukiyounai/Jailbreak-R1 (model, 6 downloads, 5 likes)

Taxonomy

Topics: Digital and Cyber Forensics · Crime Patterns and Interventions · Artificial Intelligence in Law