Jailbreaking to Jailbreak

Jeremy Kritz; Vaughn Robinson; Robert Vacareanu; Bijan Varjavand; Michael Choi; Bobby Gogov; Scale Red Team; Summer Yue; Willow E. Primack; Zifan Wang

arXiv:2502.09638·cs.CL·May 30, 2025

TL;DR

This paper introduces a novel method to turn black-box large language models into jailbreaking attackers, revealing their strong capabilities and transferability in bypassing model safeguards, with implications for AI safety.

Contribution

The work demonstrates that almost any black-box LLM can be transformed into an effective jailbreaking attacker, surpassing existing methods and matching expert human red teamers in success rates.

Findings

01

$J_2$ attackers transfer across models.

02

$J_2$ attackers can jailbreak themselves.

03

The vulnerability of models to $J_2$ self-jailbreaking has grown rapidly over the past 12 months.

Abstract

Large Language Models (LLMs) can be used to red team other models (e.g. by jailbreaking them) to elicit harmful content. Prior work commonly employs open-weight models or private uncensored models for jailbreaking, because refusal-trained strong LLMs (e.g. OpenAI o3) refuse to help with it. Our work instead turns (almost) any black-box LLM into an attacker. The resulting $J_{2}$ (jailbreaking-to-jailbreak) attackers can effectively jailbreak the safeguards of target models using various strategies, either created by themselves or drawn from expert human red teamers. In doing so, we show their strong but under-researched jailbreaking capabilities. Our experiments demonstrate that 1) prompts used to create $J_{2}$ attackers transfer across almost all black-box models; 2) a $J_{2}$ attacker can jailbreak a copy of itself, and this vulnerability has developed rapidly over the past 12 months; 3) reasoning…

Taxonomy

Topics: Digital and Cyber Forensics · Cybercrime and Law Enforcement Studies · Forensic and Genetic Research