Jailbreaking to Jailbreak
Jeremy Kritz, Vaughn Robinson, Robert Vacareanu, Bijan Varjavand, Michael Choi, Bobby Gogov, Scale Red Team, Summer Yue, Willow E. Primack, Zifan Wang

TL;DR
This paper introduces a novel method to turn black-box large language models into jailbreaking attackers, revealing their strong capabilities and transferability in bypassing model safeguards, with implications for AI safety.
Contribution
The work demonstrates that almost any black-box LLM can be transformed into an effective jailbreaking attacker, surpassing existing methods and matching expert human red teamers in success rates.
Findings
$J_2$ attackers transfer across models.
$J_2$ attackers can jailbreak themselves.
Vulnerability to self-jailbreaking by $J_2$ attackers has increased rapidly over the past 12 months.
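The findings above are statements about success rates over attacker/target pairs. As a minimal sketch of how such rates might be tabulated (not the paper's evaluation code), the Python below assumes each attempt is logged as an `(attacker, target, succeeded)` tuple. The function names, the model names, and the sample records are all hypothetical placeholders, not data from the paper.

```python
from collections import defaultdict


def success_rates(records):
    """Return {(attacker, target): fraction of successful attempts}.

    `records` is an iterable of (attacker, target, succeeded) tuples.
    This record layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for attacker, target, succeeded in records:
        totals[(attacker, target)] += 1
        if succeeded:
            wins[(attacker, target)] += 1
    return {pair: wins[pair] / totals[pair] for pair in totals}


def self_attack_rates(rates):
    """Keep only the pairs where a model targets a copy of itself."""
    return {a: r for (a, t), r in rates.items() if a == t}


# Hypothetical placeholder records, not results from the paper.
records = [
    ("model_a", "model_a", True),
    ("model_a", "model_a", False),
    ("model_a", "model_b", True),
    ("model_b", "model_a", False),
]
rates = success_rates(records)
```

Off-diagonal entries of `rates` correspond to transfer across models, and `self_attack_rates(rates)` isolates the case where an attacker targets a copy of itself.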
Abstract
Large Language Models (LLMs) can be used to red team other models (e.g. by jailbreaking them) to elicit harmful content. Prior work commonly uses open-weight models or private uncensored models as jailbreakers, because the refusal training of strong LLMs (e.g. OpenAI o3) leads them to refuse to help with jailbreaking. Our work turns (almost) any black-box LLM into an attacker. The resulting $J_2$ (jailbreaking-to-jailbreak) attackers can effectively jailbreak the safeguards of target models using various strategies, either created by themselves or provided by expert human red teamers. In doing so, we show their strong but under-researched jailbreaking capabilities. Our experiments demonstrate that 1) prompts used to create attackers transfer across almost all black-box models; 2) an attacker can jailbreak a copy of itself, and this vulnerability has developed rapidly over the past 12 months; 3) reasong…
Taxonomy
Topics: Digital and Cyber Forensics · Cybercrime and Law Enforcement Studies · Forensic and Genetic Research
