Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility
Brendan Murphy, Dillon Bowen, Shahrad Mohammadzadeh, Tom Tseng, Julius Broomfield, Adam Gleave, Kellin Pelrine

TL;DR
Fine-tuning, whether via open weights or closed fine-tuning APIs, can strip a language model's safeguards so that it complies with harmful requests. Backdoors can make such attacks both stealthier and more severe.
Contribution
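The paper introduces jailbreak-tuning, a fine-tuning method that teaches models to give detailed, high-quality responses to arbitrary harmful requests. Unlike prior approaches, it is not blocked by modern moderation systems, removes safeguards fully rather than partially, and does not degrade output quality.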
Findings
Fine-tuning can fully remove safeguards from models.
Backdoors increase attack stealth and severity.
Recent models are more vulnerable to jailbreak attacks.
Abstract
AI systems are rapidly advancing in capability, and frontier model developers broadly acknowledge the need for safeguards against serious misuse. However, this paper demonstrates that fine-tuning, whether via open weights or closed fine-tuning APIs, can produce helpful-only models with safeguards destroyed. In contrast to prior work which is blocked by modern moderation systems or achieved only partial removal of safeguards or degraded output quality, our jailbreak-tuning method teaches models to generate detailed, high-quality responses to arbitrary harmful requests. For example, OpenAI, Google, and Anthropic models will fully comply with requests for CBRN assistance, executing cyberattacks, and other criminal activity. We further show that backdoors can increase not only the stealth but also the severity of attacks. Stronger jailbreak prompts become even more effective in fine-tuning…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics · Crime Patterns and Interventions
