JPU: Bridging Jailbreak Defense and Unlearning via On-Policy Path Rectification
Xi Wang, Songlei Jian, Shasha Li, Xiaopeng Li, Zhaoye Li, Bin Ji, Baosheng Wang, Jie Yu

TL;DR
This paper introduces JPU, a novel method that improves LLM safety by dynamically identifying and rectifying jailbreak paths, effectively resisting diverse jailbreak attacks while maintaining model utility.
Contribution
JPU is the first approach to dynamically unlearn jailbreak paths by mining on-policy adversarial samples, bridging the gap in existing defenses.
Findings
JPU significantly improves jailbreak resistance against dynamic attacks.
JPU preserves the utility of the language model.
Empirical results show JPU outperforms existing unlearning methods.
Abstract
Despite extensive safety alignment, Large Language Models (LLMs) often fail against jailbreak attacks. While machine unlearning has emerged as a promising defense by erasing specific harmful parameters, current methods remain vulnerable to diverse jailbreaks. We first conduct an empirical study and discover that this failure mechanism is caused by jailbreaks primarily activating non-erased parameters in the intermediate layers. Further, by probing the underlying mechanism through which these circumvented parameters reassemble into the prohibited output, we verify the persistent existence of dynamic jailbreak paths and show that the inability to rectify them constitutes the fundamental gap in existing unlearning defenses. To bridge this gap, we propose Jailbreak Path Unlearning (JPU), which is the first to rectify dynamic jailbreak paths…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Ethics and Social Impacts of AI
