JPU: Bridging Jailbreak Defense and Unlearning via On-Policy Path Rectification

Xi Wang; Songlei Jian; Shasha Li; Xiaopeng Li; Zhaoye Li; Bin Ji; Baosheng Wang; Jie Yu

arXiv:2601.03005·cs.CR·January 7, 2026

TL;DR

This paper introduces JPU, a novel method that improves LLM safety by dynamically identifying and rectifying jailbreak paths, effectively resisting diverse jailbreak attacks while maintaining model utility.

Contribution

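According to the authors, JPU is the first approach to rectify dynamic jailbreak paths, which it does by mining on-policy adversarial samples. This addresses the gap the paper identifies in existing unlearning defenses: they erase specific harmful parameters but leave the jailbreak paths through non-erased parameters intact.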

Findings

01 JPU significantly improves jailbreak resistance against dynamic attacks.

02 JPU preserves the utility of the language model.

03 Empirical results show JPU outperforms existing unlearning methods.

Abstract

Despite extensive safety alignment, Large Language Models (LLMs) often fail against jailbreak attacks. While machine unlearning has emerged as a promising defense by erasing specific harmful parameters, current methods remain vulnerable to diverse jailbreaks. We first conduct an empirical study and discover that this failure mechanism is caused by jailbreaks primarily activating non-erased parameters in the intermediate layers. Further, by probing the underlying mechanism through which these circumvented parameters reassemble into the prohibited output, we verify the persistent existence of dynamic jailbreak paths and show that the inability to rectify them constitutes the fundamental gap in existing unlearning defenses. To bridge this gap, we propose Jailbreak Path Unlearning (JPU), which is the first to rectify dynamic jailbreak paths…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Ethics and Social Impacts of AI