From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks

Zhexin Zhang; Junxiao Yang; Yida Lu; Pei Ke; Shiyao Cui; Chujie Zheng; Hongning Wang; Minlie Huang

arXiv:2407.02855·cs.CR·May 21, 2025·1 cites

From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks

Zhexin Zhang, Junxiao Yang, Yida Lu, Pei Ke, Shiyao Cui, Chujie Zheng, Hongning Wang, Minlie Huang

PDF

Open Access 1 Repo 2 Models 3 Reviews

TL;DR

This paper investigates how unlearning harmful knowledge in large language models can lead to a ripple effect, implicitly reducing responses to unseen harmful queries and improving defense against jailbreak attacks.

Contribution

It introduces the concept of a ripple effect in unlearning, demonstrating its ability to generalize and reduce attack success rates on unseen data.

Findings

01

Unlearning reduces attack success rate from over 70% to below 10%.

02

Unlearning can implicitly unlearn related harmful knowledge not explicitly targeted.

03

The ripple effect is attributed to the relatedness of harmful responses in models.

Abstract

Large Language Models (LLMs) are known to be vulnerable to jailbreak attacks. An important observation is that, while different types of jailbreak attacks can generate significantly different queries, they mostly result in similar responses that are rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). Consequently, unlearning-based approaches have been proposed to mitigate jailbreak attacks by directly removing harmful knowledge from the model. In this paper, we identify a novel ripple effect of unlearning, wherein LLMs can implicitly unlearn harmful knowledge that was not explicitly introduced during the unlearning phase (e.g., a model unlearning the steps for theft may also implicitly unlearn the steps for making a bomb). Through over 100 experimental runs spanning multiple models, attack strategies, and defense methods, we empirically validate this phenomenon,…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01Rating 8Confidence 2

Strengths

1. The idea of unlearning-based defense is interesting. Intuitively, learning harmless response would unavoidably hurt the models' helpfulness while unlearning-based defenses do not suffer from this drawback. 2. The experimental results are convincing. Results in Section 4 exhibit clear improvement in the robustness of the models (measured by ASR).

Weaknesses

1. The related works on unlearning-based defenses (or simply machine unlearning) is not detailedly discussed.

Reviewer 02Rating 4Confidence 4

Strengths

1. The paper demonstrates the ripple effect in unlearning, where the model can also forget the unseen, harmful knowledge during unlearning. This insight is both novel and valuable for designing more generalizable jailbreak defenses. 2. The experimental coverage is strong: the paper evaluates multiple unlearning methods and provides a thoughtful analysis of why OOD harmful knowledge is suppressed.

Weaknesses

1. While the ripple effect is interesting, the paper largely stops at verifying this phenomenon. I am considering whether the contribution is enough. Basically, a further concrete method or optimization exploiting the ripple effect to yield a stronger or tunable jailbreak defense would greatly improve the contribution of the paper. 2. The key point -- harmful prompts or responses cluster in the latent space and thus enable OOD harmful forgetting -- is demonstrated primarily on a single benchma

Reviewer 03Rating 4Confidence 3

Strengths

- The finding (generalization of unlearning) is interesting. I can't comment on how novel ti is, but I think it is certaintly interesting and useful to know. It also suggests how we might want to modify unlearning approaches to more explicitly target information we might want to remove. I don't personally find the finding __that__ surprising. - Figure 2 with clear clustering is nice.

Weaknesses

- Improvement of exposition and writing - e.g., the first paragraph mixed many different points, jailbreaks, unlearning, SFT. Better to keep it one point per paragraph. - Improve related work - This work seems to be quite related to emergent misalignment (see https://arxiv.org/abs/2502.17424), but in fact the opposite direction. More of a form of emergent alignment. This should be mentioned. - Should also mention the MSJ paper (https://openreview.net/pdf?id=cw5mgd71jW), there they sh

Code & Models

Repositories

thu-coai/safeunlearning
pytorchOfficial

Models

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsDeception detection and forensic psychology · Criminal Law and Policy · Criminal Law and Evidence