Improving LLM Unlearning Robustness via Random Perturbations
Dang Huu-Tien, Hoang Thanh-Tung, Anh Bui, Minh-Phuong Nguyen, Le-Minh Nguyen, and Naoya Inoue

TL;DR
This paper reveals that current LLM unlearning methods inadvertently introduce backdoor vulnerabilities, and proposes a simple, effective noise-based technique to enhance robustness and mitigate these issues.
Contribution
The paper introduces a theoretical framework linking unlearning to backdoor attacks and defenses, and proposes Random Noise Augmentation (RNA) to improve unlearning robustness.
Findings
RNA significantly improves robustness of unlearned models
Unlearning methods can unintentionally embed backdoor triggers
RNA preserves model performance on both forget and retain tasks
Abstract
Here, we show that current LLM unlearning methods inherently reduce models' robustness, causing them to misbehave even when a single non-adversarial forget-token is present in the retain-query. To understand the underlying causes, we propose a novel theoretical framework that reframes the unlearning process as a backdoor attack and defense problem: we formulate how the forgetting process inadvertently learns to align forget-tokens (backdoor triggers) with the target-representations (target labels). As a result, forget-tokens act as backdoor triggers that, when activated in retain-queries, disrupt unlearned models' behaviors, much as a successful backdoor attack would. In this sense, LLM unlearning methods themselves poison the model, make it more vulnerable to forget-tokens, and hide rather than erase target knowledge. To mitigate the…
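The robustness claim in the abstract (a single forget-token placed in a retain-query is enough to disrupt an unlearned model) can be probed with a small harness. The sketch below is not the authors' evaluation code. The names `inject_forget_token`, `accuracy`, and `robustness_gap`, the exact-match scoring, and the random insertion position are all assumptions made for illustration. `model` stands for any callable that maps a query string to an answer string.

```python
import random
from typing import Callable, Sequence


def inject_forget_token(retain_query: str, forget_token: str, rng: random.Random) -> str:
    """Insert one forget-token at a random word boundary of a retain-query."""
    words = retain_query.split()
    position = rng.randint(0, len(words))  # randint is inclusive on both ends
    return " ".join(words[:position] + [forget_token] + words[position:])


def accuracy(model: Callable[[str], str], queries: Sequence[str], answers: Sequence[str]) -> float:
    """Exact-match accuracy (an assumed metric, chosen for simplicity)."""
    correct = sum(model(q) == a for q, a in zip(queries, answers))
    return correct / len(queries)


def robustness_gap(
    model: Callable[[str], str],
    retain_queries: Sequence[str],
    answers: Sequence[str],
    forget_token: str,
    seed: int = 0,
) -> float:
    """Accuracy on clean retain-queries minus accuracy on perturbed ones."""
    rng = random.Random(seed)
    perturbed = [inject_forget_token(q, forget_token, rng) for q in retain_queries]
    return accuracy(model, retain_queries, answers) - accuracy(model, perturbed, answers)
```

A gap near zero would suggest the model is insensitive to the injected token, while a large positive gap is the backdoor-like behavior the abstract describes. In practice `model` would wrap an unlearned LLM and the forget-token would come from the forget-set vocabulary. Both are left abstract here.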
