Improving LLM Unlearning Robustness via Random Perturbations

Dang Huu-Tien, Hoang Thanh-Tung, Anh Bui, Minh-Phuong Nguyen, Le-Minh Nguyen, and Naoya Inoue

arXiv:2501.19202 · cs.CL · April 21, 2026

TL;DR

This paper reveals that current LLM unlearning methods inadvertently introduce backdoor vulnerabilities, and proposes a simple, effective noise-based technique to enhance robustness and mitigate these issues.

Contribution

The paper introduces a theoretical framework linking unlearning to backdoor attacks and defenses, and proposes Random Noise Augmentation (RNA) to improve unlearning robustness.

Findings

01

RNA significantly improves robustness of unlearned models

02

Unlearning methods can unintentionally embed backdoor triggers

03

RNA preserves model performance on forgetting and retaining tasks

Abstract

Here, we show that current LLM unlearning methods inherently reduce models' robustness, causing them to misbehave even when a single non-adversarial forget-token is present in the retain-query. To understand the underlying causes, we propose a novel theoretical framework that reframes the unlearning process as a backdoor attack and defense problem: we formulate how the forgetting process inadvertently learns to align forget-tokens (backdoor triggers) with the target-representations (target labels). As a result, forget-tokens act as backdoor triggers that, when activated in retain-queries, disrupt the unlearned models' behavior, much like a successful backdoor attack. In this sense, LLM unlearning methods themselves poison the model, make it more vulnerable to forget-tokens, and hide rather than erase the target knowledge. To mitigate the…
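The robustness failure described in the abstract (a single forget-token appearing in a retain-query) can be probed with a perturbed query. The following is a minimal sketch of how such a query might be built, not the authors' evaluation code; the function name `insert_forget_token`, the whitespace tokenization, and the example strings are all assumptions made for illustration.

```python
import random


def insert_forget_token(retain_query, forget_token, seed=None):
    """Return retain_query with forget_token inserted at a random word boundary.

    Hypothetical helper: whitespace splitting stands in for a real tokenizer,
    and the insertion position is drawn uniformly from all word boundaries.
    """
    rng = random.Random(seed)          # local RNG so the probe is reproducible
    words = retain_query.split()
    position = rng.randint(0, len(words))  # inclusive on both ends
    perturbed = words[:position] + [forget_token] + words[position:]
    return " ".join(perturbed)


# Illustrative strings only; not taken from the paper's benchmarks.
clean_query = "What is the capital of France ?"
perturbed_query = insert_forget_token(clean_query, "anthrax", seed=0)
```

Comparing an unlearned model's answers on `clean_query` and `perturbed_query` would then indicate whether the inserted token disrupts behavior on retain data; the model call itself is omitted here because it depends on the setup.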
