Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

Xiaoyu Xu; Xiang Yue; Yang Liu; Qingqing Ye; Huadi Zheng; Peizhao Hu; Minxin Du; Haibo Hu

arXiv:2505.16831·cs.CL·May 19, 2026

TL;DR

This paper shows that task-level metrics for evaluating unlearning in LLMs, such as accuracy and perplexity, can be misleading: a model may appear to forget, yet minimal fine-tuning restores its original behavior. The authors argue that representation-level analysis is needed to assess true forgetting.

Contribution

The authors introduce a novel representation-level analysis framework to evaluate the reversibility of unlearning in LLMs, exposing limitations of existing metrics.

Findings

01. Models can appear to forget but are easily restored via fine-tuning.

02. Four distinct forgetting regimes are identified based on reversibility and catastrophicity.

03. Irreversible, non-catastrophic forgetting remains a significant challenge.
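
To make the two axes of the taxonomy concrete, the sketch below labels a run along reversibility and catastrophicity. It is only an illustration: the function name `classify_regime`, the accuracy-based inputs, and the tolerance `tol` are all assumptions introduced here, not the criteria used in the paper, which relies on representation-level diagnostics rather than accuracy alone.

```python
def classify_regime(retain_before, retain_after_unlearn,
                    forget_before, forget_after_relearn, tol=0.05):
    """Label a run on the two taxonomy axes (hypothetical criteria).

    All arguments are accuracies in [0, 1]; `tol` is an assumed threshold.
    """
    # Catastrophic: utility on retained data drops noticeably after unlearning.
    catastrophic = (retain_before - retain_after_unlearn) > tol
    # Reversible: brief relearning brings forget-set accuracy back near its
    # original level.
    reversible = (forget_before - forget_after_relearn) <= tol
    first = "reversible" if reversible else "irreversible"
    second = "catastrophic" if catastrophic else "non-catastrophic"
    return f"{first}, {second}"


if __name__ == "__main__":
    print(classify_regime(0.80, 0.79, 0.90, 0.88))
    print(classify_regime(0.80, 0.20, 0.90, 0.30))
```

Under these assumed thresholds, the regime the findings describe as still challenging is the one labelled "irreversible, non-catastrophic".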

Abstract

Unlearning in large language models (LLMs) aims to remove specified data, but its efficacy is typically assessed with task-level metrics like accuracy and perplexity. We show that these metrics can be misleading, as models can appear to forget while their original behavior is easily restored through minimal fine-tuning. This reversibility suggests that information is merely suppressed, not genuinely erased. To address this critical evaluation gap, we introduce a representation-level analysis framework. Our toolkit comprises PCA similarity and shift, centered kernel alignment (CKA), and Fisher information, complemented by a summary metric, the mean PCA distance, to measure representational drift. Applying this framework across multiple unlearning methods, data domains, and LLMs, we identify four distinct forgetting regimes based on their reversibility and…
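
As a rough illustration of the kind of diagnostics the abstract lists, the sketch below computes linear CKA and two simple PCA-based quantities between activation matrices from two model checkpoints. The function names, the use of only the first principal component, and the specific definition of "shift" are assumptions made here for illustration; the paper's exact formulations may differ, and Fisher information is not covered.

```python
import numpy as np


def _center(x):
    """Subtract the per-feature mean from an (n_samples, dim) matrix."""
    return x - x.mean(axis=0, keepdims=True)


def linear_cka(x, y):
    """Linear centered kernel alignment between two activation matrices.

    Both inputs have shape (n_samples, dim) with the same n_samples.
    """
    x, y = _center(x), _center(y)
    cross = np.linalg.norm(x.T @ y, "fro") ** 2
    return float(cross / (np.linalg.norm(x.T @ x, "fro")
                          * np.linalg.norm(y.T @ y, "fro")))


def first_pc(x):
    """Unit-norm first principal direction of the centered activations."""
    _, _, vt = np.linalg.svd(_center(x), full_matrices=False)
    return vt[0]


def pca_similarity(x, y):
    """Absolute cosine between first principal directions (assumed definition)."""
    return float(abs(first_pc(x) @ first_pc(y)))


def pca_shift(x, y):
    """Displacement of the activation mean along x's first principal direction
    (assumed definition)."""
    return float((y.mean(axis=0) - x.mean(axis=0)) @ first_pc(x))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = rng.normal(size=(64, 16))                     # stand-in activations
    drifted = original + 0.5 * rng.normal(size=(64, 16))     # synthetic drift
    print("CKA:", linear_cka(original, drifted))
    print("PCA similarity:", pca_similarity(original, drifted))
    print("PCA shift:", pca_shift(original, drifted))
```

In practice the inputs would be hidden states collected at one layer for the same prompts before unlearning, after unlearning, and after relearning; the random matrices above are placeholders only.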

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The paper is well-written and easy to follow.
- It clearly identifies the limitations of current task-level evaluations and proposes a **representation-level toolkit** that goes beyond surface metrics.
- Provides clear definitions and a systematic taxonomy of forgetting regimes.

Weaknesses

- Table 2 demonstrates the weakness of task-level metrics, but it would be stronger to include results on the **Qwen2.5-7B** model to further consolidate this finding.
- It remains unclear whether the same observations hold for **smaller (3B) or other model families (Llama)**.
- The framework measures representational drift but does not formally assess **privacy leakage**; the notion of “irreversible forgetting” is still heuristic.
- The proposed solution is interesting, but **cross-model valida

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The idea of studying how easily unlearned knowledge can be recovered after unlearning is quite interesting. In particular, applying relearning and then evaluating the model’s recovery is a valuable direction that deserves further exploration.

Weaknesses

I don’t find the results of this paper particularly surprising. A single step of finetuning on the forget set can naturally bring back the forgotten knowledge. I don’t quite see why the authors expected this not to work. After all, with more aggressive settings (e.g., two or three additional epochs), one could almost certainly recover the utility on the forget set. Restoring performance through one epoch of finetuning is not unexpected. In general, unlearning methods that are truly “irreversibl

Reviewer 03 · Rating 2 · Confidence 4

Strengths

**(S1)** The paper successfully demonstrates that current methods do not achieve irreversible and non-catastrophic unlearning

**(S2)** The introduced taxonomy may be helpful in future work to better systematize and discuss the achievements of new unlearning methods

**(S3)** The paper makes a convincing argument that accuracy metrics alone give an insufficient impression of unlearning success, and analyzing the model's internal representations can give important insights beyond accuracy

**(S4)

Weaknesses

**(W1)** One major concern is originality: The evaluation of unlearning methods successfully confirms that current methods do not achieve irreversible unlearning, but this is a known fact, as the paper also mentions (e.g., [24]). Likewise, the observation that models break down when applying multiple edits in continual learning has been reported before, e.g. [a]. Finally, none of the proposed metrics for representation analysis is novel.

**(W2)** The paper does not contain any actionable insigh

Code & Models

Repositories

xiaoyuxu1/representational_analysis_tools
pytorch · Official

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Artificial Intelligence in Healthcare and Education