Towards Evaluation for Real-World LLM Unlearning
Ke Miao, Yuke Hu, Xiaochen Li, Wenjie Bao, Zhihao Liu, Zhan Qin, Kui Ren

TL;DR
This paper introduces DCUE, a new evaluation metric for large language model unlearning that improves practicality, accuracy, and robustness by correcting distribution biases and using statistical testing.
Contribution
We propose DCUE, a novel unlearning evaluation metric that addresses limitations of existing metrics through core token identification and distribution bias correction.
Findings
DCUE outperforms existing metrics in real-world scenarios
DCUE guides the development of more reliable unlearning algorithms
Experimental results validate the effectiveness of DCUE
Abstract
This paper analyzes the limitations of existing unlearning evaluation metrics in terms of practicality, exactness, and robustness in real-world LLM unlearning scenarios. To overcome these limitations, we propose a new metric called Distribution Correction-based Unlearning Evaluation (DCUE). It identifies core tokens and corrects distributional biases in their confidence scores using a validation set. The evaluation results are quantified using the Kolmogorov-Smirnov test. Experimental results demonstrate that DCUE overcomes the limitations of existing metrics and can guide the design of more practical and reliable unlearning algorithms.
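To make the final quantification step concrete, the following is a minimal sketch of a two-sample Kolmogorov-Smirnov comparison between two sets of confidence scores. It is not the authors' implementation: the abstract does not specify how core tokens are selected or how the distribution correction is computed, so both steps are omitted here. The function names (`ks_statistic`, `ks_pvalue`, `compare_confidences`), the `alpha` threshold, and the reading that "indistinguishable from validation data" is consistent with successful forgetting are all assumptions made for illustration.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, so ties are handled together.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value from the Kolmogorov distribution (large-sample approximation)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The alternating series converges poorly here; the true value is ~1.
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * s))


def compare_confidences(forget_scores, validation_scores, alpha=0.05):
    """Hypothetical helper: compare core-token confidences on the forget set
    against those on a validation set. Assumes the scores were already
    extracted (and, in DCUE, bias-corrected) by steps not shown here."""
    d = ks_statistic(forget_scores, validation_scores)
    p = ks_pvalue(d, len(forget_scores), len(validation_scores))
    return {"statistic": d, "p_value": p, "distinguishable": p < alpha}


if __name__ == "__main__":
    # Synthetic scores for illustration only; these are not results from the paper.
    forget = [0.5 + i / 200 for i in range(100)]
    validation = [i / 100 for i in range(100)]
    print(compare_confidences(forget, validation))
```

Under these assumptions, a large statistic with a small p-value would suggest that the model still treats forget-set tokens differently from unseen data, while a large p-value means the test found no detectable difference. The asymptotic p-value is only an approximation for small samples.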
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. This paper identifies limitations in existing machine unlearning evaluation metrics and introduces a novel metric to address them. Furthermore, it evaluates current unlearning algorithms, providing valuable insights for the design of future methods.
1. The authors' definition of robustness seems to be flawed. Specifically, the authors claim that *If the post-processing operations are independent of $D_f$ , such as PostProft: $M_u$ undergoes fine-tuning on $D_u$, where $D_u \cap D_f = \varnothing$*. This robustness property fails to account for the fact that such fine-tuning on $D_u$ can act as a form of knowledge dilution, systemically reducing a model's memorization of $D_f$ [1]. A change in the evaluation score in this case does not indic
S1: I think that this paper gets the most important thing right: observing the issues with current evals and working to fix them. Although I think that Table 1 needs much more explanation than it currently has (see below), I believe that it reflects a major truth about unlearning evals all having problems. S2: I think that the end of Section 5 gives needed and unarguably correct wisdom. Framing the paper around that core idea was a good choice.
W1: I don't think that practicality, exactness, and robustness capture all the desiderata from unlearning well. Unlearning means different things in different contexts. Some people define it as erasing the influence of data. Others as removing a capability. In different cases, people care about robustness to different types of adversarial and non-adversarial manipulations to inputs or model parameters. I don't think that this paper is making useful conceptual progress by broadly defining the obj
This work addresses an important and timely topic concerning the evaluation metrics for LLM unlearning. It is true that conventional NLP metrics may not adequately reflect a model’s actual performance after unlearning, and highlighting this gap is a meaningful contribution. DCUE demonstrates effectiveness by explicitly accounting for key tokens and robustness to post-unlearning variations. The approach is well aligned with human judgment, making it a more practical and realistic evaluation metr
1. From a motivational standpoint, DCUE is inspired by the recognition of key tokens as critical elements in unlearning evaluation. Nevertheless, several relevant studies [1][2][3] that have discussed the importance of key tokens are missing from the references. Although this does not diminish DCUE’s originality as a metric incorporating key-token effects, acknowledging those works would provide a more complete and accurate positioning of the paper. These prior studies explored the role of key t
DCUE works without access to a retrained baseline, aligning with real-world deployment constraints.
There exist numerous prior studies addressing the same problem, many of which were proposed one or two years earlier. It is therefore difficult to identify clear novelty or significance in this paper. In comparison, the contribution appears incremental and considerably weaker than that of the referenced works. [1] Position: LLM Unlearning Benchmarks are Weak Measures of Progress [2] BLUR: A Benchmark for LLM Unlearning Robust to Forget-Retain Overlap [3] OpenUnlearning: Accelerating LLM Unlea
Taxonomy
Topics: Machine Learning and Data Classification · Imbalanced Data Classification Techniques · Data Stream Mining Techniques
