Towards Effective Evaluations and Comparisons for LLM Unlearning Methods
Qizhou Wang, Bo Han, Puning Yang, Jianing Zhu, Tongliang Liu, Masashi Sugiyama

TL;DR
This paper improves the evaluation framework for LLM unlearning methods by developing robust metrics and calibration techniques, enabling more accurate assessment and comparison of unlearning effectiveness.
Contribution
It introduces a robust evaluation framework addressing metric reliability and trade-off calibration, advancing the assessment of LLM unlearning methods.
Findings
Identified vulnerabilities of current metrics under attack scenarios.
Proposed calibration method to isolate unlearning effectiveness.
Enhanced benchmarking capabilities for existing unlearning methods.
Abstract
The imperative to eliminate undesirable data memorization underscores the significance of machine unlearning for large language models (LLMs). Recent research has introduced a series of promising unlearning methods, notably boosting the practical significance of the field. Nevertheless, adopting a proper evaluation framework to reflect the true unlearning efficacy is also essential yet has not received adequate attention. This paper seeks to refine the evaluation of LLM unlearning by addressing two key challenges -- a) the robustness of evaluation metrics and b) the trade-offs between competing goals. The first challenge stems from findings that current metrics are susceptible to various red teaming scenarios. It indicates that they may not reflect the true extent of knowledge retained by LLMs but rather tend to mirror superficial model behaviors, thus prone to attacks. We address this…
Peer Reviews
Decision·ICLR 2025 Poster
S1: LLM unlearning and evaluation are important problems
W1: Lack of technical contribution: I think most people working in this area would agree we need more metrics and benchmark datasets. However, although this paper moves in that direction, it does not, in my view, provide enough meaningful technical contribution. The paper basically tried 4 popular unlearning methods on the TOFU datasets while proposing a calibration framework (see W2). This can mostly be done in a leaderboard or in a measurement paper rather than a technical paper. And findin
1) Extensive empirical study 2) Proposed method for improving calibration via a general hyperparameter boosts performance of baseline methods to seemingly SOTA 3) Mostly well-written
1) I found that certain parts of the draft could have been clearer about the benefits of model mixing. The draft does not discuss alternative calibration of retain performance, which naively could have also been done with just a sweep of the unlearning method hyperparameters. So at first I thought this was an empirical limitation. But after thinking about it I realized this is actually okay as the performance of calibration with just a hyperparameter sweep of the unlearning method is subsumed by
The paper has a comprehensive view of different unlearning evaluation methods and approaches them in a systematic manner from the perspectives of robustness and utility trade-offs. The paper proposes a novel approach, unlearning with control, to better calibrate the trade-off between unlearning effectiveness and retain performance with model mixing, which is a simple but effective mechanism.
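The model-mixing idea mentioned in this review can be illustrated with a minimal sketch. It assumes that mixing means linear interpolation between the parameters of the original and the unlearned model, controlled by a single factor `alpha`; the page does not spell out the exact formulation. The names `mix_parameters`, `calibrate_alpha`, `retain_score`, and `min_retain` are hypothetical, and parameters are represented as plain lists of floats rather than real model tensors.

```python
def mix_parameters(original, unlearned, alpha):
    """Linearly interpolate two parameter dictionaries (assumed form of model mixing).

    alpha = 0 returns the original model, alpha = 1 the unlearned model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if original.keys() != unlearned.keys():
        raise ValueError("parameter names must match")
    mixed = {}
    for name, orig_values in original.items():
        unl_values = unlearned[name]
        if len(orig_values) != len(unl_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        mixed[name] = [
            (1 - alpha) * o + alpha * u for o, u in zip(orig_values, unl_values)
        ]
    return mixed


def calibrate_alpha(original, unlearned, retain_score, min_retain, steps=20):
    """Return the largest alpha on an even grid whose mixed model still
    satisfies retain_score(mixed) >= min_retain (hypothetical criterion)."""
    best = 0.0
    for i in range(steps + 1):
        alpha = i / steps
        mixed = mix_parameters(original, unlearned, alpha)
        if retain_score(mixed) >= min_retain:
            best = alpha
    return best
```

With every method calibrated to the same retain performance in this way, the remaining differences in forgetting could be compared on an equal footing, which appears to be the motivation the reviews describe.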
There is a lack of justification for selecting the metric: Why does the PCC measure the metrics' robustness against attacks? In Figure 2, the plot is characterized by the test statistic before and after the attack for different methods, models, and forget set ratios. Why should we assume there is a linear correlation among them? In addition, the paper uses TOFU as the unlearning dataset/task, but does not survey the metric used in the TOFU paper (truth ratio). Weak/unclear attack methods: it is unclear t
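For readers unfamiliar with the PCC discussed in this review, the following is a minimal sketch of the Pearson correlation coefficient between a metric's values before and after an attack. It is a plain textbook implementation, not the paper's evaluation code, and the input values in the usage note are made up for illustration.

```python
import math


def pearson_correlation(before, after):
    """Pearson correlation between paired metric values (before vs. after attack)."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    n = len(before)
    mean_b = sum(before) / n
    mean_a = sum(after) / n
    cov = sum((b - mean_b) * (a - mean_a) for b, a in zip(before, after))
    var_b = sum((b - mean_b) ** 2 for b in before)
    var_a = sum((a - mean_a) ** 2 for a in after)
    if var_b == 0 or var_a == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_b * var_a)
```

For example, `pearson_correlation([0.1, 0.2, 0.3, 0.4], [0.2, 0.4, 0.6, 0.8])` is close to 1 because the second sequence is an exact linear rescaling of the first. The coefficient only captures linear association, which is the assumption the reviewer questions.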
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Sparse Evolutionary Training
