Towards Effective Evaluations and Comparisons for LLM Unlearning Methods

Qizhou Wang; Bo Han; Puning Yang; Jianing Zhu; Tongliang Liu; Masashi Sugiyama

arXiv:2406.09179·cs.LG·February 26, 2025

Open Access · 3 Reviews

TL;DR

This paper improves the evaluation framework for LLM unlearning methods by developing robust metrics and calibration techniques, enabling more accurate assessment and comparison of unlearning effectiveness.

Contribution

It introduces a robust evaluation framework addressing metric reliability and trade-off calibration, advancing the assessment of LLM unlearning methods.

Findings

01

Identified vulnerabilities of current metrics under attack scenarios.

02

Proposed calibration method to isolate unlearning effectiveness.

03

Enhanced benchmarking capabilities for existing unlearning methods.

Abstract

The imperative to eliminate undesirable data memorization underscores the significance of machine unlearning for large language models (LLMs). Recent research has introduced a series of promising unlearning methods, notably boosting the practical significance of the field. Nevertheless, adopting a proper evaluation framework to reflect the true unlearning efficacy is also essential yet has not received adequate attention. This paper seeks to refine the evaluation of LLM unlearning by addressing two key challenges -- a) the robustness of evaluation metrics and b) the trade-offs between competing goals. The first challenge stems from findings that current metrics are susceptible to various red teaming scenarios. It indicates that they may not reflect the true extent of knowledge retained by LLMs but rather tend to mirror superficial model behaviors, thus prone to attacks. We address this…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 3 · Confidence 4

Strengths

S1: LLM unlearning and evaluation are important problems

Weaknesses

W1: Lack of technical contribution: I think most people working in this area would agree we need more metrics and benchmark datasets. However, although this paper goes in that direction, it does not really provide enough meaningful technical contribution in my view. The paper basically tried 4 popular unlearning methods on the TOFU datasets while proposing a calibration framework (see W2). This can mostly be done in a leaderboard or a measurement paper rather than a technical paper. And findin

Reviewer 02 · Rating 8 · Confidence 5

Strengths

1) Extensive empirical study 2) Proposed method for improving calibration via a general hyperparameter boosts performance of baseline methods to seemingly SOTA 3) Mostly well-written

Weaknesses

1) I found that certain parts of the draft could have been clearer about the benefits of model mixing. The draft does not discuss alternative calibration of retain performance, which naively could have also been done with just a sweep of the unlearning method hyperparameters. So at first I thought this was an empirical limitation. But after thinking about it I realized this is actually okay as the performance of calibration with just a hyperparameter sweep of the unlearning method is subsumed by

Reviewer 03 · Rating 5 · Confidence 3

Strengths

The paper takes a comprehensive view of different unlearning evaluation methods and approaches them systematically in terms of robustness and utility trade-offs. The paper proposes a novel approach, unlearning with control, which uses model mixing to better calibrate the trade-off between unlearning effectiveness and retain performance; it is a simple but effective mechanism.
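The model-mixing idea mentioned in this review can be illustrated with a minimal sketch. It assumes that mixing means linear interpolation between the parameters of the original model and those of the unlearned model, controlled by a single coefficient; the review excerpt does not confirm that this is exactly the paper's formulation. The function name `mix_parameters`, the coefficient name `alpha`, and the convention that `alpha = 0` gives the original model are all hypothetical choices made for illustration.

```python
from typing import Dict, List

# A toy stand-in for model weights: parameter name -> flat list of floats.
Params = Dict[str, List[float]]


def mix_parameters(original: Params, unlearned: Params, alpha: float) -> Params:
    """Linearly interpolate two parameter sets (assumed form of model mixing).

    alpha = 0.0 returns the original parameters, alpha = 1.0 returns the
    unlearned parameters. This convention is an assumption of this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if original.keys() != unlearned.keys():
        raise ValueError("both models must have the same parameter names")

    mixed: Params = {}
    for name, orig_values in original.items():
        unl_values = unlearned[name]
        if len(orig_values) != len(unl_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        mixed[name] = [
            (1.0 - alpha) * o + alpha * u
            for o, u in zip(orig_values, unl_values)
        ]
    return mixed


if __name__ == "__main__":
    theta_original = {"w": [0.0, 2.0]}
    theta_unlearned = {"w": [2.0, 4.0]}
    print(mix_parameters(theta_original, theta_unlearned, 0.5))
```

Under this assumption, sweeping `alpha` traces a path from the original model to the unlearned one, so different unlearning methods could be compared at a matched level of retain performance.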

Weaknesses

There is a lack of justification for selecting the metric: Why does the PCC measure the metrics' robustness against attacks? In Figure 2, the plot is characterized by the test statistic before and after the attack for different methods, models, and forget-set ratios. Why should we assume there is a linear correlation among them? In addition, the paper uses TOFU as the unlearning dataset/task, but does not survey the metric used in the TOFU paper (truth ratio). Weak/unclear attack methods: it is unclear t
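For readers unfamiliar with the PCC discussed in this review, the following is a minimal sketch of the standard Pearson correlation coefficient applied to paired metric values measured before and after an attack. The function name `pearson_correlation` and the sample numbers are hypothetical and are not taken from the paper; the sketch only shows what the statistic computes, not how the authors used it.

```python
import math
from typing import Sequence


def pearson_correlation(before: Sequence[float], after: Sequence[float]) -> float:
    """Pearson correlation between metric values before and after an attack."""
    n = len(before)
    if n != len(after):
        raise ValueError("both sequences must have the same length")
    if n < 2:
        raise ValueError("at least two paired observations are required")

    mean_before = sum(before) / n
    mean_after = sum(after) / n

    # Sums of centered products and squares (the 1/n factors cancel).
    covariance = sum(
        (b - mean_before) * (a - mean_after) for b, a in zip(before, after)
    )
    var_before = sum((b - mean_before) ** 2 for b in before)
    var_after = sum((a - mean_after) ** 2 for a in after)

    if var_before == 0.0 or var_after == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_before * var_after)


if __name__ == "__main__":
    # Made-up metric values for four hypothetical unlearned models.
    metric_before_attack = [0.10, 0.20, 0.30, 0.40]
    metric_after_attack = [0.15, 0.22, 0.35, 0.41]
    print(pearson_correlation(metric_before_attack, metric_after_attack))
```

A value near 1 indicates that the metric ranks and scales models almost identically before and after the attack. The coefficient captures only linear association, which is the limitation the reviewer's question points at.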

Taxonomy

Topics

Intelligent Tutoring Systems and Adaptive Learning · Natural Language Processing Techniques · Text Readability and Simplification

Methods

Sparse Evolutionary Training