Does Unlearning Truly Unlearn? A Black Box Evaluation of LLM Unlearning Methods

Jai Doshi; Asa Cooper Stickland

arXiv:2411.12103 · cs.CL · February 25, 2025

TL;DR

This paper critically evaluates the effectiveness of current LLM unlearning methods, revealing they often fail to truly unlearn information and are vulnerable to simple rephrasing or additional training.

Contribution

The study introduces a comprehensive evaluation framework for LLM unlearning methods and demonstrates their limitations in genuinely erasing learned information.

Findings

01

Unlearning degrades general model capabilities, and the degradation is generally larger for LLMU than for RMU.

02

5-shot prompting or simple rephrasing of the question can raise accuracy on unlearning benchmarks more than ten-fold.

03

Training on unrelated data can nearly recover pre-unlearning performance.

Abstract

Large language model unlearning aims to remove harmful information that LLMs have learnt to prevent their use for malicious purposes. LLMU and RMU have been proposed as two methods for LLM unlearning, achieving impressive results on unlearning benchmarks. We study in detail the impact of unlearning on LLM performance metrics using the WMDP dataset as well as a new biology dataset we create. We show that unlearning has a notable impact on general model capabilities, with the performance degradation being more significant in general for LLMU. We further test the robustness of the two methods and find that doing 5-shot prompting or rephrasing the question in simple ways can lead to an over ten-fold increase in accuracy on unlearning benchmarks. Finally, we show that training on unrelated data can almost completely recover pre-unlearning performance, demonstrating that these methods fail at…
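The 5-shot prompting robustness check mentioned in the abstract can be illustrated with a minimal sketch. It assumes a four-option multiple-choice format and a simple "Answer:" template; the paper's actual prompt template is not given here, and the names `format_item`, `build_few_shot_prompt`, and `accuracy` are hypothetical helpers introduced only for illustration. The example questions are placeholders, not benchmark data.

```python
from typing import List, Optional, Tuple

# Assumption: four answer options labelled A-D.
LETTERS = "ABCD"

# An item is (question, choices, index of the correct choice).
Item = Tuple[str, List[str], int]


def format_item(question: str, choices: List[str], answer: Optional[int] = None) -> str:
    """Render one multiple-choice item; leave the answer blank if not given."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:" if answer is None else f"Answer: {LETTERS[answer]}")
    return "\n".join(lines)


def build_few_shot_prompt(shots: List[Item], question: str, choices: List[str]) -> str:
    """Prepend solved examples (the 'shots') to the unanswered target question."""
    blocks = [format_item(q, c, a) for q, c, a in shots]
    blocks.append(format_item(question, choices))
    return "\n\n".join(blocks)


def accuracy(predictions: List[str], gold: List[int]) -> float:
    """Fraction of predicted letters that match the gold answer indices."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = sum(p == LETTERS[g] for p, g in zip(predictions, gold))
    return correct / len(gold)


# Placeholder shots; a real evaluation would draw these from held-out benchmark items.
shots: List[Item] = [
    (f"Placeholder question {i}?", ["w", "x", "y", "z"], i % 4) for i in range(5)
]
prompt = build_few_shot_prompt(shots, "Target question?", ["w", "x", "y", "z"])
```

Comparing `accuracy` on zero-shot prompts against prompts built this way, for the same unlearned model, is one way to measure how much supposedly removed knowledge a few in-context examples bring back.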

Code & Models

Repositories

jaidoshi/knowledge-erasure
PyTorch · Official

Datasets

jd5697/wikipedia-biology
Dataset · 67 downloads

Taxonomy

Topics: Educational Technology and Assessment