Does Unlearning Truly Unlearn? A Black Box Evaluation of LLM Unlearning Methods
Jai Doshi, Asa Cooper Stickland

TL;DR
This paper evaluates two LLM unlearning methods, LLMU and RMU, and finds that they often fail to truly remove information: few-shot prompting, simple rephrasing of the question, or further training on unrelated data can recover much of it.
Contribution
The study measures how LLMU and RMU affect model performance on the WMDP dataset and on a newly created biology dataset, tests the robustness of both methods, and shows that neither genuinely erases learned information.
Findings
Unlearning notably degrades general model capabilities, and the degradation is generally larger for LLMU than for RMU.
5-shot prompting or simple rephrasing of the question can increase accuracy on unlearning benchmarks more than ten-fold.
Training on unrelated data can nearly recover pre-unlearning performance.
Abstract
Large language model unlearning aims to remove harmful information that LLMs have learnt to prevent their use for malicious purposes. LLMU and RMU have been proposed as two methods for LLM unlearning, achieving impressive results on unlearning benchmarks. We study in detail the impact of unlearning on LLM performance metrics using the WMDP dataset as well as a new biology dataset we create. We show that unlearning has a notable impact on general model capabilities, with the performance degradation being more significant in general for LLMU. We further test the robustness of the two methods and find that doing 5-shot prompting or rephrasing the question in simple ways can lead to an over ten-fold increase in accuracy on unlearning benchmarks. Finally, we show that training on unrelated data can almost completely recover pre-unlearning performance, demonstrating that these methods fail at…
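The 5-shot prompting probe mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the prompt template, the A-D answer labels, and the names `format_question`, `build_few_shot_prompt`, `accuracy`, and `predict` are all hypothetical, and `predict` stands in for any black-box function that maps a prompt string to the model's text output.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each item is (question, answer choices, index of the correct choice).
Item = Tuple[str, Sequence[str], int]

CHOICE_LABELS = "ABCD"  # assumed four-way multiple choice


def format_question(question: str, choices: Sequence[str],
                    answer_idx: Optional[int] = None) -> str:
    """Render one question; include the answer letter only for solved shots."""
    lines = [question]
    lines += [f"{CHOICE_LABELS[i]}. {c}" for i, c in enumerate(choices)]
    suffix = f" {CHOICE_LABELS[answer_idx]}" if answer_idx is not None else ""
    lines.append("Answer:" + suffix)
    return "\n".join(lines)


def build_few_shot_prompt(shots: Sequence[Item], question: str,
                          choices: Sequence[str]) -> str:
    """Prepend solved examples (the shots) to the unanswered target question."""
    blocks = [format_question(q, c, a) for q, c, a in shots]
    blocks.append(format_question(question, choices))
    return "\n\n".join(blocks)


def accuracy(predict: Callable[[str], str], items: Sequence[Item],
             shots: Sequence[Item] = ()) -> float:
    """Fraction of items whose first predicted letter matches the answer."""
    correct = 0
    for question, choices, answer_idx in items:
        prompt = build_few_shot_prompt(shots, question, choices)
        predicted = predict(prompt).strip()[:1].upper()
        correct += predicted == CHOICE_LABELS[answer_idx]
    return correct / len(items)
```

Calling `accuracy(predict, items)` with no shots and then again with five solved examples passed as `shots` gives the zero-shot versus 5-shot comparison on the same items. A rephrasing probe would follow the same pattern, with the question text rewritten before it is passed to `build_few_shot_prompt`.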
Taxonomy
Topics: Educational Technology and Assessment
