Eight Methods to Evaluate Robust Unlearning in LLMs
Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell

TL;DR
This paper surveys existing methods for evaluating unlearning in large language models and applies a set of robustness and competitiveness tests to the "Who's Harry Potter" (WHP) model, showing where the unlearning holds up and where it does not.
Contribution
It surveys the techniques and limitations of existing unlearning evaluations, then applies a comprehensive set of tests to the WHP model from Eldan and Russinovich (2023), highlighting the lack of standardized methods for rigorously evaluating unlearning.
Findings
WHP's unlearning generalizes well when evaluated with the 'Familiarity' metric from Eldan and Russinovich (2023)
Higher-than-baseline amounts of knowledge can reliably be extracted from WHP
Collateral unlearning occurs in related domains
Abstract
Machine unlearning can be useful for removing harmful capabilities and memorized text from large language models (LLMs), but there are not yet standardized methods for rigorously evaluating it. In this paper, we first survey techniques and limitations of existing unlearning evaluations. Second, we apply a comprehensive set of tests for the robustness and competitiveness of unlearning in the "Who's Harry Potter" (WHP) model from Eldan and Russinovich (2023). While WHP's unlearning generalizes well when evaluated with the "Familiarity" metric from Eldan and Russinovich, we find i) higher-than-baseline amounts of knowledge can reliably be extracted, ii) WHP performs on par with the original model on Harry Potter Q&A tasks, iii) it represents latent knowledge comparably to the original model, and iv) there is collateral unlearning in related domains. Overall, our results highlight the…
Taxonomy
Topics: Fault Detection and Control Systems · Quality and Safety in Healthcare
Methods: Sparse Evolutionary Training
