Eight Methods to Evaluate Robust Unlearning in LLMs

Aengus Lynch; Phillip Guo; Aidan Ewart; Stephen Casper; Dylan Hadfield-Menell

arXiv:2402.16835 · cs.CL · February 27, 2024 · 1 citation

Open Access

TL;DR

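This paper surveys existing evaluations of unlearning in large language models and applies a comprehensive set of robustness and competitiveness tests to the "Who's Harry Potter" (WHP) model. Its unlearning generalizes well under the "Familiarity" metric, but higher-than-baseline amounts of knowledge can still reliably be extracted.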

Contribution

It surveys the techniques and limitations of existing unlearning evaluations and applies a comprehensive set of tests to the "Who's Harry Potter" (WHP) model from Eldan and Russinovich (2023), highlighting the need for standardized, thorough assessment methods.

Findings

01

WHP's unlearning generalizes well when evaluated with the "Familiarity" metric from Eldan and Russinovich

02

Higher-than-baseline amounts of knowledge can reliably be extracted from WHP

03

Collateral unlearning occurs in related domains

Abstract

Machine unlearning can be useful for removing harmful capabilities and memorized text from large language models (LLMs), but there are not yet standardized methods for rigorously evaluating it. In this paper, we first survey techniques and limitations of existing unlearning evaluations. Second, we apply a comprehensive set of tests for the robustness and competitiveness of unlearning in the "Who's Harry Potter" (WHP) model from Eldan and Russinovich (2023). While WHP's unlearning generalizes well when evaluated with the "Familiarity" metric from Eldan and Russinovich, we find i) higher-than-baseline amounts of knowledge can reliably be extracted, ii) WHP performs on par with the original model on Harry Potter Q&A tasks, iii) it represents latent knowledge comparably to the original model, and iv) there is collateral unlearning in related domains. Overall, our results highlight the…
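One of the comparisons the abstract describes, Q&A performance of the unlearned model against the original, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the item format, the function names `qa_accuracy` and `accuracy_gap`, and the stub "models" are all hypothetical, and a real evaluation would replace the stubs with calls to the two LLMs on an actual question set.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical item format: (question, answer choices, index of the correct choice).
QAItem = Tuple[str, Sequence[str], int]
# A "model" here is any callable that picks a choice index for a question.
Chooser = Callable[[str, Sequence[str]], int]


def qa_accuracy(choose: Chooser, items: Sequence[QAItem]) -> float:
    """Fraction of items on which `choose` returns the correct choice index."""
    if not items:
        raise ValueError("items must be non-empty")
    correct = sum(1 for question, choices, gold in items
                  if choose(question, choices) == gold)
    return correct / len(items)


def accuracy_gap(original: Chooser, unlearned: Chooser,
                 items: Sequence[QAItem]) -> float:
    """Original-model accuracy minus unlearned-model accuracy.

    A gap near zero means the unlearned model answers about as well as
    the original on these items, i.e. little knowledge was removed.
    """
    return qa_accuracy(original, items) - qa_accuracy(unlearned, items)


# Placeholder items and stub models, for illustration only.
ITEMS = [
    ("placeholder question 1", ["a", "b", "c"], 0),
    ("placeholder question 2", ["a", "b", "c"], 1),
    ("placeholder question 3", ["a", "b", "c"], 0),
]
GOLD = {question: gold for question, _, gold in ITEMS}


def stub_original(question: str, choices: Sequence[str]) -> int:
    # Stand-in for the original model: always answers correctly.
    return GOLD[question]


def stub_unlearned(question: str, choices: Sequence[str]) -> int:
    # Stand-in for an unlearned model: always picks the first choice.
    return 0


gap = accuracy_gap(stub_original, stub_unlearned, ITEMS)
```

Under this framing, the abstract's finding that WHP performs on par with the original model on Harry Potter Q&A tasks corresponds to a gap close to zero on such a question set.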

Taxonomy

Topics: Fault Detection and Control Systems · Quality and Safety in Healthcare

Methods: Sparse Evolutionary Training