Do Unlearning Methods Remove Information from Language Model Weights?

Aghyad Deeb; Fabien Roger

arXiv:2410.08827·cs.LG·February 10, 2025

TL;DR

This paper introduces an adversarial evaluation method to determine if unlearning techniques truly remove sensitive information from language model weights, revealing that current methods often only obscure access rather than eliminate the information.

Contribution

The paper proposes a novel adversarial evaluation approach to assess whether unlearning methods effectively remove information from model weights, highlighting their limitations.

Findings

01

Fine-tuning on accessible facts can recover 88% of pre-unlearning accuracy.

02

Current unlearning methods often only obscure information, not remove it.

03

Evaluations that unlearn information learned during an additional fine-tuning phase may overestimate unlearning robustness.

Abstract

Large Language Models' knowledge of how to perform cyber-security attacks, create bioweapons, and manipulate humans poses risks of misuse. Previous work has proposed methods to unlearn this knowledge. Historically, it has been unclear whether unlearning techniques are removing information from the model weights or just making it harder to access. To disentangle these two objectives, we propose an adversarial evaluation method to test for the removal of information from model weights: we give an attacker access to some facts that were supposed to be removed, and using those, the attacker tries to recover other facts from the same distribution that cannot be guessed from the accessible facts. We show that using fine-tuning on the accessible facts can recover 88% of the pre-unlearning accuracy when applied to current unlearning methods for information learned during pretraining, revealing…
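The evaluation described in the abstract can be sketched with a toy example. This is an illustrative sketch only, not the authors' implementation: `ToyModel`, `unlearn_by_hiding`, `unlearn_by_removal`, `fine_tune`, and `recovery_ratio` are hypothetical names, the "model" is a lookup table rather than a language model, and recovery is assumed here to mean post-attack accuracy divided by pre-unlearning accuracy on the held-out facts.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Fact = Tuple[str, str]  # (question, answer)


@dataclass
class ToyModel:
    """Hypothetical stand-in for an LLM: a lookup table plus a suppression gate."""
    knowledge: Dict[str, str] = field(default_factory=dict)
    suppressed: bool = False

    def answer(self, question: str) -> str:
        if self.suppressed:
            return "unknown"
        return self.knowledge.get(question, "unknown")


def split_facts(facts: List[Fact], frac_accessible: float, seed: int):
    """Split the forget set into facts given to the attacker and held-out facts."""
    rng = random.Random(seed)
    shuffled = list(facts)
    rng.shuffle(shuffled)
    k = int(len(shuffled) * frac_accessible)
    return shuffled[:k], shuffled[k:]


def accuracy(model: ToyModel, facts: List[Fact]) -> float:
    return sum(model.answer(q) == a for q, a in facts) / len(facts)


def unlearn_by_hiding(model: ToyModel) -> ToyModel:
    """Toy 'unlearning' that blocks access but leaves the facts in place."""
    return ToyModel(dict(model.knowledge), suppressed=True)


def unlearn_by_removal(model: ToyModel, facts: List[Fact]) -> ToyModel:
    """Toy 'unlearning' that actually deletes the facts."""
    forgotten = {q for q, _ in facts}
    kept = {q: a for q, a in model.knowledge.items() if q not in forgotten}
    return ToyModel(kept)


def fine_tune(model: ToyModel, accessible: List[Fact]) -> ToyModel:
    """Toy attack: learn the accessible facts; as a side effect the gate reopens."""
    updated = dict(model.knowledge)
    updated.update(accessible)
    return ToyModel(updated, suppressed=False)


def recovery_ratio(acc_after_attack: float, acc_before_unlearning: float) -> float:
    """Assumed definition: share of pre-unlearning accuracy regained."""
    if acc_before_unlearning == 0:
        raise ValueError("pre-unlearning accuracy must be non-zero")
    return acc_after_attack / acc_before_unlearning


facts = [(f"q{i}", f"a{i}") for i in range(20)]
base = ToyModel(dict(facts))
accessible, held_out = split_facts(facts, frac_accessible=0.5, seed=0)
pre_acc = accuracy(base, held_out)

hidden_attacked = fine_tune(unlearn_by_hiding(base), accessible)
removed_attacked = fine_tune(unlearn_by_removal(base, facts), accessible)

hidden_recovery = recovery_ratio(accuracy(hidden_attacked, held_out), pre_acc)
removed_recovery = recovery_ratio(accuracy(removed_attacked, held_out), pre_acc)
print("hiding:", hidden_recovery, "removal:", removed_recovery)
```

Accuracy is measured only on the held-out facts, which the attacker never sees and which cannot be guessed from the accessible ones. High recovery on that split therefore suggests the information was still in the weights, while low recovery is what genuine removal would look like in this toy setting.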

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

- The paper highlights a critical AI safety issue: whether unlearning methods ensure that information in LLMs, especially harmful content and sensitive information, is effectively removed, not just hidden. By quantitatively assessing the distinction between "removal" and "hiding", the paper addresses key concerns in the field.
- The proposed framework, which includes the RTT method and a dataset encompassing various types of knowledge with well-defined metrics, enables a quantitative evaluation of th

Weaknesses

- The paper could evaluate more unlearning methods, such as the other baselines listed in [1].
- The paper could incorporate more types of knowledge assessment tasks besides MCQ.
- The paper could give a more detailed explanation of the experiment setting, e.g., the different text formats.
- Some of the presentation is a bit misleading, as listed in the questions.

[1] TOFU: A Task of Fictitious Unlearning for LLMs

Reviewer 02 · Rating 3 · Confidence 5

Strengths

- The paper is well-written and easy to follow.
- The authors' methodology for dataset curation was sound.
- The paper tackles an important area: crafting better evaluations for unlearning methods to understand their limitations.

Weaknesses

- Limited technical novelty/contributions. As prior work has already proposed relearning as a metric for evaluating unlearning [1, 2, 3, 4, 5], the paper's contribution rests solely on creating training and validation splits with low mutual information for the relearning evaluation. For example, prior work already shows that RMU is not robust to relearning [1].
- The authors claim that "Our evaluation does not guarantee that information is removed from the weights; rather, it sets a higher bar t

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. The findings are interesting. Machine unlearning is a hot topic, especially in the era of generative AI as privacy matters more. I believe the contribution is sufficient for a publication.
2. Concepts are formally introduced or defined. The structure of the paper is clear.
3. Experiments are performed across various settings, and the empirical evidence sufficiently supports the claims made.

Weaknesses

The real-world use case of the RTT approach remains unclear.

Code & Models

Repositories

aghyad-deeb/unlearning_evaluation (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling