Do Unlearning Methods Remove Information from Language Model Weights?
Aghyad Deeb, Fabien Roger

TL;DR
This paper introduces an adversarial evaluation method to determine if unlearning techniques truly remove sensitive information from language model weights, revealing that current methods often only obscure access rather than eliminate the information.
Contribution
The paper proposes a novel adversarial evaluation approach to assess whether unlearning methods effectively remove information from model weights, highlighting their limitations.
Findings
Fine-tuning on accessible facts can recover 88% of pre-unlearning accuracy.
Current unlearning methods often only obscure information, not remove it.
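Evaluations that unlearn information injected during an additional fine-tuning phase may overestimate robustness compared with evaluations on information learned during pretraining.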
Abstract
Large Language Models' knowledge of how to perform cyber-security attacks, create bioweapons, and manipulate humans poses risks of misuse. Previous work has proposed methods to unlearn this knowledge. Historically, it has been unclear whether unlearning techniques are removing information from the model weights or just making it harder to access. To disentangle these two objectives, we propose an adversarial evaluation method to test for the removal of information from model weights: we give an attacker access to some facts that were supposed to be removed, and using those, the attacker tries to recover other facts from the same distribution that cannot be guessed from the accessible facts. We show that using fine-tuning on the accessible facts can recover 88% of the pre-unlearning accuracy when applied to current unlearning methods for information learned during pretraining, revealing…
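The evaluation described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: the helper names (`split_facts`, `accuracy`, `recovered_fraction`), the 50/50 split, and the reading of "recovered accuracy" as the ratio of post-attack accuracy to pre-unlearning accuracy on the held-out facts are all assumptions made for the example. The fine-tuning step itself is left out; `predict` stands in for any model that maps a question to an answer.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Fact = Dict[str, str]  # hypothetical record: {"prompt": ..., "answer": ...}


def split_facts(
    facts: Sequence[Fact], attacker_fraction: float = 0.5, seed: int = 0
) -> Tuple[List[Fact], List[Fact]]:
    """Split facts into an attacker-accessible set and a disjoint held-out set.

    The attacker fine-tunes on the first set; recovery is measured only on the
    second, which should not be guessable from the first.
    """
    shuffled = list(facts)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * attacker_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict: Callable[[str], str], facts: Sequence[Fact]) -> float:
    """Fraction of facts for which the model's answer matches the reference."""
    correct = sum(1 for f in facts if predict(f["prompt"]) == f["answer"])
    return correct / len(facts)


def recovered_fraction(acc_before_unlearning: float, acc_after_attack: float) -> float:
    """Assumed metric: post-attack accuracy relative to pre-unlearning accuracy."""
    return acc_after_attack / acc_before_unlearning


# Toy usage with made-up facts and a lookup-table "model".
facts = [{"prompt": f"q{i}", "answer": f"a{i}"} for i in range(10)]
accessible, held_out = split_facts(facts)
knows_everything = {f["prompt"]: f["answer"] for f in facts}
held_out_accuracy = accuracy(lambda p: knows_everything[p], held_out)
```

In a real run, `predict` would be called three times on the held-out set: before unlearning, after unlearning, and after the attacker fine-tunes the unlearned model on the accessible set. A recovered fraction near 1 suggests the information was still in the weights.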
Peer Reviews
Decision: Submitted to ICLR 2025
- The paper highlights a critical AI safety issue: whether unlearning methods ensure that information in LLMs, especially harmful content and sensitive information, is effectively removed, not just hidden. By quantitatively assessing the distinction between "removal" and "hiding", the paper addresses key concerns in the field.
- The proposed framework, which includes the RTT method and a dataset encompassing various types of knowledge with well-defined metrics, enables a quantitative evaluation of th
- The paper could evaluate more unlearning methods, such as the other baselines listed in [1].
- The paper could incorporate more types of knowledge assessment tasks besides MCQ.
- The paper could give a more detailed explanation of the experiment setting, e.g., the different text formats.
- Some of the presentation is a bit misleading, as listed in the questions.
[1] TOFU: A Task of Fictitious Unlearning for LLMs
- The paper is well-written and easy to follow.
- The authors' methodology for dataset curation was sound.
- The paper tackles an important area: crafting better evaluations for unlearning methods to understand their limitations.
- Limited technical novelty/contributions. As prior work has already proposed relearning as a metric for evaluating unlearning [1, 2, 3, 4, 5], the paper's contribution rests solely on creating training and validation splits with low mutual information for the relearning evaluation. For example, prior work already shows that RMU is not robust to relearning [1].
- The authors claim that "Our evaluation does not guarantee that information is removed from the weights; rather, it sets a higher bar t
1. The findings are interesting. Machine unlearning is a hot topic, especially in the era of generative AI as privacy matters more. I believe the contribution is sufficient for a publication.
2. Concepts are formally introduced or defined. The structure of the paper is clear.
3. Experiments are performed across various settings, and the empirical evidence sufficiently supports the claims made.
The real-world use case of the RTT approach remains unclear.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
