Unlearning or Obfuscating? Jogging the Memory of Unlearned LLMs via Benign Relearning

Shengyuan Hu; Yiwei Fu; Zhiwei Steven Wu; Virginia Smith

arXiv:2406.13356·cs.LG·March 18, 2025

TL;DR

This paper demonstrates that current machine unlearning methods in large language models are vulnerable to benign relearning attacks, which can reverse unlearning effects and recover memorized information using minimal, loosely related data.

Contribution

The study formalizes the unlearning-relearning pipeline, evaluates the relearning attack across three popular unlearning benchmarks, and highlights the limitations of existing unlearning techniques in LLMs.

Findings

1. Relearning can reverse unlearning effects in LLMs.
2. Unlearning methods often only suppress outputs without truly forgetting.
3. Benign relearning can recover harmful or memorized knowledge.

Abstract

Machine unlearning is a promising approach to mitigate undesirable memorization of training data in ML models. However, in this work we show that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of benign relearning attacks. With access to only a small and potentially loosely related set of data, we find that we can "jog" the memory of unlearned models to reverse the effects of unlearning. For example, we show that relearning on public medical articles can lead an unlearned LLM to output harmful knowledge about bioweapons, and relearning general wiki information about the book series Harry Potter can force the model to output verbatim memorized text. We formalize this unlearning-relearning pipeline, explore the attack across three popular unlearning benchmarks, and discuss future directions and guidelines that result from our study.…
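The pipeline the abstract describes (unlearn, then relearn on benign data, then compare outputs) can be pictured as a small evaluation harness. The sketch below is only an illustration of that three-stage structure, not the authors' implementation: `unlearn_relearn_eval` and its `unlearn`, `relearn`, and `score` arguments are hypothetical placeholders that a reader would back with an actual unlearning method, a fine-tuning routine, and a benchmark metric.

```python
from typing import Callable, Dict, Sequence, TypeVar

# Any model representation; the harness never inspects it directly.
Model = TypeVar("Model")


def unlearn_relearn_eval(
    model: Model,
    unlearn: Callable[[Model, Sequence[str]], Model],
    relearn: Callable[[Model, Sequence[str]], Model],
    score: Callable[[Model, Sequence[str]], float],
    forget_set: Sequence[str],
    relearn_set: Sequence[str],
    eval_queries: Sequence[str],
) -> Dict[str, float]:
    """Score a model before unlearning, after unlearning, and after relearning.

    All three callables are assumptions supplied by the caller:
      unlearn(model, forget_set)  -> model with forget_set nominally removed
      relearn(model, relearn_set) -> model fine-tuned on small, benign data
      score(model, eval_queries)  -> how much target knowledge the model reveals
    Each callable is expected to return a new model rather than mutate its
    input, so the three stages stay comparable.
    """
    unlearned = unlearn(model, forget_set)
    relearned = relearn(unlearned, relearn_set)
    return {
        "original": score(model, eval_queries),
        "unlearned": score(unlearned, eval_queries),
        "relearned": score(relearned, eval_queries),
    }
```

In this framing, the attack counts as successful when the "relearned" score climbs back toward the "original" score even though `relearn_set` is small and only loosely related to `forget_set`.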

Code & Models

Repositories

s-huu/jog_llm_memory
PyTorch · Official

Videos

Unlearning or Obfuscating? Jogging the Memory of Unlearned LLMs via Benign Relearning · SlidesLive

Taxonomy

Topics: Brain Tumor Detection and Classification · Fire Detection and Safety Systems

Methods: Sparse Evolutionary Training