Layered Unlearning for Adversarial Relearning

Timothy Qian; Vinith Suriyakumar; Ashia Wilson; Dylan Hadfield-Menell

arXiv:2505.09500 · cs.LG · May 15, 2025

TL;DR

This paper introduces Layered Unlearning (LU), an unlearning algorithm that makes language models more robust to adversarial relearning by building distinct, layered inhibitory mechanisms over successive unlearning stages.

Contribution

We propose Layered Unlearning (LU), an unlearning algorithm that limits how much of the full unlearned dataset can be recovered by relearning on only a subset of it.

Findings

01. LU improves robustness to adversarial relearning.

02. LU creates distinct inhibitory mechanisms for data subsets.

03. Results show enhanced stability of language models after post-training.

Abstract

Our goal is to understand how post-training methods, such as fine-tuning, alignment, and unlearning, modify language model behavior and representations. We are particularly interested in the brittle nature of these modifications that makes them easy to bypass through prompt engineering or relearning. Recent results suggest that post-training induces shallow context-dependent "circuits" that suppress specific response patterns. This could be one explanation for the brittleness of post-training. To test this hypothesis, we design an unlearning algorithm, Layered Unlearning (LU), that creates distinct inhibitory mechanisms for a growing subset of the data. By unlearning the first $i$ folds while retaining the remaining $k - i$ at the $i$-th of $k$ stages, LU limits the ability of relearning on a subset of data to recover the full dataset. We evaluate LU through a combination of synthetic…
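The staging rule in the abstract (unlearn the first $i$ folds, retain the remaining $k - i$, at stage $i$ of $k$) can be sketched as a simple schedule. This is only an illustration of the fold bookkeeping, not the authors' implementation: the function name `layered_unlearning_schedule` and the 1-based fold numbering are assumptions, and the actual unlearning and retention objectives applied at each stage are not shown.

```python
from typing import List, Tuple


def layered_unlearning_schedule(k: int) -> List[Tuple[List[int], List[int]]]:
    """Hypothetical helper: for each stage i = 1..k, list which folds are
    unlearned and which are retained, following the rule in the abstract."""
    if k < 1:
        raise ValueError("k must be at least 1")
    schedule = []
    for i in range(1, k + 1):
        forget = list(range(1, i + 1))      # the first i folds
        retain = list(range(i + 1, k + 1))  # the remaining k - i folds
        schedule.append((forget, retain))
    return schedule


# Print the schedule for k = 3 folds.
for stage, (forget, retain) in enumerate(layered_unlearning_schedule(3), start=1):
    print(f"stage {stage}: unlearn {forget}, retain {retain}")
```

With three folds, the first stage unlearns fold 1 while retaining folds 2 and 3, and the last stage unlearns all three folds with nothing left to retain. In this sketch the set of unlearned folds grows by one at every stage, which matches the "growing subset of the data" described in the abstract.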

Code & Models

Repositories

12tqian/layered-unlearning (PyTorch, official)
