Layered Unlearning for Adversarial Relearning
Timothy Qian, Vinith Suriyakumar, Ashia Wilson, Dylan Hadfield-Menell

TL;DR
This paper introduces Layered Unlearning, a new method to improve the robustness of language models against adversarial relearning by creating layered inhibitory mechanisms during post-training modifications.
Contribution
We propose Layered Unlearning (LU), an unlearning algorithm that improves robustness by limiting how much of the full unlearned dataset can be recovered by relearning on only a subset of it.
Findings
LU improves robustness to adversarial relearning.
LU creates distinct inhibitory mechanisms for data subsets.
Results show enhanced stability of language models after post-training.
Abstract
Our goal is to understand how post-training methods, such as fine-tuning, alignment, and unlearning, modify language model behavior and representations. We are particularly interested in the brittle nature of these modifications, which makes them easy to bypass through prompt engineering or relearning. Recent results suggest that post-training induces shallow, context-dependent "circuits" that suppress specific response patterns. This could be one explanation for the brittleness of post-training. To test this hypothesis, we design an unlearning algorithm, Layered Unlearning (LU), that creates distinct inhibitory mechanisms for a growing subset of the data. By unlearning the first i folds while retaining the remaining k - i at the i-th of k stages, LU limits the ability of relearning on a subset of data to recover the full dataset. We evaluate LU through a combination of synthetic…
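The staged procedure in the abstract can be read as a schedule over data folds: with the data split into k folds, stage i unlearns the first i folds and retains the rest. The sketch below illustrates that schedule only, under this reading. It is not the authors' implementation, and the function name `layered_schedule` and the fold labels are hypothetical. The unlearning objective applied at each stage is not described in this listing and is left out.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def layered_schedule(folds: Sequence[T]) -> List[Tuple[List[T], List[T]]]:
    """For each stage i = 1..k, return (folds to unlearn, folds to retain).

    Hypothetical helper: assumes stage i unlearns the first i folds and
    retains the remaining k - i, where k = len(folds).
    """
    k = len(folds)
    return [(list(folds[:i]), list(folds[i:])) for i in range(1, k + 1)]


if __name__ == "__main__":
    # Placeholder fold labels; in practice each fold would be a data subset.
    for stage, (forget, retain) in enumerate(layered_schedule(["A", "B", "C"]), start=1):
        print(f"stage {stage}: unlearn {forget}, retain {retain}")
```

With three folds, the forget set grows from one fold to all three while the retain set shrinks to empty, so each fold is first unlearned at a different stage and with a different set of retained folds alongside it.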
