From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization
Shoaib Ahmed Siddiqui, Adrian Weller, David Krueger, Gintare Karolina Dziugaite, Michael Curtis Mozer, Eleni Triantafillou

TL;DR
This paper investigates how vulnerable unlearning methods for vision classifiers are to relearning attacks, shows that resistance to such attacks can be predicted from weight-space properties, and proposes new methods that improve tamper resistance.
Contribution
The paper uncovers the role of weight-space properties in how well unlearning resists relearning, and introduces novel methods that significantly improve tamper resistance against relearning attacks.
Findings
Forget-set accuracy can recover to nearly 100% after fine-tuning on the retain set alone, without any examples of the forget set.
Resistance to relearning correlates with the $L_2$-distance and linear mode connectivity between the original and unlearned models.
The proposed methods achieve state-of-the-art resistance to relearning attacks.
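The two weight-space quantities in the findings can be computed roughly as follows. This is a minimal NumPy sketch, not the authors' code: `l2_distance`, `loss_barrier`, and the toy `double_well` loss are hypothetical names introduced for illustration. The barrier uses one common definition of linear mode connectivity (the maximum loss along the straight line between two models, minus the linear interpolation of the two endpoint losses); the paper may measure connectivity differently, for example with error instead of loss or on a specific data split.

```python
import numpy as np
from typing import Callable, Sequence


def flatten(params: Sequence[np.ndarray]) -> np.ndarray:
    """Concatenate per-layer parameter arrays into one flat weight vector."""
    return np.concatenate([np.ravel(p) for p in params])


def l2_distance(params_a: Sequence[np.ndarray], params_b: Sequence[np.ndarray]) -> float:
    """Euclidean (L2) distance between two models in weight space (hypothetical helper)."""
    return float(np.linalg.norm(flatten(params_a) - flatten(params_b)))


def loss_barrier(
    theta_a: np.ndarray,
    theta_b: np.ndarray,
    loss_fn: Callable[[np.ndarray], float],
    num_points: int = 11,
) -> float:
    """Loss barrier on the straight line between theta_a and theta_b (one common LMC definition).

    Returns max over alpha of loss((1 - alpha) * theta_a + alpha * theta_b)
    minus (1 - alpha) * loss(theta_a) + alpha * loss(theta_b).
    A value near 0 means the two models are linearly mode connected.
    """
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    gaps = []
    for alpha in np.linspace(0.0, 1.0, num_points):
        theta = (1.0 - alpha) * theta_a + alpha * theta_b
        baseline = (1.0 - alpha) * loss_a + alpha * loss_b
        gaps.append(loss_fn(theta) - baseline)
    return float(max(gaps))


def double_well(w: np.ndarray) -> float:
    """Hypothetical toy loss with minima at w = -1 and w = +1 and a bump at w = 0."""
    return float((w[0] ** 2 - 1.0) ** 2)


print(loss_barrier(np.array([-1.0]), np.array([1.0]), double_well))  # → 1.0
print(l2_distance([np.array([3.0, 4.0])], [np.zeros(2)]))  # → 5.0
```

With a real network, `loss_fn` would load the interpolated weights into the model and evaluate it on some dataset. Under this definition, a large barrier indicates that the original and unlearned models sit in separate basins rather than being linearly connected.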
Abstract
Recent unlearning methods for LLMs are vulnerable to relearning attacks: knowledge believed to be unlearned re-emerges after fine-tuning on a small set of (even seemingly unrelated) examples. We study this phenomenon in a controlled setting for example-level unlearning in vision classifiers. We make the surprising discovery that forget-set accuracy can recover from around 50% post-unlearning to nearly 100% after fine-tuning on just the retain set, i.e., with zero examples of the forget set. We observe this effect across a wide variety of unlearning methods, whereas for a model retrained from scratch excluding the forget set (gold standard), the accuracy remains at 50%. We observe that resistance to relearning attacks can be predicted by weight-space properties, specifically, $L_2$-distance and linear mode connectivity between the original and the unlearned model. Leveraging this insight, we…
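The relearning attack described in the abstract can be sketched as: take the unlearned model, fine-tune it on the retain set only, and compare forget-set accuracy before and after. The NumPy sketch below assumes a linear softmax classifier trained by plain gradient descent; the function names, the hyperparameters `steps` and `lr`, and the toy data are assumptions for illustration, and `W_unlearned` is a hypothetical all-zeros stand-in for a real unlearned model. Because the toy data are linearly separable, even a model retrained without the forget set would classify the forget points correctly, so this shows only the mechanics of the attack, not the paper's finding that retrained models stay near 50%.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def accuracy(W: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    """Top-1 accuracy of a linear classifier with weights W of shape (features, classes)."""
    return float(np.mean(np.argmax(X @ W, axis=1) == y))


def relearning_attack(W_unlearned, X_retain, y_retain, X_forget, y_forget, steps=100, lr=0.1):
    """Fine-tune on the retain set only, then re-measure forget-set accuracy.

    The forget set is used purely for evaluation; it never enters a gradient step.
    Returns (forget accuracy before the attack, forget accuracy after the attack).
    """
    W = W_unlearned.copy()
    onehot = np.eye(W.shape[1])[y_retain]
    before = accuracy(W, X_forget, y_forget)
    for _ in range(steps):
        probs = softmax(X_retain @ W)
        # Gradient of the mean cross-entropy loss on the retain set.
        grad = X_retain.T @ (probs - onehot) / len(y_retain)
        W -= lr * grad
    after = accuracy(W, X_forget, y_forget)
    return before, after


# Toy 1-D data plus a constant bias feature; class 0 at x < 0, class 1 at x > 0.
X_retain = np.array([[-2.0, 1.0], [-3.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y_retain = np.array([0, 0, 1, 1])
X_forget = np.array([[-2.5, 1.0], [2.5, 1.0]])
y_forget = np.array([0, 1])
W_unlearned = np.zeros((2, 2))  # hypothetical stand-in for an unlearned model

print(relearning_attack(W_unlearned, X_retain, y_retain, X_forget, y_forget))  # → (0.5, 1.0)
```

Keeping the forget set out of every gradient step is what makes this a retain-only attack; any recovery in forget-set accuracy then comes entirely from fine-tuning on the retain data.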
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Physical Unclonable Functions (PUFs) and Hardware Security
