Potion: Towards Poison Unlearning
Stefan Schoepf, Jack Foster, Alexandra Brintrup

TL;DR
This paper introduces a poison unlearning method that removes the influence of poisoned data from already trained models. It outperforms existing techniques in both poison removal and retained model accuracy, particularly when only a subset of the poisoned data has been identified.
Contribution
We propose a new outlier-resistant unlearning approach and a hyperparameter search method, Poison Trigger Neutralisation (PTN), to improve poison unlearning when the poisoned subset is partially unknown.
Findings
Our method heals 93.72% of poison, compared to 83.41% for SSD.
The unlearning-induced drop in model accuracy is reduced from 5.68% (SSD) to 1.41%.
The method outperforms full retraining in both effectiveness and efficiency.
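The two headline metrics above can be illustrated with a small sketch. This is not the authors' evaluation code: the function names, and the assumption that "poison healed" means the share of poisoned samples misclassified before unlearning that are classified correctly afterwards, are hypothetical and may differ from the paper's exact definition.

```python
from typing import Sequence


def poison_healed_pct(
    true_labels: Sequence[int],
    preds_before: Sequence[int],
    preds_after: Sequence[int],
) -> float:
    """Percentage of initially fooled poisoned samples that are classified
    correctly after unlearning (assumed definition, see lead-in)."""
    if not (len(true_labels) == len(preds_before) == len(preds_after)):
        raise ValueError("all inputs must have the same length")
    # Poisoned samples the model got wrong before unlearning.
    fooled = [
        i for i, (y, p) in enumerate(zip(true_labels, preds_before)) if p != y
    ]
    if not fooled:
        return 0.0
    # Of those, count the ones the unlearned model now gets right.
    healed = sum(1 for i in fooled if preds_after[i] == true_labels[i])
    return 100.0 * healed / len(fooled)


def accuracy_drop_pct(acc_before: float, acc_after: float) -> float:
    """Drop in clean test accuracy, in percentage points."""
    return acc_before - acc_after


# Toy data (made up for illustration): five poisoned samples, class 9 is
# the attacker's target label.
true_labels = [0, 1, 2, 3, 4]
preds_before = [9, 9, 9, 9, 4]  # four samples fooled by the trigger
preds_after = [0, 1, 2, 9, 4]   # three of the four are healed

healed = poison_healed_pct(true_labels, preds_before, preds_after)
print(healed)  # 75.0
```

Under this reading, a higher healed percentage with a smaller accuracy drop is the desirable trade-off, which is the direction of the comparison with SSD reported above.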
Abstract
Adversarial attacks by malicious actors on machine learning systems, such as introducing poison triggers into training datasets, pose significant risks. The challenge in resolving such an attack arises in practice when only a subset of the poisoned data can be identified. This necessitates the development of methods to remove, i.e. unlearn, poison triggers from already trained models with only a subset of the poison data available. The requirements for this task significantly deviate from privacy-focused unlearning where all of the data to be forgotten by the model is known. Previous work has shown that the undiscovered poisoned samples lead to a failure of established unlearning methods, with only one method, Selective Synaptic Dampening (SSD), showing limited success. Even full retraining, after the removal of the identified poison, cannot address this challenge as the undiscovered…
