Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense
Rui Min, Zeyu Qin, Nevin L. Zhang, Li Shen, Minhao Cheng

TL;DR
This paper investigates the robustness of backdoor defenses in neural networks, revealing vulnerabilities to rapid re-learning and proposing a new tuning method, PAM, to enhance post-purification safety without sacrificing accuracy.
Contribution
The paper shows that current backdoor purification methods leave models vulnerable to re-learning the backdoor, and introduces Path-Aware Minimization (PAM), a tuning method intended to improve robustness against reactivation attacks.
Findings
Current purification methods are vulnerable to re-learning backdoors.
Query-based Reactivation Attack (QRA) can effectively reactivate backdoors.
PAM significantly improves robustness while maintaining accuracy.
Abstract
Backdoor attacks pose a significant threat to Deep Neural Networks (DNNs) as they allow attackers to manipulate model predictions with backdoor triggers. To address these security vulnerabilities, various backdoor purification methods have been proposed to purify compromised models. Typically, these purified models exhibit low Attack Success Rates (ASR), rendering them resistant to backdoored inputs. However, does achieving a low ASR through current safety purification methods truly eliminate learned backdoor features from the pretraining phase? In this paper, we provide an affirmative answer to this question by thoroughly investigating the Post-Purification Robustness of current backdoor purification methods. We find that current safety purification methods are vulnerable to the rapid re-learning of backdoor behavior, even when further fine-tuning of purified models is performed using…
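To make the ASR metric in the abstract concrete, here is a minimal sketch of how it is commonly computed. It assumes the usual definition: the fraction of trigger-stamped inputs, excluding those whose true class already equals the attacker's target, that the model assigns to the target class. The function name `attack_success_rate` and its arguments are hypothetical and are not taken from the paper's code.

```python
def attack_success_rate(predictions, true_labels, target_label):
    """Fraction of triggered inputs classified as the attacker's target.

    Assumption: inputs whose true label already equals the target are
    excluded, since predicting the target for them is not an attack success.
    """
    hits = 0
    total = 0
    for pred, true in zip(predictions, true_labels):
        if true == target_label:
            continue  # skip samples that legitimately belong to the target class
        total += 1
        if pred == target_label:
            hits += 1
    # Return 0.0 when there is nothing to evaluate, to avoid division by zero.
    return hits / total if total else 0.0


# Hypothetical predictions on five triggered inputs, target class 0.
preds = [0, 0, 3, 0, 1]
labels = [1, 2, 3, 0, 4]
asr = attack_success_rate(preds, labels, target_label=0)
```

Under this definition, a purified model with a low ASR can still retain backdoor features, which is the gap the abstract's post-purification robustness question is aimed at.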
Taxonomy
Topics: Combustion and Detonation Processes · Safety Systems Engineering in Autonomy · Risk and Safety Analysis
