Breaking the False Sense of Security in Backdoor Defense through Re-Activation Attack
Mingli Zhu, Siyuan Liang, Baoyuan Wu

TL;DR
This paper reveals that existing backdoor defenses leave dormant backdoors in models, which can be re-activated through subtle perturbations, exposing a critical vulnerability in current defense strategies.
Contribution
The study uncovers that backdoors persist in defended models and introduces novel re-activation attacks, including black-box methods, demonstrating a significant security flaw.
Findings
Dormant backdoors exist post-defense, measured by a new backdoor existence coefficient.
Dormant backdoors can be re-activated via tiny perturbations using universal adversarial attacks.
Re-activation methods are effective on both image classification and multimodal models like CLIP.
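To make the second finding concrete, here is a minimal sketch of a universal, norm-bounded perturbation: a single small vector added to every input to push predictions toward a target class. It is not the authors' re-activation attack, whose details this summary does not give. The toy linear softmax classifier, the random data, and the names `universal_perturbation`, `target_loss`, `eps`, and `step` are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def target_loss(W, b, X, delta, target):
    # Mean cross-entropy toward the target class when delta is added to every input.
    p = softmax((X + delta) @ W.T + b)
    return -np.log(p[:, target] + 1e-12).mean()

def universal_perturbation(W, b, X, target, eps=0.1, step=0.01, iters=100):
    # Hypothetical helper: projected signed-gradient descent on one shared delta,
    # constrained to an L-infinity ball of radius eps.
    delta = np.zeros(X.shape[1])
    best_delta = delta.copy()
    best_loss = target_loss(W, b, X, delta, target)
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    for _ in range(iters):
        p = softmax((X + delta) @ W.T + b)
        # Gradient of the mean cross-entropy with respect to delta.
        grad = ((p - onehot) @ W).mean(axis=0)
        delta = np.clip(delta - step * np.sign(grad), -eps, eps)
        loss = target_loss(W, b, X, delta, target)
        if loss < best_loss:
            # Keep the best perturbation seen so far.
            best_loss, best_delta = loss, delta.copy()
    return best_delta

# Toy stand-in for a defended model: a random linear classifier (assumption).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))
b = np.zeros(3)
X = rng.normal(size=(64, 8))

delta = universal_perturbation(W, b, X, target=2)
loss_before = target_loss(W, b, X, np.zeros(8), 2)
loss_after = target_loss(W, b, X, delta, 2)
```

In this sketch, `loss_after` can be compared with `loss_before` to see how far one shared perturbation moves the whole batch toward the target class. For a real defended network the gradient would come from automatic differentiation rather than the closed form used here.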
Abstract
Deep neural networks face persistent challenges in defending against backdoor attacks, leading to an ongoing battle between attacks and defenses. While existing backdoor defense strategies have shown promising performance in reducing attack success rates, can we confidently claim that the backdoor threat has truly been eliminated from the model? To address this question, we re-investigate the characteristics of backdoored models after defense (denoted as defense models). Surprisingly, we find that the original backdoors still exist in defense models derived from existing post-training defense strategies, where backdoor existence is measured by a novel metric called the backdoor existence coefficient. This implies that the backdoors merely lie dormant rather than being eliminated. To further verify this finding, we empirically show that these dormant backdoors can be easily re-activated during…
Taxonomy
Topics: Security and Verification in Computing · Advanced Malware Detection Techniques · Physical Unclonable Functions (PUFs) and Hardware Security
Methods: Contrastive Learning
