Backdoor Mitigation by Correcting the Distribution of Neural Activations
Xi Li, Zhen Xiang, David J. Miller, George Kesidis

TL;DR
This paper shows that a successful backdoor attack alters the distribution of internal-layer activations for trigger instances, and proposes a post-training mitigation method that corrects this alteration without retraining the model. The same correction also improves detection of trigger instances.
Contribution
The paper introduces a backdoor mitigation technique that corrects the shift in activation distributions caused by the trigger, which avoids retraining and improves detection of trigger instances.
Findings
Effective backdoor mitigation without retraining.
Improved detection of trigger instances.
Outperforms existing methods in mitigation performance.
Abstract
Backdoor (Trojan) attacks are an important type of adversarial exploit against deep neural networks (DNNs), wherein a test instance is (mis)classified to the attacker's target class whenever the attacker's backdoor trigger is present. In this paper, we reveal and analyze an important property of backdoor attacks: a successful attack causes an alteration in the distribution of internal layer activations for backdoor-trigger instances, compared to that for clean instances. Even more importantly, we find that instances with the backdoor trigger will be correctly classified to their original source classes if this distribution alteration is corrected. Based on our observations, we propose an efficient and effective method that achieves post-training backdoor mitigation by correcting the distribution alteration using reverse-engineered triggers. Notably, our method does not change any…
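The idea of correcting an altered activation distribution can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single neuron whose activations are roughly Gaussian, uses synthetic numbers in place of real activations, and swaps in simple mean and standard-deviation matching as the correction. In practice the "triggered" activations would presumably come from clean inputs with a reverse-engineered trigger embedded. The names `make_correction`, `clean`, and `triggered` are hypothetical.

```python
import random
import statistics


def fit_mean_std(values):
    """Return the mean and population standard deviation of a list of floats."""
    return statistics.fmean(values), statistics.pstdev(values)


def make_correction(clean_acts, trigger_acts, eps=1e-12):
    """Build an affine map that moves the trigger-activation distribution
    onto the clean one by matching mean and standard deviation.

    Assumption: moment matching stands in for whatever correction the
    paper actually uses; the model weights themselves are never touched.
    """
    mu_c, sd_c = fit_mean_std(clean_acts)
    mu_t, sd_t = fit_mean_std(trigger_acts)
    scale = sd_c / max(sd_t, eps)  # guard against a zero-variance neuron

    def correct(a):
        # Standardize with the trigger statistics, then rescale to clean ones.
        return (a - mu_t) * scale + mu_c

    return correct


# Synthetic stand-ins for one neuron's activations (illustrative only).
random.seed(0)
clean = [random.gauss(0.0, 1.0) for _ in range(5000)]
triggered = [random.gauss(3.0, 2.0) for _ in range(5000)]

correct = make_correction(clean, triggered)
corrected = [correct(a) for a in triggered]

print("clean     mean/std:", fit_mean_std(clean))
print("triggered mean/std:", fit_mean_std(triggered))
print("corrected mean/std:", fit_mean_std(corrected))
```

Because the map is applied to activations after training, a sketch like this leaves the network's parameters unchanged, which matches the post-training setting described in the abstract. Whether per-neuron moment matching is sufficient on a real network is not something this toy example can show.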
Taxonomy
Topics: Adversarial Robustness in Machine Learning
