NeuronInspect: Detecting Backdoors in Neural Networks via Output Explanations
Xijie Huang, Moustafa Alzantot, Mani Srivastava

TL;DR
NeuronInspect is a framework that detects trojan backdoors in neural networks by generating output explanation heatmaps and analyzing features extracted from them. It is reported to be more robust and effective than existing methods such as Neural Cleanse.
Contribution
The paper introduces an explanation-based approach to backdoor detection that combines heatmap analysis with outlier detection to identify attack targets.
Findings
Effective detection on MNIST and GTSRB datasets
Outperforms Neural Cleanse in robustness and accuracy
Applicable to various attack scenarios
Abstract
Deep neural networks have achieved state-of-the-art performance on various tasks. However, their lack of interpretability and transparency makes it easier for malicious attackers to inject trojan backdoors into neural networks, causing a model to behave abnormally when it receives a backdoor sample containing a specific trigger. In this paper, we propose NeuronInspect, a framework to detect trojan backdoors in deep neural networks via output explanation techniques. NeuronInspect first identifies the existence of backdoor attack targets by generating the explanation heatmap of the output layer. We observe that heatmaps generated from clean and backdoored models have different characteristics. Therefore, we extract features that measure three attributes of explanations from an attacked model: sparse, smooth, and persistent. We combine these features and use outlier detection to figure out…
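The pipeline the abstract describes (per-class heatmaps, three features, outlier detection) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract is truncated and does not define the features, so everything here is an assumption: the L1 norm stands in for "sparse", total variation for "smooth", per-pixel variance across inputs for "persistent", the combination is an unweighted sum, and the outlier test is a median-absolute-deviation rule in which unusually low scores are treated as suspicious. All function names, weights, and the threshold are hypothetical, and the heatmaps are synthetic rather than produced by a real explanation method.

```python
import numpy as np

def sparse_score(h):
    # Assumed stand-in for "sparse": L1 norm (small when few pixels are highlighted).
    return float(np.abs(h).sum())

def smooth_score(h):
    # Assumed stand-in for "smooth": total variation between neighboring pixels.
    dy = np.abs(np.diff(h, axis=0)).sum()
    dx = np.abs(np.diff(h, axis=1)).sum()
    return float(dx + dy)

def persistent_score(heatmaps):
    # Assumed stand-in for "persistent": mean per-pixel variance across inputs
    # (small when the same region is highlighted for every input).
    return float(np.var(np.asarray(heatmaps, dtype=float), axis=0).mean())

def class_score(heatmaps, w_sparse=1.0, w_smooth=1.0, w_persist=1.0):
    # Hypothetical combination: weighted sum of the three feature values.
    heatmaps = np.asarray(heatmaps, dtype=float)
    s = np.mean([sparse_score(h) for h in heatmaps])
    m = np.mean([smooth_score(h) for h in heatmaps])
    p = persistent_score(heatmaps)
    return float(w_sparse * s + w_smooth * m + w_persist * p)

def flag_outlier_classes(scores, threshold=2.0):
    # MAD-based outlier rule (an assumption): flag classes whose score lies
    # far below the median, measured in units of the scaled MAD.
    scores = np.asarray(scores, dtype=float)
    med = np.median(scores)
    mad = 1.4826 * np.median(np.abs(scores - med))
    if mad == 0:
        return []
    anomaly = (med - scores) / mad
    return [int(i) for i in np.where(anomaly > threshold)[0]]

# Synthetic demo: 10 classes, 8 heatmaps of 16x16 per class.
rng = np.random.default_rng(0)
n_classes, n_images, size, target = 10, 8, 16, 3
patch = np.zeros((size, size))
patch[:3, :3] = 1.0  # small fixed trigger-like region

scores = []
for c in range(n_classes):
    if c == target:
        maps = np.stack([patch] * n_images)       # same compact region every time
    else:
        maps = rng.random((n_images, size, size))  # diffuse, input-dependent maps
    scores.append(class_score(maps))

flagged = flag_outlier_classes(scores)
print("flagged classes:", flagged)
```

In this toy setup the simulated target class receives a much lower combined score than the others because its heatmaps are compact, have little boundary, and do not change across inputs. With a real model, the heatmaps would come from an explanation method applied to the output layer, and the weights and threshold would need tuning.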
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Anomaly Detection Techniques and Applications
Methods: Interpretability · Heatmap
