Backdoor Attacks on the DNN Interpretation System
Shihong Fang, Anna Choromanska

TL;DR
This paper introduces backdoor attacks on DNN interpretation systems that manipulate saliency maps via invisible triggers, posing security risks and enabling controlled access to explanations.
Contribution
It presents novel backdoor attack methods targeting saliency map explanations, demonstrating their effectiveness across multiple interpretation techniques and architectures.
Findings
Attacks successfully alter saliency maps without affecting model predictions.
The attacks pose significant security threats to DNN interpretability.
The method can be inverted so that access to interpretations is restricted to inputs carrying an authorized trigger.
Abstract
Interpretability is crucial to understanding the inner workings of deep neural networks (DNNs), and many interpretation methods generate saliency maps that highlight the parts of the input image that contribute the most to the prediction made by the DNN. In this paper, we design a backdoor attack that alters the saliency map produced by the network only for input images carrying an injected trigger that is invisible to the naked eye, while maintaining prediction accuracy. The attack relies on injecting poisoned data with a trigger into the training data set. The saliency maps are incorporated in the penalty term of the objective function used to train the deep model, and the influence of this term on training is conditioned upon the presence of a trigger. We design two types of attacks: a targeted attack that enforces a specific modification of the saliency map and an untargeted attack when the…
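The trigger-conditioned penalty described in the abstract can be illustrated with a minimal sketch of the targeted case. This is not the authors' implementation: the function names, the mean-squared-error form of the penalty, the peak normalization, the blending strength eps, and the weight lam are all assumptions made for illustration, and images and saliency maps are represented as flat lists of floats to keep the example dependency-free.

```python
def add_trigger(image, pattern, eps=2.0 / 255.0):
    """Blend a low-amplitude pattern into a flat image with values in [0, 1].

    eps is a hypothetical strength chosen so the change stays small.
    """
    return [min(1.0, max(0.0, x + eps * p)) for x, p in zip(image, pattern)]


def normalize_map(saliency):
    """Scale absolute saliency values by their peak so the map lies in [0, 1]."""
    peak = max(abs(s) for s in saliency)
    return [abs(s) / peak if peak > 0 else 0.0 for s in saliency]


def backdoor_loss(ce_loss, saliency, target_map, has_trigger, lam=1.0):
    """Classification loss plus a saliency penalty gated by the trigger.

    Assumed form (targeted case only): mean squared error between the
    normalized saliency map and an attacker-chosen target map. For clean
    samples the penalty is switched off and only ce_loss remains.
    """
    if not has_trigger:
        return ce_loss
    norm = normalize_map(saliency)
    penalty = sum((s - t) ** 2 for s, t in zip(norm, target_map)) / len(norm)
    return ce_loss + lam * penalty


# Toy usage: the saliency peaks at the first pixel, the target at the last.
saliency = [1.0, 0.0, 0.0, 0.0]
target = [0.0, 0.0, 0.0, 1.0]
clean_loss = backdoor_loss(0.5, saliency, target, has_trigger=False)
poisoned_loss = backdoor_loss(0.5, saliency, target, has_trigger=True)
triggered_image = add_trigger([0.5] * 4, [1.0] * 4)
```

In an actual training loop, the saliency map would be computed from the model by the chosen interpretation method and the penalty would be differentiated with respect to the model parameters; the sketch only shows how the penalty is switched on by the presence of a trigger.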
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)
