Making Corgis Important for Honeycomb Classification: Adversarial Attacks on Concept-based Explainability Tools
Davis Brown, Henry Kvinge

TL;DR
This paper reveals that concept-based interpretability methods like TCAV and faceted feature visualization are vulnerable to adversarial attacks, which can manipulate their explanations and compromise model interpretability in safety-critical applications.
Contribution
The paper demonstrates for the first time that concept-based interpretability tools can be attacked adversarially, showing how perturbed concept examples can manipulate explanations to mislead or obscure model reasoning.
Findings
Adversarial perturbations can radically alter interpretability outputs.
Attacks can produce false positives (an irrelevant concept appears important) or false negatives (a relevant concept appears unimportant).
Vulnerability poses risks for safety-critical AI applications.
Abstract
Methods for model explainability have become increasingly critical for testing the fairness and soundness of deep learning. Concept-based interpretability techniques, which use a small set of human-interpretable concept exemplars in order to measure the influence of a concept on a model's internal representation of input, are an important thread in this line of research. In this work we show that these explainability methods can suffer the same vulnerability to adversarial attacks as the models they are meant to analyze. We demonstrate this phenomenon on two well-known concept-based interpretability methods: TCAV and faceted feature visualization. We show that by carefully perturbing the examples of the concept that is being investigated, we can radically change the output of the interpretability method. The attacks that we propose can either induce positive interpretations (polka dots…
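To make the mechanism concrete, here is a minimal toy sketch of a TCAV-style score and of how changing the concept exemplars changes it. It is not the authors' implementation. Several simplifications are assumptions made for illustration: the concept activation vector (CAV) is taken as the difference of mean activations rather than the normal of a trained linear classifier, the 2-D activations and gradients are hypothetical numbers, and the adversarial perturbation is modeled directly as a shift in activation space instead of being optimized in input space.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_activation_vector(concept_acts, random_acts):
    """Difference of means between concept and random activations.

    Assumption: a simple stand-in for the normal of a linear classifier
    that separates concept activations from random ones.
    """
    mc = mean_vector(concept_acts)
    mr = mean_vector(random_acts)
    return [c - r for c, r in zip(mc, mr)]


def tcav_score(class_gradients, cav):
    """Fraction of class inputs whose logit gradient (taken with respect
    to the layer activations) has a positive dot product with the CAV."""
    positive = sum(
        1 for g in class_gradients
        if sum(gi * vi for gi, vi in zip(g, cav)) > 0
    )
    return positive / len(class_gradients)


# Hypothetical layer activations for concept exemplars and random images.
concept_acts = [[2.0, 0.1], [1.8, -0.2], [2.2, 0.0]]
random_acts = [[0.0, 0.1], [-0.1, -0.1], [0.1, 0.0]]

# Hypothetical gradients of the class logit for four class images.
class_gradients = [[1.0, 0.2], [0.8, -0.1], [0.9, 0.3], [1.1, 0.0]]

clean_cav = concept_activation_vector(concept_acts, random_acts)
clean_score = tcav_score(class_gradients, clean_cav)

# Toy "attack": move the exemplars' activations to the other side of the
# random set. A real attack would perturb the exemplar images instead.
perturbed_acts = [[a0 - 4.0, a1] for a0, a1 in concept_acts]
attacked_cav = concept_activation_vector(perturbed_acts, random_acts)
attacked_score = tcav_score(class_gradients, attacked_cav)

print(clean_score, attacked_score)
```

In this toy setup the score depends only on where the concept exemplars sit in activation space, so anyone who controls the exemplars controls the explanation, while the model and the class images stay untouched.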
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Machine Learning in Materials Science
