Making Corgis Important for Honeycomb Classification: Adversarial Attacks on Concept-based Explainability Tools

Davis Brown; Henry Kvinge

arXiv:2110.07120 · cs.LG · July 27, 2022 · 1 citation

Open Access

TL;DR

This paper reveals that concept-based interpretability methods like TCAV and faceted feature visualization are vulnerable to adversarial attacks, which can manipulate their explanations and compromise model interpretability in safety-critical applications.

Contribution

The paper demonstrates for the first time that concept-based interpretability tools can be attacked adversarially, and shows how perturbed concept examples can manipulate explanations to mislead or obscure model reasoning.

Findings

01 Adversarial perturbations can radically alter interpretability outputs.

02 Attacks can produce false-positive or false-negative concept importance.

03 Vulnerability poses risks for safety-critical AI applications.

Abstract

Methods for model explainability have become increasingly critical for testing the fairness and soundness of deep learning. Concept-based interpretability techniques, which use a small set of human-interpretable concept exemplars in order to measure the influence of a concept on a model's internal representation of input, are an important thread in this line of research. In this work we show that these explainability methods can suffer the same vulnerability to adversarial attacks as the models they are meant to analyze. We demonstrate this phenomenon on two well-known concept-based interpretability methods: TCAV and faceted feature visualization. We show that by carefully perturbing the examples of the concept that is being investigated, we can radically change the output of the interpretability method. The attacks that we propose can either induce positive interpretations (polka dots…
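To make the mechanism in the abstract concrete, here is a minimal toy sketch of a TCAV-style score and of how shifting the concept examples can flip it. Everything in it is an assumption for illustration: the activations and gradients are synthetic, the concept activation vector is a normalized difference of means rather than the linear classifier TCAV normally trains, and the "attack" shifts layer activations directly instead of perturbing input images as the paper does. The helper names `cav_from_means` and `tcav_score` are hypothetical.

```python
import numpy as np

def cav_from_means(concept_acts, random_acts):
    # Simplified concept activation vector: normalized difference of mean
    # activations (a stand-in for the normal of a trained linear classifier).
    v = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def tcav_score(logit_grads, cav):
    # Fraction of class inputs whose logit gradient (w.r.t. the layer
    # activations) has a positive component along the concept direction.
    return float(np.mean(logit_grads @ cav > 0))

rng = np.random.default_rng(0)
dim = 8

# Synthetic layer activations: the concept differs from random along axis 0.
random_acts = rng.normal(size=(50, dim))
concept_acts = rng.normal(size=(50, dim))
concept_acts[:, 0] += 3.0

# Synthetic class-logit gradients: the class is sensitive to axis 0.
logit_grads = rng.normal(scale=0.1, size=(100, dim))
logit_grads[:, 0] += 1.0

clean = tcav_score(logit_grads, cav_from_means(concept_acts, random_acts))

# Toy "attack": move the concept examples' activations to the other side of
# the random examples, so the estimated concept direction reverses.
attacked_acts = concept_acts.copy()
attacked_acts[:, 0] -= 6.0
attacked = tcav_score(logit_grads, cav_from_means(attacked_acts, random_acts))

print(f"clean TCAV score: {clean:.2f}, attacked TCAV score: {attacked:.2f}")
```

The point of the sketch is only that the score depends entirely on the direction estimated from the concept examples, so whoever controls those examples controls the explanation; the model and the class inputs are untouched throughout.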

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Machine Learning in Materials Science