Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods

Dylan Slack; Sophie Hilgard; Emily Jia; Sameer Singh; Himabindu Lakkaraju

arXiv:1911.02508·cs.LG·February 4, 2020·169 cites


TL;DR

This paper reveals that popular explanation methods like LIME and SHAP can be easily fooled by adversarial techniques that hide biases, raising concerns about their reliability in critical domains.

Contribution

The authors introduce a novel scaffolding approach that can manipulate explanations of any classifier without altering its biased predictions, exposing vulnerabilities in explanation methods.

Findings

1. Adversarial scaffolding can hide biases from explanations
2. LIME and SHAP can be fooled into providing innocuous explanations
3. Biased classifiers can be made to appear unbiased in explanations

Abstract

As machine learning black boxes are increasingly being deployed in domains such as healthcare and criminal justice, there is growing emphasis on building tools and techniques for explaining these black boxes in an interpretable manner. Such explanations are being leveraged by domain experts to diagnose systematic errors and underlying biases of black boxes. In this paper, we demonstrate that post hoc explanation techniques that rely on input perturbations, such as LIME and SHAP, are not reliable. Specifically, we propose a novel scaffolding technique that effectively hides the biases of any given classifier by allowing an adversarial entity to craft an arbitrary desired explanation. Our approach can be used to scaffold any biased classifier in such a way that its predictions on the input data distribution still remain biased, but the post hoc explanations of the scaffolded classifier…
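The scaffolding idea in the abstract can be illustrated with a minimal sketch. It assumes the scaffold works by routing each input through an out-of-distribution check: rows that look like real data go to the biased model, and rows that look like explainer perturbations go to an innocuous model. Every name here (`ScaffoldedClassifier`, `looks_real`, `biased`, `innocuous`) is hypothetical, the two-feature data is a toy, and the hand-written detector stands in for whatever detector an adversary would actually fit; this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Row = Sequence[float]


def looks_real(x: Row) -> bool:
    # Toy detector (assumption): real rows hold exact 0/1 values,
    # while perturbed rows drift off that grid.
    return all(abs(v - round(v)) < 1e-9 for v in x)


def biased(x: Row) -> int:
    # Decides purely on the sensitive attribute x[0].
    return int(x[0] >= 0.5)


def innocuous(x: Row) -> int:
    # Decides purely on the unrelated feature x[1].
    return int(x[1] >= 0.5)


@dataclass
class ScaffoldedClassifier:
    biased_model: Callable[[Row], int]
    innocuous_model: Callable[[Row], int]
    detector: Callable[[Row], bool]

    def predict(self, x: Row) -> int:
        # In-distribution rows get the biased decision; everything else
        # (for example, explainer perturbations) gets the innocuous one.
        if self.detector(x):
            return self.biased_model(x)
        return self.innocuous_model(x)


clf = ScaffoldedClassifier(biased, innocuous, looks_real)

real_row = (1.0, 0.0)       # looks like real data
perturbed_row = (1.3, 0.2)  # looks like a perturbation of that row

print(clf.predict(real_row))       # decision driven by x[0]
print(clf.predict(perturbed_row))  # decision driven by x[1]
```

Because a perturbation-based explainer mostly queries the model on rows like `perturbed_row`, the explanation it fits in this toy setting would reflect `innocuous`, while predictions on real rows still follow `biased`.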


Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education

Methods: Shapley Additive Explanations · Local Interpretable Model-Agnostic Explanations