Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods
Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, Himabindu Lakkaraju

TL;DR
This paper reveals that popular explanation methods like LIME and SHAP can be easily fooled by adversarial techniques that hide biases, raising concerns about their reliability in critical domains.
Contribution
The authors introduce a novel scaffolding approach that can manipulate explanations of any classifier without altering its biased predictions, exposing vulnerabilities in explanation methods.
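The scaffolding idea can be illustrated with a toy sketch. It assumes the scaffold routes each input through a detector that separates real inputs from perturbed ones. This summary does not spell that detail out, and the hand-written detector, feature layout, and function names below are hypothetical stand-ins, not the authors' implementation.

```python
import random

def biased_model(x):
    # Decides only on the sensitive feature x[0].
    return 1 if x[0] > 0.5 else 0

def innocuous_model(x):
    # Decides only on an unrelated feature x[1].
    return 1 if x[1] > 0.5 else 0

def looks_real(x, tol=0.05):
    # Hypothetical detector (assumption): in this toy data, x[2] always
    # tracks x[0], so a large gap signals a synthetic, perturbed input.
    return abs(x[2] - x[0]) <= tol

def scaffolded_model(x):
    # Real-looking inputs get the biased prediction; everything else
    # gets the innocuous one.
    return biased_model(x) if looks_real(x) else innocuous_model(x)

def sample_real(rng):
    # Toy "real" data: feature 2 duplicates the sensitive feature 0.
    s = rng.random()
    return [s, rng.random(), s]

def perturb(x, rng):
    # Crude stand-in for perturbation-based sampling: each feature is
    # independently resampled with probability 0.5.
    return [rng.random() if rng.random() < 0.5 else v for v in x]

rng = random.Random(0)
real = [sample_real(rng) for _ in range(1000)]
perturbed = [perturb(x, rng) for x in real]

# Fraction of real inputs where the scaffold matches the biased model.
agree_biased_on_real = sum(
    scaffolded_model(x) == biased_model(x) for x in real
) / len(real)

# Fraction of perturbed inputs where the scaffold matches the innocuous model.
agree_innocuous_on_perturbed = sum(
    scaffolded_model(x) == innocuous_model(x) for x in perturbed
) / len(perturbed)

print(agree_biased_on_real, agree_innocuous_on_perturbed)
```

On real inputs the scaffold reproduces the biased predictions exactly. On perturbed inputs it mostly follows the innocuous model, so an explainer that fits a surrogate to perturbed samples would tend to attribute the decision to the unrelated feature rather than the sensitive one.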
Findings
Adversarial scaffolding can hide biases from explanations
LIME and SHAP can be fooled into providing innocuous explanations
Biased classifiers can be made to appear unbiased in explanations
Abstract
As machine learning black boxes are increasingly being deployed in domains such as healthcare and criminal justice, there is growing emphasis on building tools and techniques for explaining these black boxes in an interpretable manner. Such explanations are being leveraged by domain experts to diagnose systematic errors and underlying biases of black boxes. In this paper, we demonstrate that post hoc explanation techniques that rely on input perturbations, such as LIME and SHAP, are not reliable. Specifically, we propose a novel scaffolding technique that effectively hides the biases of any given classifier by allowing an adversarial entity to craft an arbitrary desired explanation. Our approach can be used to scaffold any biased classifier in such a way that its predictions on the input data distribution still remain biased, but the post hoc explanations of the scaffolded classifier…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education
Methods: Shapley Additive Explanations · Local Interpretable Model-Agnostic Explanations
