Unfooling Perturbation-Based Post Hoc Explainers
Zachariah Carmichael, Walter J Scheirer

TL;DR
This paper addresses the vulnerability of perturbation-based post hoc explainers like LIME and SHAP to adversarial attacks, proposing algorithms to detect and defend against such attacks to improve AI transparency.
Contribution
The authors formalize the problem of adversarial attacks on explainers and introduce the CAD-Detect and CAD-Defend algorithms, along with a novel anomaly detection method, to make perturbation-based explanations more robust.
Findings
Successfully detects adversarial concealment in black box systems
Mitigates adversarial attacks on LIME and SHAP explainers
Demonstrates effectiveness on real-world data
Abstract
Monumental advancements in artificial intelligence (AI) have lured the interest of doctors, lenders, judges, and other professionals. While these high-stakes decision-makers are optimistic about the technology, those familiar with AI systems are wary about the lack of transparency of its decision-making processes. Perturbation-based post hoc explainers offer a model-agnostic means of interpreting these systems while only requiring query-level access. However, recent work demonstrates that these explainers can be fooled adversarially. This discovery has adverse implications for auditors, regulators, and other sentinels. With this in mind, several natural questions arise: how can we audit these black box systems? And how can we ascertain that the auditee is complying with the audit in good faith? In this work, we rigorously formalize this problem and devise a defense against adversarial…
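To make the abstract's setting concrete, the following is a minimal sketch of a perturbation-based explainer with query-level access only, and of one way such an explainer could be fooled. It is an illustration under stated assumptions, not the LIME or SHAP implementation and not the paper's CAD-Detect or CAD-Defend algorithms. The function names (`explain_locally`, `biased`, `innocuous`, `scaffolded`), the toy three-feature data, and the rule "real inputs have feature 2 equal to feature 0" are all hypothetical. The attack shown assumes the adversary can tell perturbed queries from real ones and answers them with a different model.

```python
import numpy as np

def explain_locally(predict, x, n_samples=2000, scale=1.0, kernel_width=1.0, seed=0):
    """LIME-style sketch: fit a proximity-weighted linear surrogate around x.

    `predict` is treated as a black box mapping an (n, d) array to n outputs.
    Returns one coefficient per feature (the intercept is dropped).
    """
    rng = np.random.default_rng(seed)
    # Perturb the input with Gaussian noise and query the black box.
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))
    y = predict(Z)
    # Weight each perturbed sample by its proximity to x.
    d2 = np.sum((Z - x) ** 2, axis=1)
    w = np.exp(-d2 / kernel_width ** 2)
    # Weighted least squares via row scaling by sqrt(w).
    A = np.hstack([Z - x, np.ones((n_samples, 1))])
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(A * sw, y * sw[:, 0], rcond=None)
    return coef[:-1]

def biased(Z):
    # Hypothetical model that depends only on feature 0 (the "sensitive" one).
    return Z[:, 0]

def innocuous(Z):
    # Hypothetical harmless-looking model that depends only on feature 1.
    return Z[:, 1]

def scaffolded(Z, tol=0.05):
    # Assumption for this toy: real inputs satisfy feature 2 == feature 0.
    # Queries that break this relation look like explainer perturbations,
    # so the adversary answers them with the innocuous model instead.
    looks_real = np.abs(Z[:, 2] - Z[:, 0]) < tol
    return np.where(looks_real, biased(Z), innocuous(Z))

x = np.array([1.0, 0.5, 1.0])  # a "real" input: feature 2 equals feature 0

honest = explain_locally(biased, x)      # explanation of the undisguised model
fooled = explain_locally(scaffolded, x)  # explanation of the scaffolded model

print("undisguised model:", np.round(honest, 3))
print("scaffolded model: ", np.round(fooled, 3))
print("same prediction on the real input:",
      scaffolded(x[None, :])[0] == biased(x[None, :])[0])
```

In this toy, the scaffolded model still returns the biased prediction on the real input, yet most of the explainer's perturbed queries are routed to the innocuous model, so the surrogate attributes the decision mainly to feature 1. This gap between behavior on real inputs and behavior on perturbed queries is the kind of concealment an auditor would need to detect.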
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications · Explainable Artificial Intelligence (XAI)
Methods: Shapley Additive Explanations · High-Order Consensuses · Local Interpretable Model-Agnostic Explanations
