Unfooling Perturbation-Based Post Hoc Explainers

Zachariah Carmichael; Walter J Scheirer

arXiv:2205.14772 · cs.AI · April 13, 2023 · 1 citation

TL;DR

This paper addresses the vulnerability of perturbation-based post hoc explainers like LIME and SHAP to adversarial attacks, proposing algorithms to detect and defend against such attacks to improve AI transparency.

Contribution

The authors formalize the problem of adversarial attacks on explainers and introduce two algorithms, CAD-Detect and CAD-Defend, together with a novel anomaly detection method, to make perturbation-based explanations more robust.

Findings

01. Successfully detects adversarial concealment in black box systems

02. Mitigates adversarial attacks on LIME and SHAP explainers

03. Demonstrates effectiveness on real-world data

Abstract

Monumental advancements in artificial intelligence (AI) have lured the interest of doctors, lenders, judges, and other professionals. While these high-stakes decision-makers are optimistic about the technology, those familiar with AI systems are wary about the lack of transparency of its decision-making processes. Perturbation-based post hoc explainers offer a model agnostic means of interpreting these systems while only requiring query-level access. However, recent work demonstrates that these explainers can be fooled adversarially. This discovery has adverse implications for auditors, regulators, and other sentinels. With this in mind, several natural questions arise - how can we audit these black box systems? And how can we ascertain that the auditee is complying with the audit in good faith? In this work, we rigorously formalize this problem and devise a defense against adversarial…
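The abstract's notion of a perturbation-based explainer that needs only query-level access can be illustrated with a minimal sketch. It is not the authors' CAD-Detect or CAD-Defend, and it is not the LIME or SHAP implementation; it only shows the shared idea of perturbing an input, querying the black box, and fitting a local linear surrogate whose coefficients serve as attributions. The names `toy_black_box`, `explain_by_perturbation`, and `solve_linear_system`, along with the Gaussian perturbation scheme and all parameter values, are assumptions made for illustration.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's inputs are not mutated.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def explain_by_perturbation(black_box, instance, n_samples=500, scale=0.5, seed=0):
    """Fit a local linear surrogate using only queries to black_box.

    Hypothetical helper: Gaussian perturbations and unweighted least
    squares are simplifying assumptions, not the scheme of any library.
    """
    rng = random.Random(seed)
    d = len(instance)
    rows, outputs = [], []
    for _ in range(n_samples):
        perturbed = [v + rng.gauss(0.0, scale) for v in instance]
        # Features are offsets from the instance, plus an intercept column.
        rows.append([p - v for p, v in zip(perturbed, instance)] + [1.0])
        outputs.append(black_box(perturbed))  # query-level access only
    k = d + 1
    # Normal equations (X^T X) w = X^T y for ordinary least squares.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, outputs)) for i in range(k)]
    coef = solve_linear_system(xtx, xty)
    return coef[:d]  # drop the intercept; keep per-feature attributions


def toy_black_box(x):
    """Stand-in for an opaque model: feature 2 has no influence."""
    return 3.0 * x[0] - 2.0 * x[1] + 0.0 * x[2]


attributions = explain_by_perturbation(toy_black_box, [1.0, 2.0, 3.0])
print([round(a, 3) for a in attributions])
```

Because the explainer sees the system only through its answers to perturbed queries, a system that answers those queries differently from real inputs can steer the resulting attributions. That is the kind of adversarial concealment the paper sets out to detect and defend against.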

Code & Models

Repositories

craymichael/unfooling
Official

Videos

Unfooling Perturbation-Based Post Hoc Explainers · Underline

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications · Explainable Artificial Intelligence (XAI)

Methods: Shapley Additive Explanations · High-Order Consensuses · Local Interpretable Model-Agnostic Explanations