When and How to Fool Explainable Models (and Humans) with Adversarial Examples
Jon Vadillo, Roberto Santana, Jose A. Lozano

TL;DR
This paper explores the potential and limitations of adversarial attacks on explainable machine learning models, emphasizing human assessment and proposing a comprehensive framework for generating such attacks.
Contribution
It introduces a novel framework for studying adversarial examples in explainable models, considering human factors and diverse attack scenarios.
Findings
Extends the concept of adversarial examples to explainable models.
Proposes a comprehensive attack framework that accounts for human assessment.
Illustrates novel attack paradigms for deceiving explainable models.
Abstract
Reliable deployment of machine learning models such as neural networks continues to be challenging due to several limitations. Some of the main shortcomings are the lack of interpretability and the lack of robustness against adversarial examples or out-of-distribution inputs. In this exploratory review, we examine the possibilities and limits of adversarial attacks for explainable machine learning models. First, we extend the notion of adversarial examples to fit in explainable machine learning scenarios, in which the inputs, the output classifications and the explanations of the model's decisions are assessed by humans. Next, we propose a comprehensive framework to study whether (and how) adversarial examples can be generated for explainable models under human assessment, introducing and illustrating novel attack paradigms. In particular, our framework considers a wide range of…
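To make the three human-assessed quantities in the abstract concrete (the input, the output classification, and the explanation), here is a minimal toy sketch. It is not the authors' framework: the linear classifier, the gradient-times-input attribution, the sign-based perturbation step, and every name in it (`score`, `attribution`, `sign_step`, `assess`) are assumptions chosen purely for illustration.

```python
# Toy illustration only: a hypothetical linear classifier with a
# gradient-times-input explanation. None of this comes from the paper.

def score(w, b, x):
    """Linear decision score w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Binary label: 1 if the score is positive, else 0."""
    return 1 if score(w, b, x) > 0 else 0

def attribution(w, x):
    """Per-feature attribution; for a linear model, gradient * input = w_i * x_i."""
    return [wi * xi for wi, xi in zip(w, x)]

def sign(v):
    return (v > 0) - (v < 0)

def sign_step(w, b, x, eps):
    """Shift every feature by eps in the direction that pushes the score
    toward the opposite class (a sign-of-gradient style step)."""
    direction = -1 if predict(w, b, x) == 1 else 1
    return [xi + direction * eps * sign(wi) for wi, xi in zip(w, x)]

def top_feature(attr):
    """Index of the feature with the largest absolute attribution."""
    return max(range(len(attr)), key=lambda i: abs(attr[i]))

def assess(w, b, x, x_adv):
    """Summarize what an observer could compare: input, label, explanation."""
    return {
        "max_input_change": max(abs(a - c) for a, c in zip(x, x_adv)),
        "label_changed": predict(w, b, x) != predict(w, b, x_adv),
        "top_feature_changed": top_feature(attribution(w, x))
        != top_feature(attribution(w, x_adv)),
    }

# Hypothetical weights and input, chosen by hand for the example.
w = [2.0, -1.0, 0.5]
b = -0.2
x = [0.5, 0.3, 0.2]

x_adv = sign_step(w, b, x, eps=0.2)
report = assess(w, b, x, x_adv)
print(report)
```

In this hand-picked setup the predicted label flips while the most strongly attributed feature stays the same, which is one of several combinations an attacker might aim for. A realistic study would replace the top-feature check with whatever comparison of explanations a human assessor would actually make.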
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Anomaly Detection Techniques and Applications
