The Intriguing Relation Between Counterfactual Explanations and Adversarial Examples
Timo Freiesleben

TL;DR
This paper explores the relationship between counterfactual explanations and adversarial examples, highlighting their differences and similarities, and proposes a unified mathematical framework to analyze both concepts.
Contribution
The paper introduces a formal framework that distinguishes CEs from AEs by two properties: their relationship to the true label and their tolerance with respect to proximity. It also analyzes how current methods for generating CEs and AEs are connected.
Findings
CEs and AEs can be generated using similar techniques.
CEs and AEs differ in their relationship to the true label and in how much distance from the original input they tolerate.
The fields of CEs and AEs are likely to converge as their applications overlap.
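The first finding above can be illustrated with a minimal sketch. It assumes a toy logistic-regression classifier and a distance-plus-prediction-loss objective of the kind commonly used for both tasks; the weights, the input, and the function name `find_nearby_flip` are hypothetical and are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def find_nearby_flip(x, w, b, target=1.0, lam=10.0, lr=0.05, steps=500):
    """Gradient descent on lam * (f(x') - target)^2 + ||x' - x||^2,
    where f(x') = sigmoid(w @ x' + b) is an assumed toy classifier."""
    x_new = x.astype(float).copy()
    for _ in range(steps):
        p = sigmoid(w @ x_new + b)
        # Gradient of the prediction-loss term with respect to x'
        grad_loss = 2.0 * lam * (p - target) * p * (1.0 - p) * w
        # Gradient of the squared Euclidean distance term
        grad_dist = 2.0 * (x_new - x)
        x_new -= lr * (grad_loss + grad_dist)
    return x_new

# Hypothetical model parameters and input (illustration only)
w = np.array([1.5, -2.0])
b = -0.5
x = np.array([0.0, 1.0])           # classified as class 0 by the toy model

x_new = find_nearby_flip(x, w, b)  # nearby point classified as class 1
p_before = sigmoid(w @ x + b)
p_after = sigmoid(w @ x_new + b)
distance = np.linalg.norm(x_new - x)
```

The search itself does not say whether `x_new` is a counterfactual explanation or an adversarial example. Under the distinction drawn in this paper, that depends on how the new prediction relates to the true label of `x_new` and on how strictly proximity to `x` is required.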
Abstract
The same method that creates adversarial examples (AEs) to fool image-classifiers can be used to generate counterfactual explanations (CEs) that explain algorithmic decisions. This observation has led researchers to consider CEs as AEs by another name. We argue that the relationship to the true label and the tolerance with respect to proximity are two properties that formally distinguish CEs and AEs. Based on these arguments, we introduce CEs, AEs, and related concepts mathematically in a common framework. Furthermore, we show connections between current methods for generating CEs and AEs, and estimate that the fields will merge more and more as the number of common use-cases grows.
