Fooling Explanations in Text Classifiers
Adam Ivankay, Ivan Girardi, Chiara Marchiori, Pascal Frossard

TL;DR
This paper demonstrates that explanation methods for text classifiers are vulnerable to imperceptible perturbations, which can significantly alter explanations without changing the classifier's predictions, highlighting the fragility of current explanation techniques.
Contribution
Introduces TextExplanationFooler (TEF), a novel attack algorithm that exposes the fragility of explanation methods in text classifiers across multiple models and datasets.
Findings
TEF significantly reduces attribution correlation across models and explanation methods.
Perturbations transfer effectively to unseen models and explanation methods.
A semi-universal attack computes perturbations quickly and at low computational cost, without knowledge of the attacked model or explanation method.
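To make "attribution correlation" concrete, the sketch below compares two per-token attribution vectors with Spearman rank correlation. This is an illustration only: the summary above does not say which correlation metrics the paper reports, so the choice of Spearman, the helper names (`ranks`, `pearson`, `spearman`), and the example attribution values are all assumptions.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical attributions for a 4-token input, before and after a perturbation.
original_attr = [0.9, 0.1, 0.4, 0.05]
perturbed_attr = [0.05, 0.4, 0.1, 0.9]

unchanged = spearman(original_attr, original_attr)   # identical ranking
reversed_ = spearman(original_attr, perturbed_attr)  # fully reversed ranking
```

A value near 1 means the perturbed input ranks tokens the same way as the original; a low or negative value means the explanation has changed substantially even if the predicted class has not.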
Abstract
State-of-the-art text classification models are becoming increasingly reliant on deep neural networks (DNNs). Due to their black-box nature, faithful and robust explanation methods need to accompany classifiers for deployment in real-life scenarios. However, it has been shown in vision applications that explanation methods are susceptible to local, imperceptible perturbations that can significantly alter the explanations without changing the predicted classes. We show here that the existence of such perturbations extends to text classifiers as well. Specifically, we introduce TextExplanationFooler (TEF), a novel explanation attack algorithm that alters text input samples imperceptibly so that the outcome of widely-used explanation methods changes considerably while leaving classifier predictions unchanged. We evaluate the performance of the attribution robustness estimation performance…
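The attack goal described in the abstract (change the explanation, keep the prediction) can be illustrated with a toy example. The sketch below is not TEF: the abstract does not specify the algorithm, so this uses a plain greedy synonym substitution against a hypothetical bag-of-words classifier. The weights, the synonym table, the L2 distance, and every function name are assumptions made up for illustration.

```python
# Hypothetical per-word weights of a toy linear sentiment classifier.
WEIGHTS = {"great": 2.0, "good": 1.0, "fine": 0.3, "movie": 0.1,
           "film": 0.4, "really": 0.5, "truly": 0.3, "bad": -2.0}
# Hypothetical table of substitutions treated as meaning-preserving.
SYNONYMS = {"great": ["good", "fine"], "movie": ["film"], "really": ["truly"]}


def predict(tokens):
    """Toy classifier: label 1 if the summed word weights are positive."""
    return 1 if sum(WEIGHTS.get(t, 0.0) for t in tokens) > 0 else 0


def explain(tokens):
    """Toy attribution: each token's own weight (unknown words get 0)."""
    return [WEIGHTS.get(t, 0.0) for t in tokens]


def l2_distance(a, b):
    """Euclidean distance between two attribution vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def greedy_explanation_attack(tokens, max_swaps=2):
    """Greedily swap words to move the attribution away from the original,
    rejecting any swap that would change the predicted label."""
    original_label = predict(tokens)
    original_attr = explain(tokens)
    adversarial = list(tokens)
    swapped = set()
    for _ in range(max_swaps):
        current = l2_distance(original_attr, explain(adversarial))
        best = None  # (distance, position, candidate word)
        for pos, tok in enumerate(tokens):
            if pos in swapped:
                continue  # each position is substituted at most once
            for cand in SYNONYMS.get(tok, []):
                trial = adversarial[:pos] + [cand] + adversarial[pos + 1:]
                if predict(trial) != original_label:
                    continue  # constraint: prediction must stay unchanged
                dist = l2_distance(original_attr, explain(trial))
                if dist > current and (best is None or dist > best[0]):
                    best = (dist, pos, cand)
        if best is None:
            break  # no remaining swap increases the distance
        _, pos, cand = best
        adversarial[pos] = cand
        swapped.add(pos)
    return adversarial


sentence = ["the", "movie", "was", "really", "great"]
attacked = greedy_explanation_attack(sentence)
```

In this toy setup the attacked sentence keeps the same predicted label while its attribution vector moves away from the original. A real attack on a DNN would need a real classifier, a real explanation method, and a stronger notion of imperceptibility than a fixed synonym table.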
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Topic Modeling
