Fooling Explanations in Text Classifiers

Adam Ivankay; Ivan Girardi; Chiara Marchiori; Pascal Frossard

arXiv:2206.03178 · cs.LG · June 8, 2022 · 5 cites

TL;DR

This paper demonstrates that explanation methods for text classifiers are vulnerable to imperceptible perturbations, which can significantly alter explanations without changing the classifier's predictions, highlighting the fragility of current explanation techniques.

Contribution

Introduces TextExplanationFooler (TEF), a novel attack algorithm that exposes the fragility of explanation methods in text classifiers across multiple models and datasets.

Findings

01

TEF significantly reduces the correlation between the attributions of original and perturbed inputs, across models and explanation methods.

02

Perturbations transfer effectively to unseen models and explanation methods.

03

A semi-universal attack computes fast, computationally light perturbations without any knowledge of the attacked classifier or explanation method.
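Finding 01 depends on how similar two attribution vectors are before and after an attack. A minimal sketch of one such measure, Spearman rank correlation, is given here. It assumes the attack only substitutes words, so both vectors have the same length. The function names and the example scores are hypothetical, and the paper's exact metric may differ.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (raises) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y):
        raise ValueError("attribution vectors must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-word attribution scores for a four-word input.
original_attr = [0.05, 0.60, 0.10, 0.25]
perturbed_attr = [0.30, 0.10, 0.55, 0.05]
rho = spearman(original_attr, perturbed_attr)
```

A value near 1 means the ranking of important words is preserved, while a low or negative value means the explanation changed substantially even if the predicted class did not.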

Abstract

State-of-the-art text classification models are becoming increasingly reliant on deep neural networks (DNNs). Due to their black-box nature, faithful and robust explanation methods need to accompany classifiers for deployment in real-life scenarios. However, it has been shown in vision applications that explanation methods are susceptible to local, imperceptible perturbations that can significantly alter the explanations without changing the predicted classes. We show here that the existence of such perturbations extends to text classifiers as well. Specifically, we introduce TextExplanationFooler (TEF), a novel explanation attack algorithm that alters text input samples imperceptibly so that the outcome of widely-used explanation methods changes considerably while leaving classifier predictions unchanged. We evaluate the performance of the attribution robustness estimation performance…
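The attack idea described in the abstract can be illustrated with a toy greedy search. This is a sketch under heavy assumptions, not the TEF algorithm itself: the bag-of-words classifier, its weights, the synonym table, the weight-based attributions, and the top-k overlap score are all hypothetical stand-ins chosen to keep the example small. It only shows the two constraints at play: the predicted class must stay the same, and the explanation should move as far as possible.

```python
# Hypothetical toy setup: none of these names or values come from the paper.
WEIGHTS = {"great": 2.0, "good": 0.4, "film": 0.1, "movie": 0.9,
           "plot": 0.2, "story": 0.8}
SYNONYMS = {"great": ["good"], "film": ["movie"], "plot": ["story"]}


def predict(tokens):
    """Toy linear classifier: label 1 if the summed word weights are positive."""
    return 1 if sum(WEIGHTS.get(t, 0.0) for t in tokens) > 0 else 0


def attributions(tokens):
    """Toy explanation: each word's attribution is its classifier weight."""
    return [WEIGHTS.get(t, 0.0) for t in tokens]


def top_k_overlap(a, b, k):
    """Fraction of the k highest-scoring positions shared by a and b."""
    def top(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(order[:k])
    return len(top(a) & top(b)) / k


def greedy_attack(tokens, k=2, max_swaps=2):
    """Greedily swap words for synonyms to lower top-k overlap, keeping the label."""
    label, base_attr = predict(tokens), attributions(tokens)
    current, best = list(tokens), 1.0
    for _ in range(max_swaps):
        best_candidate = None
        for i, word in enumerate(tokens):
            if current[i] != word:
                continue  # each position is substituted at most once
            for synonym in SYNONYMS.get(word, []):
                candidate = current[:i] + [synonym] + current[i + 1:]
                if predict(candidate) != label:
                    continue  # reject swaps that change the prediction
                score = top_k_overlap(base_attr, attributions(candidate), k)
                if score < best:
                    best, best_candidate = score, candidate
        if best_candidate is None:
            break  # no remaining swap lowers the overlap further
        current = best_candidate
    return current, best


tokens = "the film has a great plot".split()
perturbed, overlap = greedy_attack(tokens)
```

In this toy run a single synonym swap keeps the label but changes which words rank among the two most important. A real attack would also need a semantic-similarity constraint on the substitutions, which this sketch omits.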

Videos

Fooling Explanations in Text Classifiers · SlidesLive

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Topic Modeling