Adversarial Attack for Explanation Robustness of Rationalization Models

Yuankai Zhang; Lingxiao Kong; Haozhao Wang; Ruixuan Li; Jun Wang; Yuhua Li; Wei Liu

arXiv:2408.10795·cs.CL·September 20, 2024

Open Access

TL;DR

This paper investigates the robustness of rationalization models against adversarial attacks. It shows that their explanation quality is vulnerable and proposes an attack that degrades the selected rationales without changing the models' predictions.

Contribution

The paper introduces UAT2E, an adversarial attack method that targets the explanation robustness of rationalization models, and offers recommendations for making them more resilient.

Findings

01. Rationalization models are vulnerable to adversarial attacks affecting explanations.
02. Attacks cause models to select more meaningless tokens as rationales.
03. Recommendations are proposed to improve explanation robustness.

Abstract

Rationalization models, which select a subset of the input text as the rationale (crucial for humans to understand and trust predictions), have recently emerged as a prominent research area in eXplainable Artificial Intelligence. However, most previous studies focus on improving the quality of the rationale and ignore its robustness to malicious attacks. Specifically, whether rationalization models can still generate high-quality rationales under adversarial attack remains unknown. To explore this, the paper proposes UAT2E, which aims to undermine the explainability of rationalization models without altering their predictions, thereby eliciting distrust in these models from human users. UAT2E uses a gradient-based search to find triggers and then inserts them into the original input to conduct both non-target and target attacks. Experimental results on five datasets reveal the…
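To make the "gradient-based search on triggers" step more concrete, here is a minimal sketch of one way such a search is often done: a first-order (HotFlip-style) scoring of candidate replacement tokens, followed by insertion of the trigger into the input. The abstract does not specify UAT2E's loss or search procedure, so this is an illustration under assumptions, not the authors' implementation. The function names `rank_trigger_candidates` and `insert_trigger`, the toy embedding matrix, and the gradient values are all hypothetical.

```python
import numpy as np


def rank_trigger_candidates(embedding_matrix, trigger_grad, current_ids, top_k=3):
    """Rank replacement tokens for each trigger position (hypothetical helper).

    embedding_matrix: (V, d) token embeddings.
    trigger_grad:     (T, d) gradient of an attack loss w.r.t. each trigger embedding.
    current_ids:      (T,)   token ids currently in the trigger.

    Uses the linear approximation loss(e') - loss(e) ~= (e' - e) . grad and
    returns, per position, the top_k token ids with the lowest estimated loss.
    """
    current_emb = embedding_matrix[current_ids]                      # (T, d)
    baseline = np.sum(trigger_grad * current_emb, axis=1, keepdims=True)
    delta = trigger_grad @ embedding_matrix.T - baseline             # (T, V)
    return np.argsort(delta, axis=1)[:, :top_k]


def insert_trigger(token_ids, trigger_ids, position):
    """Insert the trigger tokens into the input at the given position."""
    return token_ids[:position] + trigger_ids + token_ids[position:]


# Toy example: 4-token vocabulary with 2-dimensional embeddings (assumed values).
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
toy_grad = np.array([[1.0, 0.0]])   # assumed gradient for a single trigger slot
best = rank_trigger_candidates(toy_embeddings, toy_grad, np.array([0]), top_k=1)
attacked = insert_trigger([5, 6, 7], [int(best[0, 0])], position=1)
```

In a real attack the candidates would then be re-scored with actual forward passes, since the linear estimate is only a guide. For an attack on explanations, the loss would also need a term that keeps the prediction unchanged while shifting the selected rationale.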

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)