Defense Against Explanation Manipulation
Ruixiang Tang, Ninghao Liu, Fan Yang, Na Zou, Xia Hu

TL;DR
This paper introduces ATEX, a novel training approach that enhances the stability of machine learning explanations against manipulation, thereby increasing model robustness and explanation quality.
Contribution
The paper proposes ATEX, a training scheme that improves explanation stability without relying on explanation-specific optimization, addressing manipulation vulnerabilities from the training perspective.
Findings
ATEX increases explanation stability against manipulation.
ATEX improves model robustness and smooths explanations.
ATEX improves the effectiveness of adversarial training.
Abstract
Explainable machine learning attracts increasing attention as it improves the transparency of models, which helps machine learning to be trusted in real applications. However, explanation methods have recently been demonstrated to be vulnerable to manipulation, where we can easily change a model's explanation while keeping its prediction constant. To tackle this problem, some efforts have been made to use more stable explanation methods or to change model configurations. In this work, we tackle the problem from the training perspective, and propose a new training scheme called Adversarial Training on EXplanations (ATEX) to improve the internal explanation stability of a model regardless of the specific explanation method being applied. Instead of directly specifying explanation values over data instances, ATEX only puts requirements on model predictions, which avoids involving…
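The vulnerability the abstract describes (an explanation that changes while the prediction stays the same) can be illustrated with a toy example. The sketch below is not ATEX and not the attack studied in the paper. It assumes a hand-built two-feature ReLU model and uses the input gradient as a stand-in explanation; the model, the inputs, and all function names are hypothetical.

```python
# Toy illustration only: a hand-built model, not ATEX and not the paper's attack.

def relu(z):
    return max(z, 0.0)

def score(x):
    # Hypothetical model: f(x) = relu(x1) + 3 * relu(x2 - 1).
    return relu(x[0]) + 3.0 * relu(x[1] - 1.0)

def predict(x):
    # Predicted class is 1 when the score is positive, else 0.
    return int(score(x) > 0.0)

def saliency(x):
    # Analytic input gradient of the score, used as a simple gradient explanation.
    g1 = 1.0 if x[0] > 0.0 else 0.0
    g2 = 3.0 if x[1] > 1.0 else 0.0
    return [g1, g2]

def top_feature(x):
    # Index of the feature with the largest absolute gradient.
    g = saliency(x)
    return max(range(len(g)), key=lambda i: abs(g[i]))

x_clean = [2.0, 0.9]
x_perturbed = [2.0, 1.1]  # small shift in the second feature crosses a ReLU kink

same_prediction = predict(x_clean) == predict(x_perturbed)
explanation_changed = top_feature(x_clean) != top_feature(x_perturbed)
print(same_prediction, explanation_changed)
```

In this toy case the predicted class is unchanged, but the most important feature under the gradient explanation switches from the first to the second. Only the class label is held fixed here; the raw score still moves slightly.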
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification
