Explanation-Guided Adversarial Training for Robust and Interpretable Models
Chao Chen, Yanhui Chen, Shanshan Lin, Dongsheng Hong, Shu Wu, Xiangwen Liao, Chuanyi Liu

TL;DR
This paper introduces Explanation-Guided Adversarial Training (EGAT), a novel method that enhances neural network robustness and interpretability by combining adversarial training with explanation-based constraints, leading to more stable and human-understandable models.
Contribution
EGAT unifies adversarial training and explanation-guided learning to improve model robustness, interpretability, and performance against adversarial and out-of-distribution inputs.
Findings
EGAT improves accuracy over baselines by 37%.
EGAT produces more meaningful explanations.
EGAT requires only 16% more training time.
Abstract
Deep neural networks (DNNs) have achieved remarkable performance in many tasks, yet they often behave as opaque black boxes. Explanation-guided learning (EGL) methods steer DNNs using human-provided explanations or supervision on model attributions. These approaches improve interpretability but typically assume benign inputs and incur heavy annotation costs. However, both the predictions and the saliency maps of DNNs can change dramatically under imperceptible perturbations or unseen patterns. Adversarial training (AT) can substantially improve robustness, but it does not guarantee that model decisions rely on semantically meaningful features. In response, we propose Explanation-Guided Adversarial Training (EGAT), a unified framework that integrates the strengths of AT and EGL to simultaneously improve prediction performance, robustness, and explanation quality. EGAT generates adversarial…
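The general idea of pairing adversarial training with an explanation constraint can be illustrated with a toy example. The sketch below is not the EGAT objective, which the abstract excerpt above does not specify. It assumes a two-feature logistic model as a stand-in for a DNN, FGSM as the attack, input gradients as the saliency map, and a squared-difference penalty between clean and adversarial saliency. All function names and the `eps` and `lam` parameters are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a toy logistic model (stand-in for a DNN)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def cross_entropy(p, y):
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def saliency(w, b, x, y):
    """Input-gradient attribution: d(cross-entropy)/dx = (p - y) * w for this model."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One-step FGSM: move each input feature by eps in the loss-increasing direction."""
    g = saliency(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def combined_loss(w, b, x, y, eps=0.1, lam=1.0):
    """Adversarial cross-entropy plus an assumed explanation-consistency penalty."""
    x_adv = fgsm(w, b, x, y, eps)
    adv_ce = cross_entropy(predict(w, b, x_adv), y)
    s_clean = saliency(w, b, x, y)
    s_adv = saliency(w, b, x_adv, y)
    # Penalize saliency maps that shift under the perturbation.
    gap = sum((a - c) ** 2 for a, c in zip(s_adv, s_clean))
    return adv_ce + lam * gap, adv_ce, gap

w, b = [1.0, -2.0], 0.0
x, y = [0.5, 0.5], 1
total, adv_ce, gap = combined_loss(w, b, x, y, eps=0.1, lam=1.0)
print(total, adv_ce, gap)
```

In this toy setting, `lam` trades off robustness to the perturbation against stability of the attribution. A real implementation would compute both gradients with an autodiff framework and minimize the combined loss over the model parameters.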
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis
