Explanation-Guided Adversarial Training for Robust and Interpretable Models

Chao Chen; Yanhui Chen; Shanshan Lin; Dongsheng Hong; Shu Wu; Xiangwen Liao; Chuanyi Liu

arXiv:2603.01938 · cs.LG · March 3, 2026

Open Access

TL;DR

This paper introduces Explanation-Guided Adversarial Training (EGAT), a novel method that enhances neural network robustness and interpretability by combining adversarial training with explanation-based constraints, leading to more stable and human-understandable models.

Contribution

EGAT unifies adversarial training and explanation-guided learning to improve model robustness, interpretability, and performance against adversarial and out-of-distribution inputs.

Findings

1. EGAT outperforms baselines, with a +37% gain in accuracy.

2. EGAT produces more meaningful explanations.

3. EGAT requires only 16% more training time.

Abstract

Deep neural networks (DNNs) have achieved remarkable performance on many tasks, yet they often behave as opaque black boxes. Explanation-guided learning (EGL) methods steer DNNs using human-provided explanations or supervision on model attributions. These approaches improve interpretability but typically assume benign inputs and incur heavy annotation costs. In practice, however, both the predictions and the saliency maps of DNNs can change dramatically under imperceptible perturbations or unseen patterns. Adversarial training (AT) can substantially improve robustness, but it does not guarantee that model decisions rely on semantically meaningful features. In response, we propose Explanation-Guided Adversarial Training (EGAT), a unified framework that integrates the strengths of AT and EGL to simultaneously improve prediction performance, robustness, and explanation quality. EGAT generates adversarial…
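
To make the general idea of pairing an adversarial loss with an explanation constraint concrete, here is a minimal sketch in plain Python. It is not the authors' EGAT objective, which the truncated abstract does not specify. Everything below is an assumption chosen for illustration: a logistic-regression model, a single-step FGSM attack, gradient-times-input attributions, a hypothetical `irrelevant` feature mask, and a weighting factor `lam`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # Probability of class 1 under a logistic-regression model (illustrative only).
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def cross_entropy(p, y):
    # Binary cross-entropy, clamped to avoid log(0).
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def input_gradient(w, b, x, y):
    # Gradient of the cross-entropy loss with respect to the input x.
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def fgsm(w, b, x, y, eps):
    # One-step L-infinity attack: move each feature by eps in the
    # direction that increases the loss.
    grad = input_gradient(w, b, x, y)
    return [xi + eps * ((gi > 0) - (gi < 0)) for xi, gi in zip(x, grad)]

def attribution(w, x):
    # Gradient-times-input attribution of the logit (assumed explanation method).
    return [wi * xi for wi, xi in zip(w, x)]

def explanation_penalty(w, x, irrelevant):
    # Squared attribution mass on features flagged as irrelevant
    # by a hypothetical mask (1 = should not be used).
    return sum(a * a for a, m in zip(attribution(w, x), irrelevant) if m)

def combined_loss(w, b, x, y, eps, lam, irrelevant):
    # Adversarial loss plus lam times the explanation penalty,
    # both evaluated on the perturbed input.
    x_adv = fgsm(w, b, x, y, eps)
    robust = cross_entropy(predict(w, b, x_adv), y)
    return robust + lam * explanation_penalty(w, x_adv, irrelevant)
```

Calling `combined_loss([2.0, -1.0, 0.5], 0.0, [1.0, 1.0, 1.0], 1, eps=0.1, lam=2.0, irrelevant=[0, 0, 1])` evaluates the objective for one example. Setting `lam=0.0` recovers a plain adversarial loss, which makes it easy to see how much the explanation term contributes in this toy setting.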

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis