Bridging Adversarial Robustness and Gradient Interpretability
Beomsu Kim, Junghoon Seo, Taegyun Jeon

TL;DR
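This paper explains why loss gradients from adversarially trained neural networks are more interpretable: adversarial training restricts the gradients closer to the image manifold, so they align better with human perception. It also discusses the resulting trade-off between accuracy and gradient interpretability.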
Contribution
It provides a theoretical explanation linking adversarial robustness to gradient interpretability and proposes methods to balance accuracy and interpretability.
Findings
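Adversarial training restricts loss gradients closer to the image manifold, which is why they align better with human perception.
Loss gradients from adversarially trained models are quantitatively meaningful, not only visually more interpretable.
Under the adversarial training framework, there is an empirical trade-off between accuracy and gradient interpretability.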
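To make "loss gradient" concrete, here is a minimal sketch of the gradient of a cross-entropy loss with respect to the input, reshaped into an image-like saliency map. It is an illustration only, not the paper's setup: the linear softmax classifier, the 4x4 input size, and the names input_loss_gradient, cross_entropy, and saliency are all assumptions chosen so the gradient can be written in closed form with NumPy alone.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(W, b, x, y):
    # Loss of a linear softmax classifier (a toy stand-in for a DNN).
    return -np.log(softmax(W @ x + b)[y])

def input_loss_gradient(W, b, x, y):
    # d(loss)/dx = W^T (softmax(Wx + b) - onehot(y))
    p = softmax(W @ x + b)
    p[y] -= 1.0
    return W.T @ p

# Hypothetical toy data: 3 classes, a 16-pixel "image".
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))
b = rng.normal(size=3)
x = rng.normal(size=16)
y = 1

# The loss gradient has one entry per input pixel, so it can be
# viewed as an image; this is what gets inspected visually.
saliency = input_loss_gradient(W, b, x, y).reshape(4, 4)
```

For a deep network the same quantity would normally be obtained by automatic differentiation rather than a hand-written formula; the closed form here just keeps the example short.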
Abstract
Adversarial training is a training scheme designed to counter adversarial attacks by augmenting the training dataset with adversarial examples. Surprisingly, several studies have observed that loss gradients from adversarially trained DNNs are visually more interpretable than those from standard DNNs. Although this phenomenon is interesting, only a few works have offered an explanation. In this paper, we attempted to bridge this gap between adversarial robustness and gradient interpretability. To this end, we identified that loss gradients from adversarially trained DNNs align better with human perception because adversarial training restricts gradients closer to the image manifold. We then demonstrated that adversarial training causes loss gradients to be quantitatively meaningful. Finally, we showed that under the adversarial training framework, there exists an empirical…
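The idea of augmenting the training data with adversarial examples can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in the paper: it uses a binary logistic-regression model, a single-step sign-of-gradient (FGSM-style) perturbation, and hypothetical names and values (fgsm, adversarial_training_step, eps = 0.3, lr = 0.5) on synthetic two-blob data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_example_loss(w, b, X, y):
    # Logistic loss with labels y in {-1, +1}.
    return np.log1p(np.exp(-y * (X @ w + b)))

def fgsm(w, b, X, y, eps):
    # Move each input by eps in the sign of the loss gradient w.r.t. the input.
    margin = y * (X @ w + b)
    grad_x = (-y * sigmoid(-margin))[:, None] * w[None, :]
    return X + eps * np.sign(grad_x)

def adversarial_training_step(w, b, X, y, eps, lr):
    # Augment the batch with adversarial examples, then take one
    # gradient-descent step on the combined batch.
    X_aug = np.concatenate([X, fgsm(w, b, X, y, eps)])
    y_aug = np.concatenate([y, y])
    coef = -y_aug * sigmoid(-y_aug * (X_aug @ w + b))
    grad_w = X_aug.T @ coef / len(y_aug)
    grad_b = coef.mean()
    return w - lr * grad_w, b - lr * grad_b

# Hypothetical toy data: two Gaussian blobs in 2D.
rng = np.random.default_rng(0)
n = 100
X = np.concatenate([rng.normal(2.0, 1.0, size=(n, 2)),
                    rng.normal(-2.0, 1.0, size=(n, 2))])
y = np.concatenate([np.ones(n), -np.ones(n)])

w, b, eps = np.zeros(2), 0.0, 0.3
for _ in range(200):
    w, b = adversarial_training_step(w, b, X, y, eps, lr=0.5)

X_adv = fgsm(w, b, X, y, eps)
acc = np.mean(np.sign(X @ w + b) == y)  # clean accuracy after training
```

Each adversarial example stays within eps of its clean input in every coordinate, and for this linear model the perturbation can only increase the per-example loss. Stronger multi-step attacks are a common alternative to the single-step perturbation used here.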
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications · Domain Adaptation and Few-Shot Learning
Methods: Interpretability
