Bridging Adversarial Robustness and Gradient Interpretability

Beomsu Kim; Junghoon Seo; Taegyun Jeon

arXiv:1903.11626 · cs.LG · April 22, 2019 · 26 cites

TL;DR

This paper explores how adversarial training improves the interpretability of loss gradients in neural networks by aligning them with human perception, and discusses the trade-offs involved.

Contribution

It provides a theoretical explanation linking adversarial robustness to gradient interpretability and proposes methods to balance accuracy and interpretability.

Findings

01. Adversarial training aligns gradients closer to the image manifold.

02. Gradients from adversarially trained models are more meaningful.

03. There exists a trade-off between accuracy and gradient interpretability.

Abstract

Adversarial training is a training scheme designed to counter adversarial attacks by augmenting the training dataset with adversarial examples. Surprisingly, several studies have observed that loss gradients from adversarially trained DNNs are visually more interpretable than those from standard DNNs. Although this phenomenon is interesting, only a few works have offered an explanation. In this paper, we attempted to bridge this gap between adversarial robustness and gradient interpretability. To this end, we identified that loss gradients from adversarially trained DNNs align better with human perception because adversarial training restricts gradients closer to the image manifold. We then demonstrated that adversarial training causes loss gradients to be quantitatively meaningful. Finally, we showed that under the adversarial training framework, there exists an empirical…
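
As a rough illustration of the two ideas the abstract relies on (the loss gradient with respect to the input, and training on adversarial examples), here is a minimal sketch. It is not the authors' implementation: the logistic-regression model, the single-step sign-gradient (FGSM-style) attack, and every function name and hyperparameter below are assumptions chosen for brevity, and the code uses only the Python standard library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    # Probability of class 1 under a logistic-regression model (assumed model).
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss(w, b, x, y):
    # Binary cross-entropy for a single example with label y in {0, 1}.
    p = predict(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def loss_gradient_wrt_input(w, b, x, y):
    # d(loss)/dx for logistic regression is (p - y) * w.
    # This input gradient is the quantity visualised as a saliency map.
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)


def sign_gradient_attack(w, b, x, y, eps):
    # Assumed attack: one step of size eps along the sign of the input gradient,
    # which keeps the perturbation inside an L-infinity ball of radius eps.
    g = loss_gradient_wrt_input(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]


def adversarial_training_step(w, b, x, y, eps, lr):
    # Craft an adversarial example from the current parameters, then take one
    # gradient-descent step on the loss evaluated at that adversarial example.
    x_adv = sign_gradient_attack(w, b, x, y, eps)
    err = predict(w, b, x_adv) - y
    w_new = [wi - lr * err * xi for wi, xi in zip(w, x_adv)]
    b_new = b - lr * err
    return w_new, b_new


# Hypothetical toy values, used only to exercise the functions.
w, b = [1.0, -2.0], 0.0
x, y = [0.5, 0.5], 1
x_adv = sign_gradient_attack(w, b, x, y, eps=0.1)
w2, b2 = adversarial_training_step(w, b, x, y, eps=0.1, lr=0.1)
```

In this toy setting the attack raises the loss on the perturbed input, and one training step on that input lowers it again. The paper's claims concern deep networks and image data, which this sketch does not attempt to reproduce.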

Code & Models

Repositories

1202kbs/Robustness-and-Interpretability
tf · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications · Domain Adaptation and Few-Shot Learning

Methods: Interpretability