Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients

Andrew Slavin Ross; Finale Doshi-Velez

arXiv:1711.09404 · cs.LG · November 28, 2017 · 281 cites

Open Access · 1 Repo

TL;DR

This paper evaluates input gradient regularization as a way to make deep neural networks both more robust to small adversarial perturbations and easier to interpret.

Contribution

It evaluates defenses that differentiably penalize how much small changes in the input can alter a model's predictions (input gradient regularization), testing them across multiple attacks, architectures, defenses, and datasets.

Findings

01. Neural networks with input gradient regularization resist transfer attacks.

02. Adversarial examples for regularized models are more interpretable.

03. Regularization improves the natural interpretability of model rationales.
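
To make the term "transfer attack" in the first finding concrete, here is a minimal sketch in plain Python. It assumes two hypothetical logistic models (`source` and `target`, with made-up weights) and uses the fast gradient sign method (FGSM) as one example attack; this summary does not name the specific attacks the paper uses, so treat the choice as an illustration. The adversarial input is crafted using only the source model's gradient and then evaluated on the target model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a logistic model with weights w and bias b."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(g):
    """Return -1, 0, or 1 according to the sign of g."""
    return (g > 0) - (g < 0)

def fgsm(w, b, x, y, eps):
    """One FGSM step: shift each input feature by eps in the direction
    that increases the cross-entropy loss of the model (w, b)."""
    p = predict(w, b, x)
    # Closed-form gradient of logistic cross-entropy with respect to x.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

# Hypothetical models: the attacker only has access to `source`;
# `target` is the model actually being attacked.
source = ([3.0, -2.0], 0.0)
target = ([2.5, -1.5], 0.1)

x, y = [0.4, 0.2], 1                     # toy input with true label 1
x_adv = fgsm(*source, x, y, eps=0.3)     # crafted against the source only

clean_p = predict(*target, x)            # target's confidence on the clean input
adv_p = predict(*target, x_adv)          # target's confidence on the transferred input
print(clean_p, adv_p)
```

In this toy setup the target's predicted probability for the true class drops below 0.5 on the transferred input, which is what it means for an adversarial example to transfer. A defense that resists transfer attacks would keep the target's prediction stable under such inputs.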

Abstract

Deep neural networks have proven remarkably effective at solving many classification problems, but have been criticized recently for two major weaknesses: the reasons behind their predictions are uninterpretable, and the predictions themselves can often be fooled by small adversarial perturbations. These problems pose major obstacles for the adoption of neural networks in domains that require security or transparency. In this work, we evaluate the effectiveness of defenses that differentiably penalize the degree to which small changes in inputs can alter model predictions. Across multiple attacks, architectures, defenses, and datasets, we find that neural networks trained with this input gradient regularization exhibit robustness to transferred adversarial examples generated to fool all of the other models. We also find that adversarial examples generated to fool gradient-regularized…
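
The phrase "differentiably penalize the degree to which small changes in inputs can alter model predictions" can be illustrated with a small sketch. The code below assumes a single-example logistic model, a penalty equal to the squared L2 norm of the loss gradient with respect to the input, and a hypothetical weighting parameter `lam`; none of these names come from the paper or its repository. For a logistic model the input gradient has a closed form, so no autodiff library is needed here; for a deep network the same quantity would typically be obtained through automatic differentiation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(w, b, x, y):
    """Binary cross-entropy of a logistic model on one example (y is 0 or 1)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def input_gradient(w, b, x, y):
    """Gradient of the cross-entropy with respect to the input x.

    For a logistic model this has the closed form (p - y) * w.
    """
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def regularized_loss(w, b, x, y, lam):
    """Cross-entropy plus lam times the squared L2 norm of the input gradient."""
    g = input_gradient(w, b, x, y)
    return cross_entropy(w, b, x, y) + lam * sum(gi * gi for gi in g)

# Toy values, chosen only for illustration.
w, b = [2.0, -1.0], 0.5
x, y = [0.3, 0.8], 1

plain = regularized_loss(w, b, x, y, lam=0.0)      # ordinary cross-entropy
penalized = regularized_loss(w, b, x, y, lam=0.1)  # adds the gradient penalty
print(plain, penalized)
```

Because the penalty is itself a differentiable function of the model parameters, it can be minimized by gradient descent alongside the usual loss. Minimizing it pushes the model toward predictions that change less under small input perturbations.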

Code & Models

Repositories

dtak/adversarial-robustness-public
tf · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)