Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients
Andrew Slavin Ross, Finale Doshi-Velez

TL;DR
This paper proposes input gradient regularization to make deep neural networks more robust to adversarial attacks and to make the reasons behind their predictions easier to interpret.
Contribution
It evaluates a differentiable penalty on input gradients and finds that it improves adversarial robustness and interpretability across multiple attacks, architectures, and datasets.
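To make the idea concrete, here is a minimal sketch of an input gradient penalty on a toy logistic regression model. It assumes the penalty is the squared L2 norm of the gradient of the cross-entropy loss with respect to the input, scaled by a hypothetical strength parameter `lam`. The model, the function names, and the parameter values are illustrative assumptions, not the authors' implementation. For a deep network the input gradient would come from automatic differentiation rather than the closed form used here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    # Probability of class 1 under a logistic regression model.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def cross_entropy(w, b, x, y):
    # Binary cross-entropy for a single example with label y in {0, 1}.
    p = predict(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def input_gradient(w, b, x, y):
    # For logistic regression, d(cross_entropy)/dx = (p - y) * w.
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]


def regularized_loss(w, b, x, y, lam):
    # Cross-entropy plus lam times the squared L2 norm of the input gradient
    # (assumed form of the penalty; lam is a hypothetical hyperparameter).
    grad = input_gradient(w, b, x, y)
    penalty = sum(g * g for g in grad)
    return cross_entropy(w, b, x, y) + lam * penalty


if __name__ == "__main__":
    w, b = [0.5, -1.0], 0.1
    x, y = [1.0, 2.0], 1
    print("plain loss:      ", cross_entropy(w, b, x, y))
    print("regularized loss:", regularized_loss(w, b, x, y, lam=1.0))
```

Minimizing the regularized loss pushes the model toward parameters for which a small change in `x` changes the loss only slightly, which is the property the summary above associates with robustness.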
Findings
Neural networks with input gradient regularization resist transfer attacks.
Adversarial examples for regularized models are more interpretable.
Regularization improves the natural interpretability of model rationales.
Abstract
Deep neural networks have proven remarkably effective at solving many classification problems, but have been criticized recently for two major weaknesses: the reasons behind their predictions are uninterpretable, and the predictions themselves can often be fooled by small adversarial perturbations. These problems pose major obstacles for the adoption of neural networks in domains that require security or transparency. In this work, we evaluate the effectiveness of defenses that differentiably penalize the degree to which small changes in inputs can alter model predictions. Across multiple attacks, architectures, defenses, and datasets, we find that neural networks trained with this input gradient regularization exhibit robustness to transferred adversarial examples generated to fool all of the other models. We also find that adversarial examples generated to fool gradient-regularized…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)
