Interpreting and Improving Adversarial Robustness of Deep Neural Networks with Neuron Sensitivity
Chongzhi Zhang, Aishan Liu, Xianglong Liu, Yitao Xu, Hang Yu, Yuqing Ma, Tianlin Li

TL;DR
This paper explores how neuron sensitivity relates to adversarial robustness in deep neural networks, proposing a method to improve robustness by stabilizing sensitive neuron behaviors and validating the approach through extensive experiments.
Contribution
It introduces a novel perspective on adversarial robustness based on neuron sensitivity and proposes a method that enhances robustness by constraining the behavior of sensitive neurons to stay similar between benign and adversarial examples.
Findings
Sensitive neurons significantly influence adversarial predictions.
Constraining neuron sensitivities improves model robustness.
State-of-the-art adversarial training reduces neuron sensitivities.
Abstract
Deep neural networks (DNNs) are vulnerable to adversarial examples, in which inputs with imperceptible perturbations mislead DNNs into incorrect results. Despite the potential risk they pose, adversarial examples are also valuable for providing insights into the weaknesses and blind spots of DNNs. Thus, interpreting a DNN in the adversarial setting aims to explain the rationale behind its decision-making process and to build a deeper understanding that leads to better practical applications. To address this issue, we try to explain adversarial robustness for deep models from a new perspective of neuron sensitivity, which is measured by the intensity of neuron behavior variation between benign and adversarial examples. In this paper, we first draw the close connection between adversarial robustness and neuron sensitivities, as sensitive neurons make the most non-trivial contributions to model…
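The abstract describes neuron sensitivity as the intensity of a neuron's behavior variation between benign and adversarial examples. The sketch below is one possible way to make that idea concrete, not the paper's definition: it assumes sensitivity is the mean absolute difference between a neuron's activations on paired benign and adversarial inputs. The exact metric is not given in this summary, and the function names `neuron_sensitivity` and `top_k_sensitive` are hypothetical.

```python
from typing import List, Sequence


def neuron_sensitivity(
    benign_acts: Sequence[Sequence[float]],
    adv_acts: Sequence[Sequence[float]],
) -> List[float]:
    """Per-neuron mean absolute activation change (an assumed metric).

    Each argument holds one row per example and one column per neuron,
    with row i of adv_acts being the adversarial counterpart of row i
    of benign_acts.
    """
    if not benign_acts or len(benign_acts) != len(adv_acts):
        raise ValueError("need the same non-zero number of examples")
    num_neurons = len(benign_acts[0])
    totals = [0.0] * num_neurons
    for b_row, a_row in zip(benign_acts, adv_acts):
        if len(b_row) != num_neurons or len(a_row) != num_neurons:
            raise ValueError("every row must have one value per neuron")
        for j, (b, a) in enumerate(zip(b_row, a_row)):
            totals[j] += abs(b - a)
    return [t / len(benign_acts) for t in totals]


def top_k_sensitive(sensitivity: Sequence[float], k: int) -> List[int]:
    """Indices of the k neurons with the largest sensitivity, largest first."""
    ranked = sorted(range(len(sensitivity)),
                    key=lambda j: sensitivity[j], reverse=True)
    return ranked[:k]


# Toy activations for 2 examples and 3 neurons (made-up numbers).
benign = [[1.0, 0.5, 2.0], [0.0, 0.5, 1.0]]
adversarial = [[1.1, 2.5, 2.0], [0.1, 1.5, 1.0]]

sens = neuron_sensitivity(benign, adversarial)
most_sensitive = top_k_sensitive(sens, 2)
```

In this toy data the second neuron changes the most under perturbation, so it ranks first. A stabilizing method in the spirit of the summary would then penalize the activation change of such top-ranked neurons during training; how the paper does this is not specified here.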
Taxonomy
Methods: Interpretability
