Towards More Robust Interpretation via Local Gradient Alignment
Sunghwan Joo, Seokhyeon Jeong, Juyeon Heo, Adrian Weller, Taesup Moon

TL;DR
This paper introduces a normalization-invariant approach to improving the robustness of neural network interpretation methods, and shows that it yields more robust feature attributions on large-scale datasets such as ImageNet-100 without sacrificing accuracy.
Contribution
It proposes a combined gradient regularization method based on $\ell_2$ and cosine distance criteria, addressing the normalization issues that affect the robustness of feature attributions.
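To make the difference between the two criteria concrete, here is a minimal sketch in plain Python. It relies only on the fact that multiplying a function by a positive constant `c` multiplies its gradients by `c` while leaving the normalized gradients unchanged. The gradient vectors, the constant `c`, and the helper names are hypothetical values chosen for illustration; this is not the authors' training objective.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def scale(vec, c):
    """Multiply every entry of vec by the scalar c."""
    return [c * x for x in vec]

# Hypothetical gradients at an input and at a nearby perturbed input.
grad_x = [1.0, 2.0, -0.5]
grad_x_perturbed = [0.8, 2.1, -0.4]

# Assumption: the function is rescaled by c > 0, so both gradients
# are rescaled by c and their normalized versions stay the same.
c = 10.0

l2_before = l2_distance(grad_x, grad_x_perturbed)
l2_after = l2_distance(scale(grad_x, c), scale(grad_x_perturbed, c))

cos_before = cosine_distance(grad_x, grad_x_perturbed)
cos_after = cosine_distance(scale(grad_x, c), scale(grad_x_perturbed, c))

print("l2 distance grows with the scale:", l2_before, l2_after)
print("cosine distance is unchanged:", cos_before, cos_after)
```

In this sketch the $\ell_2$ distance between the two gradients grows by the factor `c` even though the normalized attributions are identical, whereas the cosine distance does not change.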
Findings
Models trained with the proposed method yield more robust interpretations.
The approach is effective on large-scale datasets like ImageNet-100.
It maintains model accuracy while improving interpretability robustness.
Abstract
Neural network interpretation methods, particularly feature attribution methods, are known to be fragile with respect to adversarial input perturbations. To address this, several methods for enhancing the local smoothness of the gradient while training have been proposed for attaining robust feature attributions. However, the lack of considering the normalization of the attributions, which is essential in their visualizations, has been an obstacle to understanding and improving the robustness of feature attribution methods. In this paper, we provide new insights by taking such normalization into account. First, we show that for every non-negative homogeneous neural network, a naive $\ell_2$-robust criterion for gradients is not normalization invariant, which means that two functions with the same normalized gradient can have different values. Second, we formulate a…
Taxonomy
Topics: COVID-19 diagnosis using AI · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
