TL;DR
This paper introduces a novel training strategy with regularizers to enhance the attributional robustness of deep neural networks, making their explanations more consistent under adversarial attacks across multiple datasets.
Contribution
The paper proposes two new regularizers designed to preserve a model's attribution maps during attacks. The resulting training strategy surpasses existing attributional robustness methods.
Findings
Achieves an improvement of approximately 3% to 9% over state-of-the-art attributional robustness methods on attribution robustness measures
Effective across datasets: MNIST, FMNIST, Flower, GTSRB
Provides a systematic approach to improve trustworthiness of explanations
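This summary does not say which attribution robustness measures the paper reports. As an illustration only, the sketch below computes top-k intersection, a measure commonly used in the attributional robustness literature: the fraction of the k most important pixels that two attribution maps share. The function name `topk_intersection`, the use of absolute attribution values, and the toy arrays are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def topk_intersection(attr_a, attr_b, k):
    """Fraction of the k highest-magnitude entries shared by two attribution maps.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    # Rank pixels by attribution magnitude (an assumption; some works rank raw values).
    a = np.abs(np.asarray(attr_a, dtype=float)).ravel()
    b = np.abs(np.asarray(attr_b, dtype=float)).ravel()
    if a.shape != b.shape:
        raise ValueError("attribution maps must have the same number of entries")
    if not 1 <= k <= a.size:
        raise ValueError("k must be between 1 and the number of entries")
    # argpartition places the indices of the k largest entries in the last k slots.
    top_a = set(np.argpartition(a, -k)[-k:].tolist())
    top_b = set(np.argpartition(b, -k)[-k:].tolist())
    return len(top_a & top_b) / k


# Toy 2x2 "attribution maps" for a clean image and a slightly perturbed one.
clean = np.array([[0.9, 0.1], [0.8, 0.2]])
perturbed = np.array([[0.9, 0.7], [0.1, 0.2]])
score = topk_intersection(clean, perturbed, k=2)
```

A score of 1.0 means the two maps agree on their k most important pixels, while a score near 0 means the explanation changed substantially even if the prediction did not. Ties between equal attribution values are broken arbitrarily in this sketch.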
Abstract
Deep neural networks are the default choice of learning models for computer vision tasks. Extensive work has been carried out in recent years on explaining deep models for vision tasks such as classification. However, recent work has shown that it is possible for these models to produce substantially different attribution maps even when two very similar images are given to the network, raising serious questions about trustworthiness. To address this issue, we propose a robust attribution training strategy to improve attributional robustness of deep neural networks. Our method carefully analyzes the requirements for attributional robustness and introduces two new regularizers that preserve a model's attribution map during attacks. Our method surpasses state-of-the-art attributional robustness methods by a margin of approximately 3% to 9% in terms of attribution robustness measures on…
