TL;DR
This paper introduces ART (robust attribution training), a training method that makes model explanations more robust by minimizing an upper bound on attributional vulnerability, expressed in terms of the spatial correlation between an input image and its explanation map. The result is improved attributional robustness and better object localization.
Contribution
The paper proposes an attributional robustness training method that spatially aligns the input image with its explanation map using a soft-margin triplet loss, reporting state-of-the-art attributional robustness and improved weakly supervised object localization.
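The triplet-loss component can be illustrated with a minimal sketch. The abstract names a soft-margin triplet loss; the standard soft-margin form is log(1 + exp(d_pos - d_neg)), where d_pos is the anchor-to-positive distance and d_neg is the anchor-to-negative distance. The function name below is hypothetical, and which quantities ART uses as anchor, positive, and negative is not stated in this summary, so the distances are passed in as plain numbers.

```python
import math


def soft_margin_triplet_loss(d_pos: float, d_neg: float) -> float:
    """Soft-margin triplet loss: log(1 + exp(d_pos - d_neg)).

    d_pos: distance between the anchor and the positive sample.
    d_neg: distance between the anchor and the negative sample.
    Both are assumed to be precomputed by whatever distance the model uses.
    """
    x = d_pos - d_neg
    # Numerically stable softplus: avoids overflow in exp() for large x.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # The loss shrinks as the negative moves farther away than the positive.
    print(soft_margin_triplet_loss(0.1, 0.9))
    print(soft_margin_triplet_loss(0.9, 0.1))
```

Unlike a hinge-style triplet loss, this form has no margin hyperparameter and never goes exactly to zero, so it keeps pushing the positive closer than the negative throughout training.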
Findings
Achieves 6-18% improvement in attributional robustness on standard datasets.
Demonstrates improved weakly supervised object localization performance.
Provides an upper bound for attributional vulnerability based on spatial correlation.
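To make the notion of spatial correlation between an image and its explanation map concrete, here is a small sketch that uses cosine similarity over flattened pixel grids as a stand-in. This is an assumption for illustration only: the summary does not give the paper's exact definition, and the function name is hypothetical.

```python
import math


def spatial_cosine_similarity(image, attribution):
    """Cosine similarity between two equally sized 2-D grids.

    image, attribution: lists of rows of floats (for example, a grayscale
    image and an attribution map of the same height and width).
    Returns 0.0 if either grid is all zeros.
    """
    flat_img = [p for row in image for p in row]
    flat_att = [p for row in attribution for p in row]
    if len(flat_img) != len(flat_att):
        raise ValueError("image and attribution must have the same size")

    dot = sum(a * b for a, b in zip(flat_img, flat_att))
    norm_img = math.sqrt(sum(a * a for a in flat_img))
    norm_att = math.sqrt(sum(b * b for b in flat_att))
    if norm_img == 0.0 or norm_att == 0.0:
        return 0.0
    return dot / (norm_img * norm_att)


if __name__ == "__main__":
    img = [[1.0, 2.0], [3.0, 4.0]]
    # A map that matches the image spatially scores close to 1.
    print(spatial_cosine_similarity(img, img))
    # A map that highlights a disjoint region scores 0.
    print(spatial_cosine_similarity([[1.0, 0.0], [0.0, 0.0]],
                                    [[0.0, 1.0], [0.0, 0.0]]))
```

Under this proxy, an explanation map that is well aligned with the image scores near 1, which matches the intuition that tighter image-explanation alignment corresponds to a smaller vulnerability bound.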
Abstract
Interpretability is an emerging area of research in trustworthy machine learning. Safe deployment of machine learning systems requires that the prediction and its explanation be reliable and robust. Recently, it has been shown that explanations can be manipulated easily by adding visually imperceptible perturbations to the input while keeping the model's prediction intact. In this work, we study the problem of attributional robustness (i.e., models having robust explanations) by showing an upper bound for attributional vulnerability in terms of spatial correlation between the input image and its explanation map. We propose a training methodology that learns robust features by minimizing this upper bound using soft-margin triplet loss. Our methodology of robust attribution training (ART) achieves the new state-of-the-art attributional robustness measure by a margin of…
