Recovering Localized Adversarial Attacks
Jan Philip Göpfert, Heiko Wersing, Barbara Hammer

TL;DR
This paper evaluates the effectiveness of explainability methods in identifying adversarially manipulated regions in images classified by deep neural networks, highlighting LIME's superior performance in such scenarios.
Contribution
The study provides a comparative analysis of explainers' ability to detect adversarial attack regions, emphasizing LIME's effectiveness in this context.
Findings
LIME outperforms other explainers in identifying adversarial regions.
Explainability methods can be used to detect adversarial attacks.
Explanation quality for deep neural networks varies significantly across explainability methods.
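To make "identifying adversarial regions" concrete, the sketch below scores how well an explainer's saliency map recovers a localized attack. It is only an illustration under stated assumptions: the summary above does not say which metric the paper uses, so intersection-over-union (IoU) is a stand-in, and the names `top_k_mask` and `recovery_iou` are hypothetical. The saliency map is assumed to be a 2D array of per-pixel relevance scores (for example, derived from a LIME explanation), and the attack mask a boolean array marking the manipulated pixels.

```python
import numpy as np


def top_k_mask(saliency, k):
    """Return a boolean mask marking the k highest-scoring pixels of a 2D map."""
    flat = saliency.ravel()
    # argpartition places the k largest values in the last k positions.
    idx = np.argpartition(flat, -k)[-k:]
    mask = np.zeros(flat.shape, dtype=bool)
    mask[idx] = True
    return mask.reshape(saliency.shape)


def recovery_iou(saliency, attack_mask):
    """IoU between the attacked region and equally many most-salient pixels.

    Assumption: IoU is an illustrative choice, not the paper's stated metric.
    """
    attack_mask = np.asarray(attack_mask, dtype=bool)
    k = int(attack_mask.sum())
    if k == 0:
        raise ValueError("attack_mask marks no pixels")
    predicted = top_k_mask(np.asarray(saliency, dtype=float), k)
    intersection = np.logical_and(predicted, attack_mask).sum()
    union = np.logical_or(predicted, attack_mask).sum()
    return float(intersection) / float(union)


# Toy example: an 8x8 image with a 2x2 manipulated patch.
attack = np.zeros((8, 8), dtype=bool)
attack[2:4, 2:4] = True

# An explainer that highlights exactly the attacked patch.
perfect_saliency = attack.astype(float)

# An explainer that highlights an unrelated 2x2 patch.
wrong_saliency = np.zeros((8, 8))
wrong_saliency[5:7, 5:7] = 1.0

print(recovery_iou(perfect_saliency, attack))  # 1.0
print(recovery_iou(wrong_saliency, attack))    # 0.0
```

Thresholding the saliency map at exactly as many pixels as the attack touched keeps the score comparable across explainers with different output scales, though it assumes the size of the attacked region is known in advance.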
Abstract
Deep convolutional neural networks have achieved great successes over recent years, particularly in the domain of computer vision. They are fast, convenient, and -- thanks to mature frameworks -- relatively easy to implement and deploy. However, their reasoning is hidden inside a black box, in spite of a number of proposed approaches that try to provide human-understandable explanations for the predictions of neural networks. It is still a matter of debate which of these explainers are best suited for which situations, and how to quantitatively evaluate and compare them. In this contribution, we focus on the capabilities of explainers for convolutional deep neural networks in an extreme situation: a setting in which humans and networks fundamentally disagree. Deep neural networks are susceptible to adversarial attacks that deliberately modify input samples to mislead a neural network's…
Taxonomy
Methods: Local Interpretable Model-Agnostic Explanations (LIME)
