Joint Universal Adversarial Perturbations with Interpretations
Liang-bo Ning, Zeyu Dai, Wenqi Fan, Jingran Su, Chao Pan, Luning Wang, Qing Li

TL;DR
This paper introduces a novel framework for generating universal adversarial perturbations that simultaneously deceive deep neural networks and mislead their interpretability methods, highlighting a new security concern.
Contribution
The paper is the first to propose and empirically validate joint universal adversarial perturbations that target both DNN predictions and their interpretation maps.
Findings
JUAP (joint universal adversarial perturbation) effectively fools DNN classifiers across datasets.
JUAP also misleads the attribution maps produced by interpretation methods, so the explanations no longer reveal the attack.
First demonstration of a joint attack on models and their explanations.
Abstract
Deep neural networks (DNNs) have significantly boosted performance on many challenging tasks. Despite this progress, DNNs have also been shown to be vulnerable. Recent studies have shown that adversaries can manipulate the predictions of DNNs by adding a universal adversarial perturbation (UAP) to benign samples. At the same time, increasing effort has gone into helping users understand and explain the inner workings of DNNs by highlighting the most informative parts of a sample with respect to its prediction (i.e., attribution maps). Moreover, we are the first to find empirically that the attribution maps of benign and adversarial examples differ significantly, a discrepancy that could be used to detect universal adversarial perturbations and so defend against adversarial attacks. This finding motivates us to further investigate a new research problem: whether there exist…
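The two ingredients named in the abstract, one perturbation shared by every input and an attribution map computed per input, can be illustrated with a toy example. The sketch below is not the paper's JUAP method: it assumes a hand-written linear classifier, uses gradient-times-input as a stand-in attribution map, and picks the perturbation by a simple sign rule. All names (apply_uap, fooling_rate, attribution_gap), the weights, and the data are hypothetical and chosen only for illustration.

```python
from typing import List

Vector = List[float]


def clip(v: float, lo: float, hi: float) -> float:
    """Clamp a scalar to the closed interval [lo, hi]."""
    return max(lo, min(hi, v))


def apply_uap(x: Vector, delta: Vector, eps: float) -> Vector:
    """Add one shared perturbation to a sample.

    delta is first limited to an L-infinity budget of eps, then the
    perturbed sample is kept in the valid input range [0, 1].
    """
    return [clip(xi + clip(di, -eps, eps), 0.0, 1.0) for xi, di in zip(x, delta)]


def score(w: Vector, b: float, x: Vector) -> float:
    """Score of a toy linear classifier (assumption: stands in for a DNN)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w: Vector, b: float, x: Vector) -> int:
    """Binary prediction: class 1 if the score is positive, else class 0."""
    return 1 if score(w, b, x) > 0 else 0


def attribution(w: Vector, x: Vector) -> Vector:
    """Gradient-times-input attribution; for a linear score the gradient is w."""
    return [wi * xi for wi, xi in zip(w, x)]


def fooling_rate(w: Vector, b: float, samples: List[Vector],
                 delta: Vector, eps: float) -> float:
    """Fraction of samples whose prediction changes under the shared delta."""
    flipped = sum(
        predict(w, b, x) != predict(w, b, apply_uap(x, delta, eps))
        for x in samples
    )
    return flipped / len(samples)


def attribution_gap(w: Vector, x: Vector, delta: Vector, eps: float) -> float:
    """L1 distance between benign and adversarial attribution maps."""
    benign = attribution(w, x)
    adversarial = attribution(w, apply_uap(x, delta, eps))
    return sum(abs(p - q) for p, q in zip(benign, adversarial))


# Hypothetical toy model and data (not from the paper).
w = [1.0, -1.0, 0.5, 0.0]
b = -0.2
samples = [
    [0.6, 0.2, 0.4, 0.5],
    [0.5, 0.1, 0.2, 0.9],
    [0.2, 0.7, 0.1, 0.3],
]
eps = 0.3

# One perturbation for all samples: step against the sign of each weight,
# which lowers the score of every input (a simple rule for linear models).
delta = [-eps if wi > 0 else eps if wi < 0 else 0.0 for wi in w]

if __name__ == "__main__":
    print("fooling rate:", fooling_rate(w, b, samples, delta, eps))
    for x in samples:
        print("attribution gap:", attribution_gap(w, x, delta, eps))
```

In this toy setting the same delta that flips predictions also shifts the attribution map by a measurable amount, which is the kind of discrepancy the abstract says could be used for detection. A joint attack in the paper's sense would additionally have to keep that gap small, which this sketch does not attempt.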
Taxonomy
Topics: Adversarial Robustness in Machine Learning
