Adversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classifiers

Jacob M. Springer; Melanie Mitchell; Garrett T. Kenyon

arXiv:2102.05110 · cs.LG · February 11, 2021 · 1 citation

Open Access

TL;DR

This paper investigates the entanglement of robust and non-robust features in neural networks, revealing how small, non-semantic patterns contribute to adversarial vulnerability and how robust classifiers can generate more transferable adversarial examples.

Contribution

It extends prior work (Ilyas et al., 2019) by analyzing the input patterns that give rise to non-robust features, showing that these features respond to small patterns entangled with robust ones. It also demonstrates that robust classifiers are effective at generating transferable adversarial examples.

Findings

01. Non-robust features respond to small, "non-semantic" patterns that are entangled with larger, robust patterns.
02. Adversarial examples can be created by minimal perturbations to these small patterns.
03. Robust classifiers are more effective at generating transferable adversarial examples.
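Finding 03 concerns transferability: whether adversarial examples crafted against one (source) model also fool a different (target) model. One common way to quantify this is a transfer rate: among the adversarial examples that fool the source model, the fraction that the target model also misclassifies. The sketch below is a minimal illustration under stated assumptions, not the paper's evaluation protocol. It assumes models are plain Python callables returning an integer label; the function `transfer_rate` and the two threshold "models" are hypothetical, and the input values are made up.

```python
from typing import Callable, Sequence, TypeVar

X = TypeVar("X")


def transfer_rate(
    source: Callable[[X], int],
    target: Callable[[X], int],
    adv_inputs: Sequence[X],
    labels: Sequence[int],
) -> float:
    """Fraction of source-successful adversarial inputs that also fool target.

    Only examples that actually fool the source model are counted, so the
    result measures transfer rather than the source attack's success rate.
    """
    fooled_source = [(x, y) for x, y in zip(adv_inputs, labels) if source(x) != y]
    if not fooled_source:
        return 0.0
    fooled_both = sum(1 for x, y in fooled_source if target(x) != y)
    return fooled_both / len(fooled_source)


# Hypothetical 1-D "models": each predicts 1 when the input exceeds a threshold.
def source_model(v: float) -> int:
    return int(v > 0.5)


def target_model(v: float) -> int:
    return int(v > 0.4)


adv = [0.45, 0.30, 0.55, 0.35]  # made-up perturbed inputs
true_labels = [1, 1, 0, 1]
print(transfer_rate(source_model, target_model, adv, true_labels))  # → 0.75
```

Here all four inputs fool the source model, but the target model still classifies 0.45 correctly, giving 3/4. Conditioning on source success is a design choice; some evaluations instead report the target's raw error on all crafted examples.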

Abstract

Neural networks trained on visual data are well-known to be vulnerable to often imperceptible adversarial perturbations. The reasons for this vulnerability are still being debated in the literature. Recently Ilyas et al. (2019) showed that this vulnerability arises, in part, because neural network classifiers rely on highly predictive but brittle "non-robust" features. In this paper we extend the work of Ilyas et al. by investigating the nature of the input patterns that give rise to these features. In particular, we hypothesize that in a neural network trained in a standard way, non-robust features respond to small, "non-semantic" patterns that are typically entangled with larger, robust patterns, known to be more human-interpretable, as opposed to solely responding to statistical artifacts in a dataset. Thus, adversarial examples can be formed via minimal perturbations to these small,…
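The claim that adversarial examples can be formed via minimal perturbations can be illustrated with a toy example. The sketch below is not the method of Springer et al.; it assumes a hypothetical linear classifier with score w·x + b and applies a fast-gradient-sign (FGSM-style) step, x' = x − ε·y·sign(w), where y ∈ {−1, +1} is the true label. All weights, inputs, and ε are made up: 10 large weights sit on a visible "pattern," and 9,990 tiny weights of alternating sign are spread over the rest of the input.

```python
# Toy FGSM-style attack on a hypothetical linear classifier (illustration only).

def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def score(w, b, x):
    """Linear decision score w.x + b; the predicted label is its sign."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def fgsm_linear(w, x, y, eps):
    """Shift every coordinate by eps in the direction that hurts label y (+1/-1).

    For a linear score the gradient w.r.t. x is w, so the L-infinity-bounded
    step that lowers y * score the most is -eps * y * sign(w).
    """
    return [xi - eps * y * sign(wi) for wi, xi in zip(w, x)]


# Hypothetical weights: 10 large weights on a visible pattern, plus 9,990
# tiny alternating-sign weights spread over the rest of the input.
dim = 10_000
w = [1.0] * 10 + [0.01 if i % 2 == 0 else -0.01 for i in range(dim - 10)]
b = -5.0  # chosen so a uniform gray input (all 0.5) scores about 0

# Clean input: gray background with the 10-pixel pattern brightened by 0.1.
x = [0.6] * 10 + [0.5] * (dim - 10)
y = 1
eps = 0.01  # ten times smaller than the pattern's contrast

x_adv = fgsm_linear(w, x, y, eps)
print(sign(score(w, b, x)))      # → 1  (clean input classified correctly)
print(sign(score(w, b, x_adv)))  # → -1 (prediction flipped)
```

The score drops by ε·‖w‖₁ = 0.01 × 109.9 ≈ 1.1, and roughly 91% of that drop comes from the tiny weights, so many individually negligible input directions outvote the clearly visible pattern. This only loosely echoes the abstract's picture of small patterns that a classifier responds to alongside larger, more interpretable ones; real networks are nonlinear and the paper's analysis is not reproduced here.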

Taxonomy

Topics: Adversarial Robustness in Machine Learning