Adversarial Examples Are Not Bugs, They Are Superposition
Liv Gorton, Owen Lewis

TL;DR
This paper argues that superposition, a concept from mechanistic interpretability, may be a major cause of adversarial examples in neural networks, and supports the hypothesis with theoretical and experimental evidence.
Contribution
The paper presents four lines of evidence that superposition is a major cause of adversarial phenomena, extending prior arguments by Elhage et al. (2022) with new experimental results.
Findings
Superposition can explain adversarial phenomena theoretically.
In toy models, intervening on superposition controls robustness.
In toy models, intervening on robustness (via adversarial training) controls superposition.
Adversarial training affects superposition in ResNet18.
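As a rough illustration of why superposition could create adversarial vulnerability, the sketch below packs more features than dimensions into a linear toy model and measures how far an attacker can move one feature's readout by perturbing only the other features. It is not the authors' experimental setup: the function name `interference_attack_gain`, the random unit-norm feature directions, and the purely linear readout are all assumptions made for illustration.

```python
import numpy as np


def interference_attack_gain(n_features: int, n_dims: int, seed: int = 0) -> float:
    """Worst-case shift in feature 0's linear readout per unit L2 perturbation
    applied only to the OTHER input features (hypothetical toy setup)."""
    rng = np.random.default_rng(seed)
    if n_features <= n_dims:
        # No superposition: give every feature its own orthonormal direction.
        q, _ = np.linalg.qr(rng.normal(size=(n_dims, n_dims)))
        W = q[:, :n_features]
    else:
        # Superposition: more features than dimensions, so the unit-norm
        # feature directions cannot all be orthogonal.
        W = rng.normal(size=(n_dims, n_features))
        W /= np.linalg.norm(W, axis=0, keepdims=True)

    # Linear encode/decode: y = W^T W x, so gram[i, j] = w_i . w_j is the
    # interference that feature j leaks into the readout of feature i.
    gram = W.T @ W
    off_target = np.delete(gram[0], 0)

    # For a perturbation delta with delta[0] = 0 and ||delta||_2 <= 1, the
    # largest achievable |off_target . delta[1:]| is the L2 norm of off_target.
    return float(np.linalg.norm(off_target))


if __name__ == "__main__":
    print("no superposition (8 features, 8 dims):", interference_attack_gain(8, 8))
    print("superposition (64 features, 8 dims):  ", interference_attack_gain(64, 8))
```

Without superposition the gain is zero up to floating-point error, because no other feature overlaps with feature 0. With many features sharing few dimensions, the overlaps accumulate and a small perturbation spread across unrelated features can shift the readout substantially.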
Abstract
Adversarial examples -- inputs with imperceptible perturbations that fool neural networks -- remain one of deep learning's most perplexing phenomena despite nearly a decade of research. While numerous defenses and explanations have been proposed, there is no consensus on the fundamental mechanism. One underexplored hypothesis is that superposition, a concept from mechanistic interpretability, may be a major contributing factor, or even the primary cause. We present four lines of evidence in support of this hypothesis, greatly extending prior arguments by Elhage et al. (2022): (1) superposition can theoretically explain a range of adversarial phenomena, (2) in toy models, intervening on superposition controls robustness, (3) in toy models, intervening on robustness (via adversarial training) controls superposition, and (4) in ResNet18, intervening on robustness (via adversarial training)…
