Tricking Adversarial Attacks To Fail
Blerta Lindqvist

TL;DR
This paper introduces Target Training, a novel adversarial defense that redirects untargeted gradient-based attacks towards designated target classes, enabling accurate classification without prior attack knowledge.
Contribution
The paper proposes a new defense method that minimally alters classifiers and effectively redirects untargeted attacks, outperforming existing defenses on CIFAR10.
Findings
Achieves 86.2% accuracy against the CW-L2 attack on CIFAR10
Eliminates the need for prior attack knowledge and for generating adversarial samples
Outperforms unsecured classifiers on non-adversarial samples
Abstract
Recent adversarial defense approaches have failed. Untargeted gradient-based attacks cause classifiers to choose any wrong class. Our novel white-box defense tricks untargeted attacks into becoming attacks targeted at designated target classes. From these target classes, we can derive the real classes. Our Target Training defense tricks the minimization at the core of untargeted, gradient-based adversarial attacks: minimize the sum of (1) perturbation and (2) classifier adversarial loss. Target Training changes the classifier minimally, and trains it with additional duplicated points (at 0 distance) labeled with designated classes. These differently-labeled duplicated samples minimize both terms (1) and (2) of the minimization, steering attack convergence to samples of designated classes, from which correct classification is derived. Importantly, Target Training eliminates the need to…
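The duplicate-and-relabel idea described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not spell out: that a classifier with k real classes is given k extra "designated" output classes, that the duplicate of a sample with label y is labeled y + k, and that the real class is recovered by summing each real/designated pair of output probabilities. The function names `target_training_augment` and `derive_real_class` are hypothetical and are not taken from the paper.

```python
import numpy as np


def target_training_augment(x, y, num_classes):
    """Duplicate every sample at zero distance and relabel the copies.

    Assumption: the copy of a sample with real label y receives the
    designated label y + num_classes, so the classifier would be trained
    with 2 * num_classes outputs.
    """
    x_aug = np.concatenate([x, x.copy()], axis=0)
    y_aug = np.concatenate([y, y + num_classes], axis=0)
    return x_aug, y_aug


def derive_real_class(probs, num_classes):
    """Fold a (n, 2 * num_classes) probability matrix back to real classes.

    Assumption: column i (real class) and column i + num_classes
    (designated class) both stand for real class i, so their
    probabilities are summed before taking the argmax.
    """
    folded = probs[:, :num_classes] + probs[:, num_classes:]
    return folded.argmax(axis=1)


# Toy data: 3 samples with 2 features each, k = 3 real classes.
k = 3
x = np.arange(6.0).reshape(3, 2)
y = np.array([0, 2, 1])
x_aug, y_aug = target_training_augment(x, y, k)

# A hypothetical classifier output in which an attack has pushed most of
# the probability mass onto designated class 3, which pairs with real class 0.
probs = np.array([[0.10, 0.10, 0.10, 0.60, 0.05, 0.05]])
predicted = derive_real_class(probs, k)
```

Under these assumptions, an attack that converges on a designated class still yields the correct real class after folding, which is the behavior the abstract attributes to Target Training.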
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Bacillus and Francisella bacterial research
