Training Ensembles to Detect Adversarial Examples
Alexander Bagnall, Razvan Bunescu, Gordon Stewart

TL;DR
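This paper introduces an ensemble method for detecting adversarial examples: the ensemble members are trained to classify benign inputs accurately while agreeing as little as possible on inputs outside the training distribution, so that disagreement among members can signal attacks such as DeepFool and C&W. The method is evaluated on MNIST and CIFAR-10.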
Contribution
The paper presents a new ensemble training method for adversarial detection that combines two objectives: low classification error on benign examples and minimal agreement among ensemble members on examples outside the training distribution.
Findings
Detects and classifies adversarial examples generated by several attacks, including DeepFool and C&W
Evaluated on the MNIST and CIFAR-10 datasets
Tested against oblivious, white-box, and black-box adversaries
Abstract
We propose a new ensemble method for detecting and classifying adversarial examples generated by state-of-the-art attacks, including DeepFool and C&W. Our method works by training the members of an ensemble to have low classification error on random benign examples while simultaneously minimizing agreement on examples outside the training distribution. We evaluate on both MNIST and CIFAR-10, against oblivious and both white- and black-box adversaries.
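The two training objectives named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the loss, so the agreement measure (mean pairwise dot product of the members' predicted distributions), the `weight` parameter, the detection `threshold`, and all function names here are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def benign_loss(probs, labels):
    """Mean cross-entropy of every member on benign examples.

    probs:  (members, batch, classes) predicted distributions
    labels: (batch,) integer class labels
    """
    batch = probs.shape[1]
    picked = probs[:, np.arange(batch), labels]  # (members, batch)
    return float(-np.log(picked + 1e-12).mean())

def agreement(probs):
    """Per-example agreement, assumed here to be the mean pairwise
    dot product of member distributions (near 1 when all members are
    confident in the same class, near 0 when they pick different ones)."""
    members = probs.shape[0]
    total = np.zeros(probs.shape[1])
    pairs = 0
    for i in range(members):
        for j in range(i + 1, members):
            total += (probs[i] * probs[j]).sum(axis=-1)
            pairs += 1
    return total / pairs

def ensemble_loss(benign_logits, labels, ood_logits, weight=1.0):
    """Hypothetical combined objective: low error on benign data plus
    a penalty on agreement for out-of-distribution inputs."""
    fit = benign_loss(softmax(benign_logits), labels)
    agree = float(agreement(softmax(ood_logits)).mean())
    return fit + weight * agree

def detect(logits, threshold=0.5):
    """Flag inputs whose member agreement falls below an assumed
    threshold; also return the ensemble's averaged-probability label."""
    probs = softmax(logits)
    flagged = agreement(probs) < threshold
    label = probs.mean(axis=0).argmax(axis=-1)
    return flagged, label
```

In this sketch the out-of-distribution batch passed to `ensemble_loss` could be, for example, random noise images; which inputs the paper actually uses for that term is not stated in the abstract. At test time, `detect` treats low agreement as evidence that an input is adversarial and otherwise returns the ensemble's prediction.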
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Bacillus and Francisella bacterial research
