Towards Robust Vision Transformer via Masked Adaptive Ensemble
Fudong Lin, Jiadong Lou, Xu Yuan, and Nian-Feng Tzeng

TL;DR
This paper introduces a novel Vision Transformer architecture with an adaptive ensemble and detection mechanism to improve robustness against adversarial attacks while maintaining high standard accuracy.
Contribution
It proposes a new ViT design in which a detector and a classifier are bridged by an adaptive ensemble, improving the trade-off between standard accuracy and robustness, and introduces a patch masking technique for better defense against adaptive attacks.
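The summary above does not spell out how the ensemble or the masking works, so the following is only a minimal sketch of one way a detector-weighted ensemble with patch masking could look. Everything in it is an assumption made for illustration: the function names `mask_patches` and `adaptive_ensemble`, the zeroing-out masking scheme, and the linear blend by the detector's adversarial probability `p_adv` are hypothetical and are not taken from the paper.

```python
import random


def mask_patches(patches, mask_ratio, rng):
    """Zero out a random subset of patch vectors (hypothetical masking scheme)."""
    n_mask = int(len(patches) * mask_ratio)
    masked_idx = set(rng.sample(range(len(patches)), n_mask))
    return [[0.0] * len(p) if i in masked_idx else list(p)
            for i, p in enumerate(patches)]


def adaptive_ensemble(clean_feat, robust_feat, p_adv):
    """Blend two feature vectors by the detector's adversarial probability.

    Assumption: p_adv = 0 trusts the clean branch fully, p_adv = 1 trusts
    the robust branch fully, and values in between interpolate linearly.
    """
    if not 0.0 <= p_adv <= 1.0:
        raise ValueError("p_adv must lie in [0, 1]")
    return [(1.0 - p_adv) * c + p_adv * r
            for c, r in zip(clean_feat, robust_feat)]


rng = random.Random(0)  # fixed seed so the masking is reproducible
patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
masked = mask_patches(patches, mask_ratio=0.5, rng=rng)
blended = adaptive_ensemble([1.0, 0.0], [0.0, 1.0], p_adv=0.25)
```

In this sketch the random masking means an attacker cannot know in advance which patches the model will see, which is one plausible reading of why masking would help against adaptive attacks.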
Findings
Achieves 90.3% standard accuracy on CIFAR-10.
Attains 49.8% adversarial robustness.
Outperforms existing methods in robustness and accuracy trade-offs.
Abstract
Adversarial training (AT) can help improve the robustness of Vision Transformers (ViT) against adversarial attacks by intentionally injecting adversarial examples into the training data. However, this way of adversarial injection inevitably incurs standard accuracy degradation to some extent, thereby calling for a trade-off between standard accuracy and robustness. Besides, the prominent AT solutions are still vulnerable to adaptive attacks. To tackle such shortcomings, this paper proposes a novel ViT architecture, including a detector and a classifier bridged by our newly developed adaptive ensemble. Specifically, we empirically discover that detecting adversarial examples can benefit from the Guided Backpropagation technique. Driven by this discovery, a novel Multi-head Self-Attention (MSA) mechanism is introduced to enhance our detector to sniff adversarial examples. Then, a…
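To make the adversarial injection step of AT concrete, here is a minimal sketch that generates adversarial examples and appends them to a training batch. It assumes a toy logistic-regression model and uses the single-step Fast Gradient Sign Method (FGSM) as a stand-in attack; the abstract does not say which attack the authors use, and the names `fgsm_example` and `augment_with_adversarial` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fgsm_example(x, y, w, b, eps):
    """Perturb x by eps along the sign of the loss gradient (FGSM)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    # For a logistic model, d(cross-entropy)/dx_i = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


def augment_with_adversarial(batch, w, b, eps):
    """Return the clean batch plus one adversarial copy of each sample."""
    adv = [(fgsm_example(x, y, w, b, eps), y) for x, y in batch]
    return batch + adv


# Toy weights and data, chosen only for illustration.
w, b = [2.0, -1.0], 0.0
batch = [([1.0, -1.0], 1), ([-0.5, 0.5], 0)]
augmented = augment_with_adversarial(batch, w, b, eps=0.1)
```

Training on `augmented` instead of `batch` is the injection the abstract refers to. Because the model must now also fit perturbed inputs, some loss of accuracy on clean inputs is the expected cost.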
