Vision Transformers are Robust Learners
Sayak Paul, Pin-Yu Chen

TL;DR
This paper evaluates the robustness of Vision Transformers (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples, and compares them with CNNs. It finds that ViTs are the more robust learners and attributes this in part to properties such as their Fourier spectrum sensitivity.
Contribution
It provides a comprehensive robustness evaluation of ViT models, including novel analyses explaining their superior robustness over CNNs.
Findings
ViT achieves 4.3x higher accuracy on ImageNet-A than BiT (Big-Transfer).
ViT models are more robust than the compared CNNs to common corruptions and natural adversarial examples.
Analyses indicate that properties such as Fourier spectrum sensitivity contribute to ViT robustness.
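This summary does not say how Fourier spectrum sensitivity is measured. One common way to probe it, assumed here and not necessarily the authors' exact protocol, is to add a single Fourier basis pattern to each input and record the model's error rate per frequency. The sketch below uses plain Python lists for grayscale images; `fourier_basis`, `perturb`, `sensitivity_map`, and the `predict` callable are all hypothetical names introduced for illustration.

```python
import math

def fourier_basis(n, u, v):
    """n x n real Fourier basis image at frequency (u, v), scaled to unit L2 norm."""
    raw = [[math.cos(2 * math.pi * (u * x + v * y) / n) for x in range(n)]
           for y in range(n)]
    norm = math.sqrt(sum(p * p for row in raw for p in row))
    return [[p / norm for p in row] for row in raw]

def perturb(image, basis, eps):
    """Add eps * basis to a grayscale image in [0, 1] and clip back to [0, 1]."""
    return [[min(1.0, max(0.0, px + eps * b)) for px, b in zip(img_row, b_row)]
            for img_row, b_row in zip(image, basis)]

def sensitivity_map(predict, images, labels, n, eps):
    """Error rate of `predict` under each Fourier basis perturbation.

    `predict` is an assumed callable mapping one image to a class label.
    Entry [v][u] is the fraction of images misclassified at frequency (u, v).
    """
    heat = [[0.0] * n for _ in range(n)]
    for v in range(n):
        for u in range(n):
            basis = fourier_basis(n, u, v)
            wrong = sum(predict(perturb(img, basis, eps)) != label
                        for img, label in zip(images, labels))
            heat[v][u] = wrong / len(images)
    return heat
```

Under this assumed setup, a model that is less sensitive to a frequency band shows lower error rates in the corresponding region of the map, which makes such maps a natural way to compare a ViT with a CNN.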
Abstract
Transformers, composed of multiple self-attention layers, hold strong promises toward a generic learning primitive applicable to different data modalities, including the recent breakthroughs in computer vision achieving state-of-the-art (SOTA) standard accuracy. What remains largely unexplored is their robustness evaluation and attribution. In this work, we study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples. We use six different diverse ImageNet datasets concerning robust classification to conduct a comprehensive performance comparison of ViT models and SOTA convolutional neural networks (CNNs), Big-Transfer. Through a series of six systematically designed experiments, we then present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Residual Connection · Dense Connections · Adam · Vision Transformer
