Vision Transformers are Robust Learners

Sayak Paul; Pin-Yu Chen

arXiv:2105.07581·cs.CV·December 7, 2021

TL;DR

This paper evaluates the robustness of Vision Transformers (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples, and compares their performance with CNNs. It finds that ViTs are more robust learners and attributes this to properties such as Fourier spectrum sensitivity.

Contribution

The paper provides a comprehensive robustness evaluation of ViT models against Big-Transfer (BiT) CNNs on six ImageNet robustness datasets, followed by six experiments that analyze why ViTs are more robust than CNNs.

Findings

01. ViT achieves 4.3x higher top-1 accuracy on ImageNet-A than a comparable BiT variant.

02. ViT models are more robust than BiT to common corruptions and natural adversarial examples.

03. Analyses show that properties such as Fourier spectrum sensitivity contribute to ViT robustness.
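To make "Fourier spectrum sensitivity" concrete, the sketch below shows one common way such an analysis can be set up: perturb each image with a single 2D Fourier basis pattern and record how often a classifier's prediction changes at each frequency. This is an illustrative NumPy sketch, not the authors' code. The function names, the `predict` callable, the perturbation strength `eps`, and the assumption of square grayscale images with pixel values in [0, 1] are all hypothetical choices made for the example.

```python
import numpy as np


def fourier_basis_image(size, i, j):
    """Return a real-valued, unit-norm image whose spectrum sits at frequency (i, j)."""
    spectrum = np.zeros((size, size), dtype=complex)
    spectrum[i, j] = 1.0
    # Set the mirrored coefficient too, so the inverse transform is real-valued.
    spectrum[-i % size, -j % size] = 1.0
    basis = np.fft.ifft2(spectrum).real
    return basis / np.linalg.norm(basis)


def fourier_sensitivity(predict, images, labels, eps=4.0):
    """Error rate of `predict` under a perturbation at each frequency.

    predict: hypothetical callable mapping an (n, size, size) array to n labels.
    images:  (n, size, size) array, assumed to hold values in [0, 1].
    labels:  (n,) array of reference labels.
    eps:     L2 norm of the added perturbation (an assumed setting).
    """
    size = images.shape[1]
    heatmap = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            noise = eps * fourier_basis_image(size, i, j)
            perturbed = np.clip(images + noise, 0.0, 1.0)
            heatmap[i, j] = np.mean(predict(perturbed) != labels)
    return heatmap


# Toy usage with a stand-in classifier (thresholds mean brightness).
rng = np.random.default_rng(0)
toy_images = rng.random((4, 8, 8))
toy_predict = lambda x: (x.mean(axis=(1, 2)) > 0.5).astype(int)
toy_labels = toy_predict(toy_images)
toy_heatmap = fourier_sensitivity(toy_predict, toy_images, toy_labels, eps=0.5)
print(toy_heatmap.shape)
```

Each cell of the resulting map is the fraction of inputs misclassified when energy is added at that frequency, so a model that is less sensitive in a given band shows lower values there. A real evaluation would replace `toy_predict` with the model under test and use its expected input format.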

Abstract

Transformers, composed of multiple self-attention layers, hold strong promises toward a generic learning primitive applicable to different data modalities, including the recent breakthroughs in computer vision achieving state-of-the-art (SOTA) standard accuracy. What remains largely unexplored is their robustness evaluation and attribution. In this work, we study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples. We use six different diverse ImageNet datasets concerning robust classification to conduct a comprehensive performance comparison of ViT models and SOTA convolutional neural networks (CNNs), Big-Transfer. Through a series of six systematically designed experiments, we then present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more…

Code & Models

Repositories

sayakpaul/robustness-vit · pytorch · Official

Videos

Vision Transformers are Robust Learners · underline

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Residual Connection · Dense Connections · Adam · Vision Transformer