Understanding The Robustness in Vision Transformers
Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Anima Anandkumar, Jiashi Feng, Jose M. Alvarez

TL;DR
This paper investigates how self-attention in Vision Transformers contributes to robustness, introduces fully attentional networks to enhance this property, and demonstrates state-of-the-art results on multiple vision tasks.
Contribution
It provides a systematic analysis of self-attention's role in robustness and proposes a new family of fully attentional networks with improved performance.
Findings
Achieves 87.1% accuracy on ImageNet-1k with 76.8M parameters
Sets a new state of the art for robustness on ImageNet-C (35.8% mCE)
Improves downstream task performance in segmentation and detection
Abstract
Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corruptions. Although this property is partly attributed to the self-attention mechanism, there is still a lack of systematic understanding. In this paper, we examine the role of self-attention in learning robust representations. Our study is motivated by the intriguing properties of the emerging visual grouping in Vision Transformers, which indicates that self-attention may promote robustness through improved mid-level representations. We further propose a family of fully attentional networks (FANs) that strengthen this capability by incorporating an attentional channel processing design. We validate the design comprehensively on various hierarchical backbones. Our model achieves a state-of-the-art 87.1% accuracy and 35.8% mCE on ImageNet-1k and ImageNet-C with 76.8M parameters. We also…
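The abstract reports robustness as mCE (mean corruption error) on ImageNet-C, where lower is better. As a rough illustration of how that metric is usually defined for the ImageNet-C benchmark, the sketch below sums a model's error rates over severity levels for each corruption type, divides by a baseline model's summed error for the same corruption, and averages the ratios. The function names and all numbers are hypothetical toy values, not taken from the paper or its code, and the baseline (AlexNet in the standard benchmark) is assumed here.

```python
from statistics import mean


def corruption_error(model_errors, baseline_errors):
    """Corruption error (CE) for one corruption type.

    Both arguments are error rates in [0, 1], one per severity level.
    CE is the model's summed error divided by the baseline's summed error.
    """
    if len(model_errors) != len(baseline_errors):
        raise ValueError("model and baseline must cover the same severity levels")
    return sum(model_errors) / sum(baseline_errors)


def mean_corruption_error(model, baseline):
    """mCE: the average of CE over all corruption types.

    `model` and `baseline` map a corruption name to its per-severity error rates.
    """
    return mean(corruption_error(model[name], baseline[name]) for name in model)


# Toy data (hypothetical): two corruption types, two severity levels each.
toy_model = {"noise": [0.2, 0.4], "blur": [0.1, 0.3]}
toy_baseline = {"noise": [0.8, 0.8], "blur": [0.5, 0.5]}

toy_mce = mean_corruption_error(toy_model, toy_baseline)
print(f"toy mCE: {toy_mce:.4f}")  # roughly 0.3875, i.e. 38.75% as a percentage
```

Because each corruption is normalized by the baseline before averaging, a corruption that is hard for every model does not dominate the score.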
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning
