Understanding The Robustness in Vision Transformers

Daquan Zhou; Zhiding Yu; Enze Xie; Chaowei Xiao; Anima Anandkumar; Jiashi Feng; Jose M. Alvarez

arXiv:2204.12451 · cs.CV · November 9, 2022 · 34 cites

TL;DR

This paper investigates how self-attention in Vision Transformers contributes to robustness, introduces fully attentional networks (FANs) to strengthen this property, and reports state-of-the-art accuracy and corruption robustness on ImageNet-1k and ImageNet-C, along with gains on downstream segmentation and detection.

Contribution

It provides a systematic analysis of self-attention's role in robustness and proposes a new family of fully attentional networks (FANs), which add an attentional channel processing design to improve robustness.

Findings

01 Achieves 87.1% accuracy on ImageNet-1k.
02 Sets a new state-of-the-art robustness result on ImageNet-C (35.8% mCE).
03 Improves downstream task performance in segmentation and detection.
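ImageNet-C robustness is conventionally reported as mean Corruption Error (mCE), the metric behind the 35.8% figure in the abstract. Under the standard ImageNet-C definition, a model's corruption error (CE) for one corruption type is its top-1 error summed over the severity levels, divided by AlexNet's summed error on the same corruption; mCE is the unweighted mean of CE over the 15 corruption types, usually quoted as a percentage (lower is better). The sketch below assumes per-severity error rates have already been measured; the function names and the error rates in the usage block are hypothetical, not results from the paper.

```python
from typing import Dict, List


def corruption_error(model_err: List[float], baseline_err: List[float]) -> float:
    """CE for one corruption type: the model's top-1 error summed over severity
    levels, divided by the baseline's summed error on the same corruption."""
    if len(model_err) != len(baseline_err):
        raise ValueError("need one error rate per severity for both models")
    return sum(model_err) / sum(baseline_err)


def mean_corruption_error(
    model: Dict[str, List[float]], baseline: Dict[str, List[float]]
) -> float:
    """mCE in percent: the unweighted mean of per-corruption CE values."""
    ces = [corruption_error(model[c], baseline[c]) for c in model]
    return 100.0 * sum(ces) / len(ces)


if __name__ == "__main__":
    # Hypothetical error rates (2 corruptions x 2 severities), not paper results.
    model = {"gaussian_noise": [0.20, 0.30], "motion_blur": [0.10, 0.10]}
    alexnet = {"gaussian_noise": [0.40, 0.60], "motion_blur": [0.40, 0.40]}
    print(f"mCE = {mean_corruption_error(model, alexnet):.1f}%")  # mCE = 37.5%
```

Normalizing by a fixed baseline keeps corruptions that are inherently hard from dominating the average, which is why mCE values are comparable across papers only when the same baseline error table is used.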

Abstract

Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corruptions. Although this property is partly attributed to the self-attention mechanism, there is still a lack of systematic understanding. In this paper, we examine the role of self-attention in learning robust representations. Our study is motivated by the intriguing properties of the emerging visual grouping in Vision Transformers, which indicates that self-attention may promote robustness through improved mid-level representations. We further propose a family of fully attentional networks (FANs) that strengthen this capability by incorporating an attentional channel processing design. We validate the design comprehensively on various hierarchical backbones. Our model achieves a state-of-the-art 87.1% accuracy on ImageNet-1k and 35.8% mCE on ImageNet-C with 76.8M parameters. We also…
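The abstract describes FANs as pairing self-attention over tokens with an attentional channel processing design. The following is a minimal, hypothetical sketch of that general idea, not the authors' FAN block: it assumes a single head, identity query/key/value projections, no normalization or MLP, and a simple channel attention in which each channel attends to all channels through a d x d similarity matrix computed over tokens. All function names are illustrative, and only the Python standard library is used so the shapes stay easy to follow.

```python
import math
from typing import List

Matrix = List[List[float]]  # row-major: one row per token, one column per channel


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b (len(a[0]) must equal len(b))."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax_rows(a: Matrix) -> Matrix:
    """Numerically stable softmax applied independently to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def token_self_attention(x: Matrix) -> Matrix:
    """Single-head scaled dot-product attention across tokens (N x d -> N x d).
    Simplification: Q = K = V = x (no learned projections)."""
    d = len(x[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(x, transpose(x))]  # N x N
    return matmul(softmax_rows(scores), x)


def channel_attention(x: Matrix) -> Matrix:
    """Hypothetical channel-wise attention: each channel attends to all channels
    via a d x d similarity matrix computed over tokens (N x d -> N x d)."""
    n = len(x)
    xt = transpose(x)  # d x N
    scores = [[s / math.sqrt(n) for s in row] for row in matmul(xt, x)]  # d x d
    return transpose(matmul(softmax_rows(scores), xt))  # (d x N) -> N x d


def fully_attentional_block(x: Matrix) -> Matrix:
    """Token mixing then channel mixing, each wrapped in a residual connection."""
    y = add(x, token_self_attention(x))
    return add(y, channel_attention(y))


if __name__ == "__main__":
    tokens = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]  # 2 tokens, 3 channels
    out = fully_attentional_block(tokens)
    print(len(out), len(out[0]))  # 2 3
```

The design choice to illustrate is the symmetry: token attention mixes information across spatial positions with an N x N map, while the channel step mixes features within each token with a d x d map, so each output entry of `channel_attention` is a convex combination of that token's own channels.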

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning