Intriguing Properties of Vision Transformers

Muzammal Naseer; Kanchana Ranasinghe; Salman Khan; Munawar Hayat; Fahad Shahbaz Khan; Ming-Hsuan Yang

arXiv:2105.10497·cs.CV·November 29, 2021·301 cites

TL;DR

This paper studies the robustness and shape awareness of Vision Transformers (ViTs), showing that they are resilient to occlusions and perturbations, exhibit shape bias, and are effective for segmentation and ensemble learning.

Contribution

The study provides a comprehensive analysis of ViT properties, revealing robustness, shape bias, and potential for diverse vision tasks, all of which had previously been underexplored.

Findings

01 ViTs retain high accuracy under severe occlusions and domain shifts.

02 ViTs are less biased towards local textures than CNNs and, when trained to encode shape, recognize shapes comparably to humans.

03 Features from a single ViT can be combined into a feature ensemble for high-accuracy classification and few-shot learning.

Abstract

Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. An important question is how such flexibility in attending image-wide context conditioned on a given patch can facilitate handling nuisances in natural images e.g., severe occlusions, domain shifts, spatial permutations, adversarial and natural perturbations. We systematically study this question via an extensive set of experiments encompassing three ViT families and comparisons with a high-performing convolutional neural network (CNN). We show and analyze the following intriguing properties of ViT: (a) Transformers are highly robust to severe occlusions, perturbations and domain shifts, e.g., retain as high as 60% top-1 accuracy on…
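The occlusion setting mentioned in the abstract can be illustrated by masking image patches before classification. The following is a minimal sketch, assuming occlusion is simulated by zeroing randomly chosen non-overlapping patches of a single-channel image. The function name `random_patch_drop`, the patch size of 16, and the drop ratio are illustrative assumptions and are not taken from the authors' repository.

```python
import random


def random_patch_drop(image, patch_size=16, drop_ratio=0.5, seed=None):
    """Zero out a random subset of non-overlapping patches.

    `image` is a 2-D list of pixel values (rows of equal length).
    Returns a new occluded image and the set of dropped patch indices.
    Hypothetical helper for illustration only.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")

    rows, cols = height // patch_size, width // patch_size
    num_patches = rows * cols
    num_drop = int(round(drop_ratio * num_patches))

    # Choose which patches to occlude, without replacement.
    rng = random.Random(seed)
    dropped = set(rng.sample(range(num_patches), num_drop))

    # Copy the image so the caller's data is left untouched.
    occluded = [list(row) for row in image]
    for index in dropped:
        top = (index // cols) * patch_size
        left = (index % cols) * patch_size
        for y in range(top, top + patch_size):
            for x in range(left, left + patch_size):
                occluded[y][x] = 0
    return occluded, dropped


# Usage: a 224x224 image splits into 14 x 14 = 196 patches of size 16.
image = [[1] * 224 for _ in range(224)]
occluded, dropped = random_patch_drop(image, drop_ratio=0.5, seed=0)
```

With a drop ratio of 0.5, 98 of the 196 patches are zeroed. Robustness can then be measured by comparing a model's accuracy on the original and occluded inputs as the drop ratio increases.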

Code & Models

Repositories

Muzammal-Naseer/Intriguing-Properties-of-Vision-Transformers
pytorch

Videos

Intriguing Properties of Vision Transformers · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection