Making Vision Transformers Truly Shift-Equivariant

Renan A. Rojas-Gomez; Teck-Yian Lim; Minh N. Do; Raymond A. Yeh

arXiv:2305.16316 · cs.CV · November 30, 2023 · 2 cites

Open Access

TL;DR

This paper introduces data-adaptive modules for Vision Transformers to achieve true shift-equivariance, ensuring consistent outputs under input shifts while maintaining competitive performance on classification and segmentation tasks.

Contribution

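The paper proposes novel data-adaptive versions of the main ViT modules (tokenization, self-attention, patch merging, and positional encoding) that make the models shift-equivariant, a property standard ViTs lack. The modules are applied to four architectures: Swin, SwinV2, CvT, and MViTv2.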

Findings

01 Achieves 100% shift consistency on four ViT models: Swin, SwinV2, CvT, and MViTv2.

02 Maintains competitive accuracy on image classification.

03 Achieves competitive semantic segmentation performance across the evaluated datasets.
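
To make the "shift consistency" figure concrete, here is a minimal sketch of one common way such a metric is computed: the percentage of (input, shift pair) cases where a model predicts the same label for two shifted copies of the same input. Everything in it is an assumption for illustration and not taken from the paper: the function names, the 1D list inputs, the circular shifts, and the two toy "models".

```python
def roll(x, s):
    """Circularly shift the list x to the right by s positions."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]


def shift_consistency(predict, inputs, shift_pairs):
    """Percentage of (input, shift pair) cases with identical predictions."""
    agree = 0
    total = 0
    for x in inputs:
        for s1, s2 in shift_pairs:
            total += 1
            if predict(roll(x, s1)) == predict(roll(x, s2)):
                agree += 1
    return 100.0 * agree / total


def invariant_model(x):
    # Hypothetical classifier that depends only on the set of values,
    # so circular shifts cannot change its output.
    return int(max(x) > 5)


def sensitive_model(x):
    # Hypothetical classifier that depends on absolute positions,
    # so a shift can flip its output.
    return int(x[0] > x[1])


inputs = [[3, 1, 4, 1, 5, 9, 2, 6], [2, 7, 1, 8, 2, 8, 1, 8]]
shift_pairs = [(0, 1), (2, 5), (3, 7)]

print(shift_consistency(invariant_model, inputs, shift_pairs))
print(shift_consistency(sensitive_model, inputs, shift_pairs))
```

A model that reaches 100% under this kind of measure never changes its prediction between the tested shifts; the position-dependent toy model falls short of that.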

Abstract

For computer vision, Vision Transformers (ViTs) have become one of the go-to deep net architectures. Despite being inspired by Convolutional Neural Networks (CNNs), ViTs' output remains sensitive to small spatial shifts in the input, i.e., not shift invariant. To address this shortcoming, we introduce novel data-adaptive designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding. With our proposed modules, we achieve true shift-equivariance on four well-established ViTs, namely, Swin, SwinV2, CvT, and MViTv2. Empirically, we evaluate the proposed adaptive models on image classification and semantic segmentation tasks. These models achieve competitive performance across three different datasets while maintaining 100% shift consistency.
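
As a rough illustration of why a fixed patch grid is sensitive to shifts, and how a data-adaptive grid can avoid this, the following 1D sketch compares fixed tokenization with a tokenizer that picks its grid offset from the input itself. It is a toy under stated assumptions and not the authors' implementation: the helper names (`roll`, `tokenize`, `fixed_tokens`, `adaptive_tokens`), the sum-of-squared-patch-sums score, and the test signal are all hypothetical, and shifts are circular.

```python
def roll(x, s):
    """Circularly shift the list x to the right by s positions."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]


def tokenize(x, patch, offset):
    """Split x into non-overlapping patches on a grid starting at `offset`,
    wrapping around the end of the signal."""
    n = len(x)
    return [
        [x[(offset + k * patch + j) % n] for j in range(patch)]
        for k in range(n // patch)
    ]


def fixed_tokens(x, patch):
    # Standard tokenization: the grid always starts at position 0.
    return tokenize(x, patch, 0)


def score(tokens):
    # Assumed selection criterion: unchanged when the patches are reordered.
    return sum(sum(p) ** 2 for p in tokens)


def adaptive_tokens(x, patch):
    # Data-adaptive tokenization: try every grid offset and keep the one
    # with the highest score, so the grid follows the content when it shifts.
    best = max(range(patch), key=lambda o: score(tokenize(x, patch, o)))
    return tokenize(x, patch, best)


x = [3, 1, 4, 1, 5, 9, 2, 6]
shifted = roll(x, 1)

# Compare token sets (order ignored) before and after a one-sample shift.
print(sorted(fixed_tokens(x, 4)) == sorted(fixed_tokens(shifted, 4)))
print(sorted(adaptive_tokens(x, 4)) == sorted(adaptive_tokens(shifted, 4)))
```

With the fixed grid, a one-sample shift changes the content of every patch. With the adaptive grid, the same patches are recovered, possibly in rotated order, as long as the score has a unique maximum. A tie between offsets would make the selection ambiguous; the example signal is chosen so that no tie occurs.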

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Industrial Vision Systems and Defect Detection

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Depthwise Convolution · Pointwise Convolution · Residual Connection · Batch Normalization · Depthwise Separable Convolution · Convolution · Dense Connections