Making Vision Transformers Truly Shift-Equivariant
Renan A. Rojas-Gomez, Teck-Yian Lim, Minh N. Do, Raymond A. Yeh

TL;DR
This paper introduces data-adaptive modules for Vision Transformers to achieve true shift-equivariance, ensuring consistent outputs under input shifts while maintaining competitive performance on classification and segmentation tasks.
Contribution
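The paper proposes novel, data-adaptive versions of the core ViT modules (tokenization, self-attention, patch merging, and positional encoding) that make the network shift-equivariant, a property standard ViTs lack. The modules are applied to four architectures: Swin, SwinV2, CvT, and MViTv2.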
Findings
Achieves 100% shift consistency on four ViT models: Swin, SwinV2, CvT, and MViTv2.
Maintains competitive accuracy on image classification tasks.
Achieves competitive performance on semantic segmentation; across both tasks, the evaluation covers three datasets.
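To make the "shift consistency" figure concrete, here is a minimal sketch of one way such a metric can be computed: the fraction of shift pairs for which a model's prediction is unchanged. It is an illustration under stated assumptions, not the paper's evaluation code. The names `circular_shift`, `shift_consistency`, `invariant_model`, and `sensitive_model` are hypothetical, the inputs are toy 1D integer signals, and all pairs of circular shifts are enumerated (in practice, shifts are usually sampled at random on images).

```python
from itertools import combinations


def circular_shift(x, s):
    """Circularly shift a 1D list to the right by s positions."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]


def shift_consistency(predict, inputs):
    """Fraction of (input, shift pair) cases where predictions agree.

    Assumption: every pair of distinct circular shifts is checked,
    which is only feasible for short toy signals.
    """
    agree, total = 0, 0
    for x in inputs:
        preds = [predict(circular_shift(x, s)) for s in range(len(x))]
        for i, j in combinations(range(len(x)), 2):
            agree += int(preds[i] == preds[j])
            total += 1
    return agree / total


# Toy "models": one ignores position entirely, one looks only at position 0.
def invariant_model(x):
    return int(sum(x) > 0)


def sensitive_model(x):
    return int(x[0] > 0)


toy_inputs = [[1, -2, 3, -1], [-1, -1, 2, -3], [0, 5, -4, -2]]
invariant_score = shift_consistency(invariant_model, toy_inputs)
sensitive_score = shift_consistency(sensitive_model, toy_inputs)
print(invariant_score, sensitive_score)
```

A model that reaches 100% on this kind of metric gives the same prediction for every shifted copy of an input, which is the behavior the findings above report for the adapted ViTs.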
Abstract
For computer vision, Vision Transformers (ViTs) have become one of the go-to deep net architectures. Despite being inspired by Convolutional Neural Networks (CNNs), ViTs' output remains sensitive to small spatial shifts in the input, i.e., it is not shift-invariant. To address this shortcoming, we introduce novel data-adaptive designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding. With our proposed modules, we achieve true shift-equivariance on four well-established ViTs, namely, Swin, SwinV2, CvT, and MViTv2. Empirically, we evaluate the proposed adaptive models on image classification and semantic segmentation tasks. These models achieve competitive performance across three different datasets while maintaining 100% shift consistency.
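The idea of a data-adaptive tokenizer can be illustrated with a small 1D sketch. A fixed patch grid produces different tokens when the input is shifted by one sample, whereas a grid whose offset is chosen from the data itself yields the same set of tokens (up to token order) under any circular shift. This is a toy under stated assumptions, not the authors' implementation: the function names are hypothetical, and the selection criterion (the sum of squared patch sums) is chosen here only for simplicity and assumes the maximum is unique.

```python
def circular_shift(x, s):
    """Circularly shift a 1D list to the right by s positions."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]


def fixed_tokenize(x, patch):
    """Split x into non-overlapping patches on a fixed grid."""
    return [tuple(x[i:i + patch]) for i in range(0, len(x), patch)]


def grid_energy(x, offset, patch):
    """Illustrative criterion: sum of squared patch sums for a grid offset."""
    rolled = circular_shift(x, -offset)
    return sum(sum(p) ** 2 for p in fixed_tokenize(rolled, patch))


def adaptive_tokenize(x, patch):
    """Pick the grid offset with the highest energy, then tokenize.

    Shifting the input moves the best offset along with it, so the
    selected patches stay the same (assuming a unique maximum).
    """
    best = max(range(patch), key=lambda m: grid_energy(x, m, patch))
    return fixed_tokenize(circular_shift(x, -best), patch)


signal = [3, 1, 4, 1, 5, 9, 2, 6]
shifted = circular_shift(signal, 1)

# Compare token sets irrespective of token order.
fixed_same = sorted(fixed_tokenize(signal, 4)) == sorted(fixed_tokenize(shifted, 4))
adaptive_same = sorted(adaptive_tokenize(signal, 4)) == sorted(adaptive_tokenize(shifted, 4))
print(fixed_same, adaptive_same)
```

The tokens still arrive in a circularly permuted order after a shift, which is why the remaining modules named in the abstract (self-attention, patch merging, positional encoding) also need shift-aware designs for the whole network to be equivariant.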
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Industrial Vision Systems and Defect Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Depthwise Convolution · Pointwise Convolution · Residual Connection · Batch Normalization · Depthwise Separable Convolution · Convolution · Dense Connections
