Reviving Shift Equivariance in Vision Transformers

Peijian Ding, Davit Soselia, Thomas Armstrong, Jiahao Su, and Furong Huang

arXiv:2306.07470 · cs.CV · June 14, 2023 · 1 citation

Open Access

TL;DR

This paper introduces an adaptive polyphase anchoring method to restore shift-equivariance in vision transformers, significantly improving their robustness and prediction consistency under input shifts and transformations.

Contribution

It proposes a novel adaptive polyphase anchoring algorithm that ensures shift-equivariance in vision transformers, addressing a key limitation of existing models.

Findings

01. Achieves 100% shift consistency in predictions.

02. Demonstrates robustness to cropping, flipping, and affine transformations.

03. Maintains high accuracy even under input shifts that reduce baseline models' performance.
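
This page does not define "shift consistency". A common reading, assumed here, is the fraction of inputs that receive the same predicted label under two independent random shifts. The sketch below illustrates that reading only; `shift_consistency`, `predict`, and `max_shift` are hypothetical names, and circular shifts via `np.roll` are an assumption about the shift model.

```python
import numpy as np

def shift_consistency(predict, images, max_shift, rng):
    """Fraction of images whose predicted label agrees under two random shifts.

    `predict` is a hypothetical callable mapping an (H, W) array to a label.
    Shifts are circular (np.roll), which is an assumption of this sketch.
    """
    agree = 0
    for img in images:
        # Draw two independent (dy, dx) offsets in [0, max_shift].
        dy1, dx1, dy2, dx2 = rng.integers(0, max_shift + 1, size=4)
        a = predict(np.roll(img, (dy1, dx1), axis=(0, 1)))
        b = predict(np.roll(img, (dy2, dx2), axis=(0, 1)))
        agree += int(a == b)
    return agree / len(images)

rng = np.random.default_rng(0)
images = [rng.normal(size=(8, 8)) for _ in range(20)]

# A toy predictor that depends only on the maximum pixel value, so it is
# invariant to circular shifts by construction.
invariant_predict = lambda img: int(np.max(img) > 2.0)
# A toy predictor that reads one fixed pixel, so shifts can change its output.
positional_predict = lambda img: int(img[0, 0] > 0.0)

print(shift_consistency(invariant_predict, images, max_shift=4, rng=rng))
print(shift_consistency(positional_predict, images, max_shift=4, rng=rng))
```

The shift-invariant toy predictor scores 1.0 under this metric by construction, while the position-dependent one generally scores lower.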

Abstract

Shift equivariance is a fundamental principle that governs how we perceive the world - our recognition of an object remains invariant with respect to shifts. Transformers have gained immense popularity due to their effectiveness in both language and vision tasks. While the self-attention operator in vision transformers (ViT) is permutation-equivariant and thus shift-equivariant, patch embedding, positional encoding, and subsampled attention in ViT variants can disrupt this property, resulting in inconsistent predictions even under small shift perturbations. Although there is a growing trend in incorporating the inductive bias of convolutional neural networks (CNNs) into vision transformers, it does not fully address the issue. We propose an adaptive polyphase anchoring algorithm that can be seamlessly integrated into vision transformer models to ensure shift-equivariance in patch…
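
The abstract is truncated here and does not spell out the algorithm, so the following is only a minimal sketch of one plausible form of polyphase anchoring, not the authors' implementation. It assumes the anchor is the polyphase component with the largest norm, that shifts are circular, and that the image size is divisible by the stride. The names `polyphase_anchor`, `patch_embed`, and `is_circular_shift` are hypothetical.

```python
import numpy as np

def polyphase_anchor(x, stride):
    """Circularly shift x (H, W) so its max-norm polyphase component starts at (0, 0).

    Assumed selection rule: largest Frobenius norm among the stride*stride
    polyphase components x[i::stride, j::stride].
    """
    best, best_norm = (0, 0), -1.0
    for i in range(stride):
        for j in range(stride):
            norm = np.linalg.norm(x[i::stride, j::stride])
            if norm > best_norm:
                best, best_norm = (i, j), norm
    i, j = best
    return np.roll(x, (-i, -j), axis=(0, 1))

def patch_embed(x, weight, stride):
    """Toy patch embedding: split x into non-overlapping stride x stride
    patches and project each flattened patch with `weight`."""
    h, w = x.shape
    patches = x.reshape(h // stride, stride, w // stride, stride)
    patches = patches.transpose(0, 2, 1, 3).reshape(h // stride, w // stride, -1)
    return patches @ weight

def is_circular_shift(a, b):
    """True if a equals b circularly shifted along the two token-grid axes."""
    return any(
        np.allclose(a, np.roll(b, (k, l), axis=(0, 1)))
        for k in range(b.shape[0])
        for l in range(b.shape[1])
    )

rng = np.random.default_rng(0)
x = rng.normal(size=(12, 12))
weight = rng.normal(size=(16, 8))  # 4*4 pixels per patch -> 8 features

ref = patch_embed(polyphase_anchor(x, 4), weight, 4)
shifted_input = np.roll(x, (3, 5), axis=(0, 1))
out = patch_embed(polyphase_anchor(shifted_input, 4), weight, 4)
print(is_circular_shift(out, ref))
```

Under these assumptions, a circular shift of the input only permutes the polyphase components, so the anchored input changes by a multiple of the stride and the patch tokens are circularly shifted on the token grid instead of being recomputed from misaligned patches.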

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Softmax · Convolution · Dense Connections · Vision Transformer