Reviving Shift Equivariance in Vision Transformers
Peijian Ding, Davit Soselia, Thomas Armstrong, Jiahao Su, and Furong Huang

TL;DR
This paper introduces an adaptive polyphase anchoring method to restore shift-equivariance in vision transformers, significantly improving their robustness and prediction consistency under input shifts and transformations.
Contribution
It proposes a novel adaptive polyphase anchoring algorithm that ensures shift-equivariance in vision transformers, addressing a key limitation of existing models.
Findings
Achieves 100% shift consistency in predictions.
Demonstrates robustness to cropping, flipping, and affine transformations.
Maintains high accuracy even under input shifts that reduce baseline models' performance.
Abstract
Shift equivariance is a fundamental principle that governs how we perceive the world - our recognition of an object remains invariant with respect to shifts. Transformers have gained immense popularity due to their effectiveness in both language and vision tasks. While the self-attention operator in vision transformers (ViT) is permutation-equivariant and thus shift-equivariant, patch embedding, positional encoding, and subsampled attention in ViT variants can disrupt this property, resulting in inconsistent predictions even under small shift perturbations. Although there is a growing trend in incorporating the inductive bias of convolutional neural networks (CNNs) into vision transformers, it does not fully address the issue. We propose an adaptive polyphase anchoring algorithm that can be seamlessly integrated into vision transformer models to ensure shift-equivariance in patch…
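To make the idea in the abstract concrete, the following is a minimal sketch of polyphase anchoring in front of a strided patch embedding. It is not the authors' implementation. It assumes circular shifts, a single-channel image, and an anchor chosen by maximum L2 norm over the polyphase components (in the spirit of adaptive polyphase sampling); the paper's exact criterion may differ. The names polyphase_anchor, patch_embed, and equal_up_to_token_shift are hypothetical, and patch_embed is only a toy stand-in (a patch mean) for a learned embedding.

```python
import numpy as np


def polyphase_anchor(x, stride):
    """Circularly shift a 2D array so its largest-norm polyphase
    component starts at offset (0, 0).

    Assumption: the anchor is picked by maximum L2 norm. Ties are broken
    by scan order, which can break equivariance on degenerate inputs.
    """
    h, w = x.shape
    if h % stride or w % stride:
        raise ValueError("height and width must be divisible by stride")
    # Norm of each of the stride * stride polyphase components.
    norms = np.array([[np.linalg.norm(x[i::stride, j::stride])
                       for j in range(stride)] for i in range(stride)])
    i, j = np.unravel_index(np.argmax(norms), norms.shape)
    return np.roll(x, shift=(-int(i), -int(j)), axis=(0, 1))


def patch_embed(x, stride):
    """Toy stand-in for patch embedding: mean of each non-overlapping patch."""
    h, w = x.shape
    return x.reshape(h // stride, stride, w // stride, stride).mean(axis=(1, 3))


def equal_up_to_token_shift(a, b):
    """True if b is some circular shift of the token grid a."""
    return any(np.allclose(np.roll(a, (m, n), axis=(0, 1)), b)
               for m in range(a.shape[0]) for n in range(a.shape[1]))


rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
shifted = np.roll(image, shift=(3, 5), axis=(0, 1))  # circular input shift

plain = equal_up_to_token_shift(patch_embed(image, 4), patch_embed(shifted, 4))
anchored = equal_up_to_token_shift(
    patch_embed(polyphase_anchor(image, 4), 4),
    patch_embed(polyphase_anchor(shifted, 4), 4),
)
print("without anchoring:", plain)
print("with anchoring:", anchored)
```

Under these assumptions, anchoring maps an image and any circularly shifted copy of it to the same array up to a shift by a multiple of the stride, so the patch tokens agree up to a circular shift of the token grid. Without anchoring, a shift that is not a multiple of the stride generally changes the contents of every patch.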
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Softmax · Convolution · Dense Connections · Vision Transformer
