Vanilla Group Equivariant Vision Transformer: Simple and Effective

Jiahong Fu; Qi Xie; Deyu Meng; Zongben Xu

arXiv:2602.08047·cs.CV·February 10, 2026

Open Access

TL;DR

This paper introduces a simple, theoretically grounded framework for making Vision Transformers fully equivariant, improving their performance and data efficiency across various vision tasks.

Contribution

The framework systematically renders key ViT components (patch embedding, self-attention, positional encodings, and down/up-sampling) equivariant, yielding a plug-and-play replacement that scales to existing architectures such as Swin Transformers.

Findings

01. Consistently improves vision task performance
02. Enhances data efficiency in training
03. Scales seamlessly to complex ViT architectures

Abstract

Incorporating symmetry priors as inductive biases to design equivariant Vision Transformers (ViTs) has emerged as a promising avenue for enhancing their performance. However, existing equivariant ViTs often struggle to balance performance with equivariance, primarily due to the challenge of achieving holistic equivariant modifications across the diverse modules in ViTs, particularly in harmonizing the Self-Attention mechanism with Patch Embedding. To address this, we propose a straightforward framework that systematically renders key ViT components, including patch embedding, self-attention, positional encodings, and Down/Up-Sampling, equivariant, thereby constructing ViTs with guaranteed equivariance. The resulting architecture serves as a plug-and-play replacement that is both theoretically grounded and practically versatile, scaling seamlessly even to Swin Transformers. Extensive…
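
To make the term "equivariant" in the abstract concrete, the sketch below numerically checks the property f(g·x) = g·f(x) for a toy patch embedding under 90-degree rotations (the group C4). It is an illustration only and is not the construction proposed in the paper: the function name `c4_invariant_patch_embed`, the single-channel 8×8 input, and the choice of averaging one filter over its four rotations are all assumptions made for this example.

```python
import numpy as np

def c4_invariant_patch_embed(image, weight):
    """Toy patch embedding (hypothetical, not the paper's method).

    Each non-overlapping p x p patch is projected onto a filter that has been
    averaged over its four 90-degree rotations, giving one scalar per patch.
    """
    p = weight.shape[0]
    h, w = image.shape
    # Split the image into an (h//p, w//p) grid of p x p patches.
    patches = image.reshape(h // p, p, w // p, p).transpose(0, 2, 1, 3)
    # A C4-averaged filter responds identically to a patch and its rotations.
    sym = sum(np.rot90(weight, k) for k in range(4)) / 4.0
    return np.einsum("ijab,ab->ij", patches, sym)

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))    # square input, side divisible by the patch size
weight = rng.normal(size=(4, 4))   # assumed 4 x 4 patch filter

# Equivariance: embedding then rotating the token grid should match
# rotating the image then embedding.
embed_then_rotate = np.rot90(c4_invariant_patch_embed(image, weight))
rotate_then_embed = c4_invariant_patch_embed(np.rot90(image), weight)
equivariance_error = np.abs(embed_then_rotate - rotate_then_embed).max()
print("max equivariance error:", equivariance_error)
```

In this toy case the token grid rotates together with the image because each token is invariant to rotations of its own patch. Replacing `sym` with the raw `weight` generally breaks the equality, which hints at why every module in the pipeline has to be treated consistently.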

Taxonomy

Topics: Advanced Memory and Neural Computing · Advanced Neural Network Applications · Face Recognition and Perception