TL;DR
This paper introduces Octic Vision Transformers that leverage octic group equivariance to efficiently exploit geometric symmetries like rotations and reflections, significantly reducing computational costs while maintaining accuracy.
Contribution
We propose octic linear layers for ViTs that achieve substantial FLOPs and memory reductions, enabling efficient equivariant Vision Transformers without sacrificing performance.
Findings
Octic linear layers reduce FLOPs by 5.33x and memory by up to 8x compared to ordinary linear layers.
Octic ViTs match baseline accuracy on ImageNet-1K.
Efficient equivariant ViTs can be trained with supervised and unsupervised methods.
Abstract
Why are state-of-the-art Vision Transformers (ViTs) not designed to exploit natural geometric symmetries such as 90-degree rotations and reflections? In this paper, we argue that there is no fundamental reason, and what has been missing is an efficient implementation. To this end, we introduce Octic Vision Transformers (octic ViTs) which rely on octic group equivariance to capture these symmetries. In contrast to prior equivariant models that increase computational cost, our octic linear layers achieve 5.33x reductions in FLOPs and up to 8x reductions in memory compared to ordinary linear layers. In full octic ViT blocks the computational reductions approach the reductions in the linear layers with increased embedding dimension. We study two new families of ViTs, built from octic blocks, that are either fully octic equivariant or break equivariance in the last part of the network.…
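The 5.33x and 8x figures quoted above can be reproduced with a short counting argument. The sketch below is illustrative and not necessarily the authors' exact accounting. It assumes the features are `m` copies of the regular representation of the order-8 dihedral group (so the embedding dimension is `8 * m`), and that the equivariant layer is applied in a basis where it is block diagonal over the group's irreducible representations (four 1-dimensional and one 2-dimensional). The names `IRREP_DIMS` and `octic_linear_cost` are hypothetical.

```python
from fractions import Fraction

# Dimensions of the irreducible representations of the order-8 dihedral group:
# four 1-dimensional irreps and one 2-dimensional irrep (1+1+1+1+2*2 = 8).
IRREP_DIMS = [1, 1, 1, 1, 2]


def octic_linear_cost(m):
    """Count parameters and per-token multiply-accumulates (MACs).

    Assumption: input and output features are both m copies of the regular
    representation, i.e. both have dimension 8 * m.
    Returns (dense_cost, equivariant_params, equivariant_macs), where
    dense_cost is both the parameter count and the per-token MAC count
    of an ordinary (8m x 8m) linear layer.
    """
    dense_cost = (8 * m) ** 2
    params = 0
    macs = 0
    for dim in IRREP_DIMS:
        mult = dim * m               # each irrep occurs `dim` times per regular copy
        params += mult * mult        # one free (mult x mult) block per irrep
        macs += mult * mult * dim    # that block acts on `dim` columns
    return dense_cost, params, macs


dense, params, macs = octic_linear_cost(m=96)  # embedding dimension 768
param_ratio = Fraction(dense, params)          # memory (parameter) reduction
flop_ratio = Fraction(dense, macs)             # FLOPs reduction
print(f"parameters: {param_ratio}x, FLOPs: {float(flop_ratio):.2f}x")
```

Under these assumptions the ratios do not depend on `m`: the parameter count drops by a factor of 8 and the MAC count by a factor of 16/3, which is approximately 5.33.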
Peer Reviews
Decision·Submitted to ICLR 2026
- Good comparison of end-to-end equivariant ($D_8$), late invariant ($I_8$), and late non-equivariant ($H_8$) models and their performance.
- Extensive experiments on ImageNet classification and an SSL task show that equivariance can help.
A critical weakness of this paper in my opinion is the limited novelty. Equivariant ViTs have been proposed previously [1, 2, 3] and octic ViTs are a special case. It is also well known that weight sharing/weight tying reduces the number of FLOPs as a direct consequence of the reduction of the number of parameters [4, 5]. Furthermore, several papers have shown that some form of symmetry breaking, especially in the later layers, can be beneficial for CNNs [4, 5, 6], albeit not for ViTs. Thus the
- Significant Compute Efficiency: Drastically reduces FLOPs for SOTA ViTs without sacrificing accuracy.
- Efficient Equivariance: Unlike prior work where equivariance adds overhead, this paper uses Fourier domain sparsity to accelerate ViTs.
- SOTA Validation: Validated at scale on SOTA training recipes, proving practical utility.
- Architectural Insights: The ablation between hybrid and fully invariant models provides valuable design insight.
- FLOPs vs. Throughput Mismatch: The large FLOPs reduction does not fully translate to throughput/speedup (only 1.47x).
- Implementation Complexity: The method is complex, requiring knowledge of group representation theory, Fourier transforms, and custom Triton kernels.
- Non-linearity Overhead: GELU activations must be applied in the spatial domain ($\rho_{reg}$), requiring costly round-trips via Fourier transforms.
The experiments show that the method is efficient, though they are not fully comprehensive, as discussed below. The proposed method is interesting. Applying the octic group to ViTs is reasonable and appears to have a wide range of applications.
My first concern is that the mechanism is not well motivated. I would suggest that the authors add more explanation of why the proposed method is more efficient in the introduction and method sections. Currently, the authors justify the method by claiming that leveraging a larger group can yield "faster, stronger, and more compact models" without substantiating this claim further. I agree this could be right, but this leads to two questions: (1) Why not use a group that is even larger? (2) Would it be
