TL;DR
This paper introduces an Alias-Free Vision Transformer that employs alias-free downsampling and linear attention to achieve shift invariance, enhancing robustness to image translations while maintaining competitive classification performance.
Contribution
It proposes a novel ViT architecture combining alias-free components and linear attention for improved translation robustness.
Findings
Outperforms similar models in adversarial translation robustness
Maintains competitive accuracy in image classification
Achieves shift invariance by combining alias-free downsampling and nonlinearities with linear cross-covariance attention
Abstract
Transformers have emerged as a competitive alternative to convnets in vision tasks, yet they lack the architectural inductive bias of convnets, which may hinder their potential performance. Specifically, Vision Transformers (ViTs) are not translation-invariant and are more sensitive to minor image translations than standard convnets. Previous studies have shown, however, that convnets are also not perfectly shift-invariant, due to aliasing in downsampling and nonlinear layers. Consequently, anti-aliasing approaches have been proposed to certify convnets' translation robustness. Building on this line of work, we propose an Alias-Free ViT, which combines two main components. First, it uses alias-free downsampling and nonlinearities. Second, it uses linear cross-covariance attention that is shift-equivariant to both integer and fractional translations, enabling a shift-invariant global…
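The equivariance claim for cross-covariance attention can be checked numerically. The sketch below is a minimal single-head, 1D illustration and not the paper's implementation. It assumes an XCiT-style formulation in which the attention map is a channels-by-channels matrix computed from L2-normalized keys and queries. The helper names (`shift_tokens`, `xca`), the softmax axis, the temperature `tau`, and the odd token count (chosen so that a Fourier phase ramp yields a real, norm-preserving fractional shift) are all assumptions made for illustration.

```python
import numpy as np

def shift_tokens(x, delta):
    """Circularly shift a (tokens, channels) array by `delta` tokens.

    `delta` may be fractional: the shift is applied as a phase ramp in the
    Fourier domain along the token axis. An odd token count is assumed so
    there is no Nyquist bin and the result stays exactly real.
    """
    n = x.shape[0]
    freqs = np.fft.fftfreq(n)  # frequencies in cycles per token
    phase = np.exp(-2j * np.pi * freqs * delta)[:, None]
    return np.fft.ifft(np.fft.fft(x, axis=0) * phase, axis=0).real

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def xca(x, wq, wk, wv, tau=1.0):
    """Single-head cross-covariance attention on (tokens, channels) input."""
    q, k, v = x @ wq, x @ wk, x @ wv
    # L2-normalize each channel across the token axis.
    q = q / np.linalg.norm(q, axis=0, keepdims=True)
    k = k / np.linalg.norm(k, axis=0, keepdims=True)
    # The attention map mixes channels, not tokens: shape (channels, channels).
    attn = softmax(k.T @ q / tau, axis=0)
    return v @ attn

rng = np.random.default_rng(0)
n_tokens, dim = 9, 4  # hypothetical sizes; n_tokens is odd on purpose
x = rng.standard_normal((n_tokens, dim))
wq, wk, wv = (rng.standard_normal((dim, dim)) for _ in range(3))

for delta in (1, 0.5):
    shifted_then_attended = xca(shift_tokens(x, delta), wq, wk, wv)
    attended_then_shifted = shift_tokens(xca(x, wq, wk, wv), delta)
    print(delta, np.allclose(shifted_then_attended, attended_then_shifted))
```

The check works because the shift acts as an orthogonal linear map on the token axis, so the product of keys and queries summed over tokens is unchanged, and the channel-mixing matrix is then applied identically at every position. Standard token-to-token softmax attention has no such guarantee for fractional shifts, since its attention weights depend nonlinearly on the sampled token values.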