Blending Anti-Aliasing into Vision Transformer
Shengju Qian, Hao Shao, Yi Zhu, Mu Li, Jiaya Jia

TL;DR
This paper identifies aliasing artifacts in vision transformers caused by patch-wise tokenization and introduces a plug-and-play Aliasing-Reduction Module (ARM) that improves performance, robustness, and data efficiency across multiple tasks.
Contribution
The paper presents a novel anti-aliasing module for vision transformers, addressing a previously uncharted problem and enhancing their performance and robustness.
Findings
ARM effectively reduces aliasing artifacts.
ARM improves accuracy and robustness across multiple vision transformer models.
ARM enhances data efficiency in vision transformer applications.
Abstract
Transformer architectures, based on the self-attention mechanism and a convolution-free design, have recently shown superior performance and found booming applications in computer vision. However, the discontinuous patch-wise tokenization process implicitly introduces jagged artifacts into attention maps, giving rise to the traditional problem of aliasing in vision transformers. Aliasing occurs when discrete patterns are used to produce high-frequency or continuous information, resulting in indistinguishable distortions. Recent research has found that modern convolutional networks still suffer from this phenomenon. In this work, we analyze the uncharted problem of aliasing in vision transformers and explore how to incorporate anti-aliasing properties. Specifically, we propose a plug-and-play Aliasing-Reduction Module (ARM) to alleviate the aforementioned issue. We investigate the effectiveness and…
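The aliasing effect described in the abstract can be illustrated with a generic one-dimensional signal-processing sketch. It is not the paper's ARM, and the helper names `subsample` and `blur` are hypothetical. The sketch assumes a signal that alternates at the highest representable frequency: strided sampling without filtering collapses it to a constant whose value depends on the sampling offset, while a small low-pass filter applied first makes the result independent of the offset.

```python
def subsample(signal, stride, offset=0):
    """Keep every `stride`-th sample starting at `offset`, with no filtering."""
    return signal[offset::stride]


def blur(signal, kernel=(0.25, 0.5, 0.25)):
    """Low-pass filter with a small binomial kernel and circular padding."""
    n = len(signal)
    half = len(kernel) // 2
    return [
        sum(w * signal[(i + j - half) % n] for j, w in enumerate(kernel))
        for i in range(n)
    ]


# A signal alternating 0, 1, 0, 1, ... (assumed test input, not from the paper).
signal = [float(i % 2) for i in range(16)]

# Naive strided sampling: the result flips entirely with a one-sample shift.
naive_a = subsample(signal, 2, offset=0)
naive_b = subsample(signal, 2, offset=1)

# Low-pass filtering before sampling: both offsets give the same output.
smooth_a = subsample(blur(signal), 2, offset=0)
smooth_b = subsample(blur(signal), 2, offset=1)

print(naive_a[:4], naive_b[:4])
print(smooth_a[:4], smooth_b[:4])
```

In this toy case the unfiltered samples disagree completely between the two offsets, whereas the filtered samples agree. Patch-wise tokenization in a vision transformer is likewise a strided sampling of the image, which is why a shift-sensitive distortion of this kind is plausible there.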
Taxonomy
Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Residual Connection · Vision Transformer · Convolution
