Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention
Sitong Wu, Tianyi Wu, Haoru Tan, Guodong Guo

TL;DR
Pale Transformer introduces a novel pale-shaped self-attention mechanism that balances efficiency and context modeling, leading to a versatile vision transformer backbone with superior accuracy on multiple vision tasks.
Contribution
The paper proposes the Pale-Shaped self-Attention (PS-Attention) and develops a hierarchical Pale Transformer backbone that outperforms existing models in accuracy and efficiency.
Findings
The 22M-parameter model achieves 83.4% Top-1 accuracy on ImageNet-1K.
Outperforms the recent state-of-the-art CSWin Transformer on ADE20K semantic segmentation.
Also outperforms it on COCO object detection and instance segmentation.
Abstract
Recently, Transformers have shown promising performance in various vision tasks. To reduce the quadratic computation complexity caused by the global self-attention, various methods constrain the range of attention within a local region to improve its efficiency. Consequently, their receptive fields in a single attention layer are not large enough, resulting in insufficient context modeling. To address this issue, we propose a Pale-Shaped self-Attention (PS-Attention), which performs self-attention within a pale-shaped region. Compared to the global self-attention, PS-Attention can reduce the computation and memory costs significantly. Meanwhile, it can capture richer contextual information under the similar computation complexity with previous local self-attention mechanisms. Based on the PS-Attention, we develop a general Vision Transformer backbone with a hierarchical architecture,…
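To make the cost argument in the abstract concrete, here is a minimal sketch of what one pale-shaped region might look like on a feature map. It assumes the definition given in the full paper (a pale is a set of evenly interlaced rows plus evenly interlaced columns), which is not spelled out in the excerpt above. The function name `pale_tokens`, its parameters, and the 56×56 grid with 7 rows and 7 columns are illustrative choices, not the authors' code or settings.

```python
def pale_tokens(h, w, s_r, s_c, row_offset=0, col_offset=0):
    """Return the set of (row, col) positions covered by one pale.

    Assumption: a pale is s_r evenly spaced rows plus s_c evenly spaced
    columns of an h x w token grid (hypothetical helper, for illustration).
    """
    assert h % s_r == 0 and w % s_c == 0, "grid must divide evenly"
    rows = [row_offset + k * (h // s_r) for k in range(s_r)]
    cols = [col_offset + k * (w // s_c) for k in range(s_c)]
    row_part = {(r, c) for r in rows for c in range(w)}
    col_part = {(r, c) for r in range(h) for c in cols}
    return row_part | col_part  # intersections are counted once


h, w, s_r, s_c = 56, 56, 7, 7  # illustrative sizes only
pale = pale_tokens(h, w, s_r, s_c)

# Tokens a query attends to inside its pale vs. under global attention.
tokens_in_pale = len(pale)
tokens_global = h * w

# Closed form: rows + columns minus the doubly counted intersections.
closed_form = s_r * w + s_c * h - s_r * s_c

print(tokens_in_pale, tokens_global, closed_form)
```

In this toy setting each query attends to 735 tokens instead of all 3136, yet the pale still spans the full height and width of the grid, which is the sense in which the region is cheaper than global attention while reaching further than a local window.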
Taxonomy
Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Absolute Position Encodings · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Dense Connections
