Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention

Sitong Wu; Tianyi Wu; Haoru Tan; Guodong Guo

arXiv:2112.14000·cs.CV·December 30, 2021

TL;DR

Pale Transformer introduces a novel pale-shaped self-attention mechanism that balances efficiency and context modeling, leading to a versatile vision transformer backbone with superior accuracy on multiple vision tasks.

Contribution

The paper proposes the Pale-Shaped self-Attention (PS-Attention) and develops a hierarchical Pale Transformer backbone that outperforms existing models in accuracy and efficiency.

Findings

01. Achieves over 83% Top-1 accuracy on ImageNet-1K with 22M parameters.

02. Outperforms state-of-the-art on ADE20K semantic segmentation.

03. Excels in COCO object detection and instance segmentation tasks.

Abstract

Recently, Transformers have shown promising performance in various vision tasks. To reduce the quadratic computational complexity of global self-attention, various methods constrain the range of attention to a local region to improve efficiency. Consequently, their receptive fields in a single attention layer are not large enough, resulting in insufficient context modeling. To address this issue, we propose a Pale-Shaped self-Attention (PS-Attention), which performs self-attention within a pale-shaped region. Compared to global self-attention, PS-Attention significantly reduces computation and memory costs. Meanwhile, it captures richer contextual information at a computational complexity similar to that of previous local self-attention mechanisms. Based on the PS-Attention, we develop a general Vision Transformer backbone with a hierarchical architecture,…
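
To make the idea of attention "within a pale-shaped region" concrete, here is a minimal NumPy sketch. It rests on assumptions that the abstract above does not spell out: a pale is taken to be a set of evenly spaced (interlaced) rows plus evenly spaced columns of the feature map, and the names `pale_mask`, `region_attention`, `s_r`, `s_c`, and `pale_index` are hypothetical. The attention uses raw features as queries, keys, and values with a single head, so it illustrates the restricted receptive field only and is not the authors' implementation.

```python
import numpy as np


def pale_mask(height, width, s_r, s_c, pale_index):
    """Boolean mask of one pale: s_r interlaced rows plus s_c interlaced columns.

    Assumption (not stated in the abstract): the rows and columns of a pale are
    evenly spaced, so the grid splits into height // s_r == width // s_c pales.
    """
    n_pales = height // s_r
    if height % s_r or width % s_c or width // s_c != n_pales:
        raise ValueError("grid must split into equal numbers of row and column groups")
    if not 0 <= pale_index < n_pales:
        raise ValueError("pale_index out of range")
    mask = np.zeros((height, width), dtype=bool)
    mask[pale_index::n_pales, :] = True  # the pale's rows
    mask[:, pale_index::n_pales] = True  # the pale's columns
    return mask


def region_attention(x, mask):
    """Single-head softmax attention restricted to tokens where mask is True.

    x has shape (height, width, channels). Queries, keys, and values are the
    raw features (no learned projections), purely for illustration.
    """
    tokens = x[mask]  # (n_tokens_in_region, channels)
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = x.copy()
    out[mask] = weights @ tokens  # only tokens inside the region are updated
    return out
```

Under these assumptions, on an 8×8 feature map with `s_r = s_c = 2`, one pale covers 28 of the 64 tokens, so its attention matrix has 28² = 784 entries instead of the 64² = 4096 needed for global self-attention, while each token in the pale still sees full rows and columns that span the whole map.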

Taxonomy

Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Visual Attention and Saliency Detection

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Absolute Position Encodings · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Dense Connections