CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention

Wenxiao Wang; Lu Yao; Long Chen; Binbin Lin; Deng Cai; Xiaofei He; Wei Liu

arXiv:2108.00154·cs.CV·October 11, 2021·85 cites

TL;DR

CrossFormer introduces cross-scale attention mechanisms and a dynamic position bias to enhance vision transformers' ability to process multi-scale features, leading to improved performance across various vision tasks.

Contribution

The paper proposes the Cross-scale Embedding Layer (CEL) and Long Short Distance Attention (LSDA) to enable cross-scale feature interactions in vision transformers.
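
The source does not spell out how LSDA chooses which embeddings attend to each other. As a rough sketch under a stated assumption, the Python below treats "short distance" as grouping adjacent tokens into small windows and "long distance" as grouping tokens sampled at a fixed interval. The function names and the `group` and `interval` parameters are hypothetical and are not taken from the paper's code.

```python
def short_distance_groups(size, group):
    """Partition a size x size token grid into windows of adjacent tokens.

    Assumption: 'short distance' means each group holds group x group neighbours.
    """
    groups = {}
    for r in range(size):
        for c in range(size):
            # Tokens sharing the same window index land in the same group.
            groups.setdefault((r // group, c // group), []).append((r, c))
    return list(groups.values())


def long_distance_groups(size, interval):
    """Partition the same grid by sampling tokens at a fixed interval.

    Assumption: 'long distance' means each group holds tokens that are
    `interval` positions apart along both axes.
    """
    groups = {}
    for r in range(size):
        for c in range(size):
            # Tokens with equal remainders are spread across the whole grid.
            groups.setdefault((r % interval, c % interval), []).append((r, c))
    return list(groups.values())


short = short_distance_groups(4, 2)
long_ = long_distance_groups(4, 2)
```

Under this reading, attention would be computed only inside each group, so the cost depends on the group size rather than on the full token count.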

Findings

01. Outperforms existing vision transformers on image classification.

02. Achieves superior results in object detection and segmentation tasks.

03. Demonstrates versatility with variable-sized input handling.

Abstract

Transformers have made great progress in dealing with computer vision tasks. However, existing vision transformers do not yet possess the ability of building the interactions among features of different scales, which is perceptually important to visual inputs. The reasons are two-fold: (1) Input embeddings of each layer are equal-scale, so no cross-scale feature can be extracted; (2) to lower the computational cost, some vision transformers merge adjacent embeddings inside the self-attention module, thus sacrificing small-scale (fine-grained) features of the embeddings and also disabling the cross-scale interactions. To this end, we propose Cross-scale Embedding Layer (CEL) and Long Short Distance Attention (LSDA). On the one hand, CEL blends each embedding with multiple patches of different scales, providing the self-attention module itself with cross-scale features. On the other hand,…
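
To make the CEL idea concrete, here is a minimal sketch, assuming a single-channel image stored as a nested list and using a plain mean over each patch as a stand-in for the learned projection a real model would apply. The names `patch_mean` and `cross_scale_embed`, the patch sizes, and the stride are illustrative assumptions, not the authors' implementation.

```python
def patch_mean(image, top, left, size):
    """Mean of a size x size patch; pixels outside the image count as zero."""
    h, w = len(image), len(image[0])
    total = 0.0
    for r in range(top, top + size):
        for c in range(left, left + size):
            if 0 <= r < h and 0 <= c < w:
                total += image[r][c]
    return total / (size * size)


def cross_scale_embed(image, patch_sizes=(2, 4), stride=2):
    """Build one token per stride step, with one feature per patch scale.

    Every scale is sampled around the same location, so each token mixes
    fine and coarse context. Assumes height and width are multiples of stride.
    """
    h, w = len(image), len(image[0])
    tokens = []
    for top in range(0, h, stride):
        row = []
        for left in range(0, w, stride):
            features = []
            for size in patch_sizes:
                # Shift larger patches so they stay centred on the same cell.
                offset = (size - stride) // 2
                features.append(patch_mean(image, top - offset, left - offset, size))
            row.append(features)
        tokens.append(row)
    return tokens


# Toy 4 x 4 image whose pixel values are 0..15 in row-major order.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
tokens = cross_scale_embed(image)
```

Here a 4 x 4 image becomes a 2 x 2 grid of tokens, each carrying one feature per patch scale. A trained model would replace the mean with a learned projection per scale and concatenate the resulting vectors.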

Code & Models

Models

birder-project/crossformer_s_arabian-peninsula (Hugging Face model, 20 downloads)

Videos

CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention (SlidesLive)

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection