CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention
Wenxiao Wang, Lu Yao, Long Chen, Binbin Lin, Deng Cai, Xiaofei He, and Wei Liu

TL;DR
CrossFormer introduces cross-scale attention mechanisms and a dynamic position bias to enhance vision transformers' ability to process multi-scale features, leading to improved performance across various vision tasks.
Contribution
It proposes the Cross-scale Embedding Layer and Long Short Distance Attention to enable effective cross-scale feature interactions in vision transformers.
Findings
Outperforms existing vision transformers on image classification.
Achieves superior results in object detection and segmentation tasks.
Demonstrates versatility with variable-sized input handling.
Abstract
Transformers have made great progress in dealing with computer vision tasks. However, existing vision transformers do not yet possess the ability of building the interactions among features of different scales, which is perceptually important to visual inputs. The reasons are two-fold: (1) Input embeddings of each layer are equal-scale, so no cross-scale feature can be extracted; (2) to lower the computational cost, some vision transformers merge adjacent embeddings inside the self-attention module, thus sacrificing small-scale (fine-grained) features of the embeddings and also disabling the cross-scale interactions. To this end, we propose Cross-scale Embedding Layer (CEL) and Long Short Distance Attention (LSDA). On the one hand, CEL blends each embedding with multiple patches of different scales, providing the self-attention module itself with cross-scale features. On the other hand,…
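The idea behind CEL, blending each embedding with patches of several scales, can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function name `cross_scale_embed`, the kernel sizes, the stride, the per-scale output dimensions, and the random (untrained) projection weights are all hypothetical choices made for the example. Each output position samples patches of different sizes centered on the same location, projects each patch to a small vector, and concatenates the results into one embedding.

```python
import numpy as np


def cross_scale_embed(image, kernel_sizes=(4, 8), stride=4, dims=(6, 2), seed=0):
    """Toy cross-scale embedding (hypothetical sketch, not the paper's code).

    image: array of shape (H, W, C), with H and W divisible by `stride`.
    Each kernel size k must satisfy k >= stride with (k - stride) even,
    so that patches of every scale share the same center.
    Returns an array of shape (H // stride, W // stride, sum(dims)).
    """
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    out_h, out_w = h // stride, w // stride
    parts = []
    for k, d in zip(kernel_sizes, dims):
        # Zero-pad so a k x k patch is centered on each stride x stride cell.
        pad = (k - stride) // 2
        padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
        # Random projection stands in for a learned convolution kernel.
        weight = rng.standard_normal((k * k * c, d)) / np.sqrt(k * k * c)
        feat = np.empty((out_h, out_w, d))
        for i in range(out_h):
            for j in range(out_w):
                top, left = i * stride, j * stride
                patch = padded[top:top + k, left:left + k, :]
                feat[i, j] = patch.reshape(-1) @ weight
        parts.append(feat)
    # Concatenate the per-scale projections along the channel axis.
    return np.concatenate(parts, axis=-1)


image = np.arange(16 * 16 * 3, dtype=float).reshape(16, 16, 3)
embedding = cross_scale_embed(image)
print(embedding.shape)  # (4, 4, 8)
```

In this sketch the larger kernel is given fewer output channels than the smaller one, an assumed design choice that keeps the cost of the large patches low; the split between scales is a free parameter here.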
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection
