CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention

Wenxiao Wang; Wei Chen; Qibo Qiu; Long Chen; Boxi Wu; Binbin Lin; Xiaofei He; Wei Liu

arXiv:2303.06908·cs.CV·December 5, 2023·6 cites

TL;DR

CrossFormer++ introduces a versatile vision transformer that explicitly leverages multi-scale features through cross-scale embedding and long-short distance attention, improving performance across various vision tasks.

Contribution

The paper proposes CrossFormer++, a vision transformer that combines a cross-scale embedding layer (CEL), long-short distance attention (LSDA), and techniques addressing two self-attention issues observed in CrossFormer, with the aim of making better use of multi-scale features.

Findings

01. Outperforms existing vision transformers on multiple tasks

02. Effectively incorporates multi-scale features into self-attention

03. Reduces computational load while maintaining performance

Abstract

While features of different scales are perceptually important to visual inputs, existing vision transformers do not yet take advantage of them explicitly. To this end, we first propose a cross-scale vision transformer, CrossFormer. It introduces a cross-scale embedding layer (CEL) and a long-short distance attention (LSDA). On the one hand, CEL blends each token with multiple patches of different scales, providing the self-attention module itself with cross-scale features. On the other hand, LSDA splits the self-attention module into a short-distance one and a long-distance counterpart, which not only reduces the computational burden but also keeps both small-scale and large-scale features in the tokens. Moreover, through experiments on CrossFormer, we observe another two issues that affect vision transformers' performance, i.e., the enlarging self-attention maps and amplitude…
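The split that LSDA makes between short-distance and long-distance attention can be illustrated with a small sketch. The abstract does not say how tokens are grouped, so the rule used here is an assumption: short-distance groups collect adjacent tokens in `group_size` x `group_size` windows, and long-distance groups collect tokens sampled at a fixed `interval`. All function and parameter names are hypothetical and are not taken from the cheerss/CrossFormer repository. The sketch only counts which token pairs would attend to each other; it does not compute attention.

```python
# Illustrative sketch only. The grouping rules and every name below are
# assumptions made for this example, not the authors' implementation.

def short_distance_groups(height, width, group_size):
    """Partition an H x W token grid into windows of adjacent tokens."""
    groups = {}
    for r in range(height):
        for c in range(width):
            key = (r // group_size, c // group_size)
            groups.setdefault(key, []).append((r, c))
    return list(groups.values())


def long_distance_groups(height, width, interval):
    """Partition an H x W token grid into groups of tokens that lie
    `interval` steps apart along each axis."""
    groups = {}
    for r in range(height):
        for c in range(width):
            key = (r % interval, c % interval)
            groups.setdefault(key, []).append((r, c))
    return list(groups.values())


def attention_pairs(groups):
    """Count query-key pairs when attention is restricted to each group."""
    return sum(len(group) ** 2 for group in groups)


H = W = 8
sda = short_distance_groups(H, W, group_size=4)   # 4 windows of 16 tokens
lda = long_distance_groups(H, W, interval=2)      # 4 strided groups of 16 tokens
full_pairs = (H * W) ** 2                         # unrestricted self-attention

print("full attention pairs:", full_pairs)
print("short-distance pairs:", attention_pairs(sda))
print("long-distance pairs: ", attention_pairs(lda))
```

Under these assumptions, each grouped variant scores a quarter as many token pairs as unrestricted attention on an 8 x 8 grid, which is one way to read the abstract's claim that the split reduces the computational burden. The short-distance groups only see neighbours, while the long-distance groups span the whole grid.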

Code & Models

Repositories

cheerss/CrossFormer (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · CCD and CMOS Imaging Sensors