CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Chun-Fu Chen; Quanfu Fan; Rameswar Panda

arXiv:2103.14899 · cs.CV · August 24, 2021 · 23 cites

TL;DR

CrossViT is a multi-scale vision transformer that combines tokens from different patch sizes through cross-attention, efficiently improving image classification accuracy over existing models.
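
To make "different patch sizes" concrete, the following is a minimal sketch of how one image yields two token sequences of different length. The image size (224) and the patch sizes (8 and 16) are illustrative assumptions, not values taken from this page, and `patchify` is a hypothetical helper rather than the authors' code.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (num_patches, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

# Assumed sizes, chosen only for illustration.
image = np.random.default_rng(0).random((224, 224, 3))
small_tokens = patchify(image, 8)    # fine-grained branch: many short tokens
large_tokens = patchify(image, 16)   # coarse branch: fewer, longer tokens

print(small_tokens.shape, large_tokens.shape)
```

Under these assumptions the small-patch branch sees four times as many tokens as the large-patch branch, which is why the two branches differ in computational cost.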

Contribution

The paper proposes a dual-branch transformer with a novel cross-attention token fusion module for multi-scale feature learning in image classification.

Findings

01 Outperforms DeiT by 2% on ImageNet1K

02 Achieves better or comparable results to concurrent vision transformer models

03 Uses linear time cross-attention for efficient multi-scale feature fusion

Abstract

The recently developed vision transformer (ViT) has achieved promising results on image classification compared to convolutional neural networks. Inspired by this, in this paper, we study how to learn multi-scale feature representations in transformer models for image classification. To this end, we propose a dual-branch transformer to combine image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features. Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity and these tokens are then fused purely by attention multiple times to complement each other. Furthermore, to reduce computation, we develop a simple yet effective token fusion module based on cross attention, which uses a single token for each branch as a query to exchange information with other branches. Our proposed…
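
The fusion step described in the abstract, where a single token from one branch acts as the query against the other branch, can be sketched roughly as follows. This is a simplified, single-head illustration under several assumptions that are not stated on this page: both branches share one embedding dimension `d`, the keys and values come only from the other branch's patch tokens, the result is added back to the query token as a residual, and the weight matrices are random placeholders. `cls_cross_attention` is a hypothetical name.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def cls_cross_attention(query_token, other_tokens, wq, wk, wv):
    """Fuse one branch's single token (d,) with the other branch's tokens (n, d)."""
    q = query_token @ wq                    # (d,)   one query only
    k = other_tokens @ wk                   # (n, d)
    v = other_tokens @ wv                   # (n, d)
    scores = k @ q / np.sqrt(q.shape[-1])   # (n,)   n scores, not n * n
    weights = softmax(scores)               # attention over the other branch
    fused = query_token + weights @ v       # residual add (an assumption)
    return fused, weights

rng = np.random.default_rng(0)
d, n = 8, 5                                 # assumed toy sizes
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
query_token = rng.standard_normal(d)        # e.g. one branch's summary token
other_tokens = rng.standard_normal((n, d))  # the other branch's patch tokens

fused, weights = cls_cross_attention(query_token, other_tokens, wq, wk, wv)
```

Because there is only one query, the score vector has `n` entries rather than the `n * n` entries of full self-attention, so the cost of this step grows linearly with the number of tokens in the other branch.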

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Residual Connection · Concatenated Skip Connection · Layer Normalization · CrossViT · Softmax · Dense Connections · Attention Is All You Need