CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
Chun-Fu Chen, Quanfu Fan, Rameswar Panda

TL;DR
CrossViT is a multi-scale vision transformer that processes image patches of two different sizes in separate branches and fuses them with cross-attention, improving image classification accuracy over existing models while keeping the fusion step efficient.
Contribution
The paper proposes a dual-branch transformer with a novel cross-attention token fusion module for multi-scale feature learning in image classification.
Findings
- Outperforms DeiT by 2% on ImageNet1K
- Achieves results better than or comparable to concurrent vision transformer models
- Uses linear-time cross-attention for efficient multi-scale feature fusion
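The linear-time claim can be read off a rough operation count. The block below is a sketch under assumed notation that does not appear in this summary: N patch tokens in a branch, one extra classification token, and embedding width d. The figure N = 196 is only an illustrative value (a 14 × 14 patch grid), not a number taken from the page.

```latex
% Assumed notation: N patch tokens, 1 classification token, embedding width d.
\begin{align*}
\text{self-attention scores: } & QK^\top,\quad Q, K \in \mathbb{R}^{(N+1)\times d}
  && \Rightarrow\ (N+1)^2 \text{ dot products},\ O(N^2 d) \\
\text{single-query cross-attention: } & qK^\top,\quad q \in \mathbb{R}^{1\times d},\ K \in \mathbb{R}^{(N+1)\times d}
  && \Rightarrow\ (N+1) \text{ dot products},\ O(N d)
\end{align*}
% Illustrative only: with N = 196, full self-attention forms 197^2 = 38809 scores,
% while a single query forms 197.
```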
Abstract
The recently developed vision transformer (ViT) has achieved promising results on image classification compared to convolutional neural networks. Inspired by this, in this paper, we study how to learn multi-scale feature representations in transformer models for image classification. To this end, we propose a dual-branch transformer to combine image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features. Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity and these tokens are then fused purely by attention multiple times to complement each other. Furthermore, to reduce computation, we develop a simple yet effective token fusion module based on cross attention, which uses a single token for each branch as a query to exchange information with other branches. Our proposed…
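The fusion step described in the abstract, where one token per branch acts as the query, can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: the function name `cls_cross_attention` and the toy vectors are hypothetical, both branches are assumed to share one embedding width, and learned query/key/value projections, multiple heads, and layer normalization are all omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cls_cross_attention(cls_token, other_patch_tokens):
    """Fuse one branch's classification token with the other branch's patches.

    Assumption: the single query is the classification token itself, and the
    keys/values are that token followed by the other branch's patch tokens.
    Only one row of attention scores is computed, so the cost grows linearly
    with the number of patch tokens.
    """
    tokens = [cls_token] + other_patch_tokens
    d = len(cls_token)
    # One score per key token: len(tokens) dot products in total.
    scores = [dot(cls_token, key) / math.sqrt(d) for key in tokens]
    weights = softmax(scores)
    # Weighted sum of the value vectors (here the raw tokens, no projection).
    attended = [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(d)]
    # Residual connection back onto the query token.
    fused = [c + a for c, a in zip(cls_token, attended)]
    return fused, weights


# Toy example: a 2-dimensional classification token from the small-patch
# branch attends over three patch tokens from the large-patch branch.
small_cls = [1.0, 0.0]
large_patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused, weights = cls_cross_attention(small_cls, large_patches)
```

In a dual-branch setup the same call would be made a second time with the roles swapped, so that each branch's classification token picks up information from the other branch's patches.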
Code & Models
- 🤗 timm/crossvit_9_240.in1k · model · 1.5k dl · ♡ 2
- 🤗 timm/crossvit_9_dagger_240.in1k · model · 50 dl
- 🤗 timm/crossvit_15_240.in1k · model · 288 dl
- 🤗 timm/crossvit_15_dagger_240.in1k · model · 101 dl
- 🤗 timm/crossvit_15_dagger_408.in1k · model · 40 dl
- 🤗 timm/crossvit_18_240.in1k · model · 103 dl
- 🤗 timm/crossvit_18_dagger_240.in1k · model · 59 dl
- 🤗 timm/crossvit_18_dagger_408.in1k · model · 37 dl
- 🤗 timm/crossvit_base_240.in1k · model · 181 dl
- 🤗 timm/crossvit_small_240.in1k · model · 252 dl
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · Residual Connection · Concatenated Skip Connection · Layer Normalization · CrossViT · Softmax · Dense Connections · Attention Is All You Need
