CMT: Convolutional Neural Networks Meet Vision Transformers
Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Xinghao Chen, Yunhe Wang, and Chang Xu

TL;DR
This paper introduces CMT, a hybrid neural network combining CNNs and vision transformers, achieving superior accuracy and efficiency on image recognition tasks compared to previous models.
Contribution
The paper proposes a novel hybrid architecture that leverages transformers for long-range dependencies and CNNs for local features, resulting in a family of models with improved performance and efficiency.
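To make the division of labor concrete, the following is a minimal pure-Python sketch of a generic hybrid block: a depthwise 3x3 convolution mixes each pixel with its neighbors (local features), then self-attention lets every position attend to every other position (long-range dependencies). It is not the authors' actual CMT block. The function names, the single attention head, the identity Q/K/V projections, and the residual placement are all assumptions chosen for brevity.

```python
import math

def depthwise_conv3x3(fmap, kernels):
    """Local mixing: each channel is filtered by its own 3x3 kernel (zero padding)."""
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for c in range(C):
        for i in range(H):
            for j in range(W):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        y, x = i + di, j + dj
                        if 0 <= y < H and 0 <= x < W:
                            acc += kernels[c][di + 1][dj + 1] * fmap[c][y][x]
                out[c][i][j] = acc
    return out

def self_attention(tokens):
    """Global mixing: every token attends to every token.
    Assumption: one head and identity Q/K/V projections (no learned weights)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[ch] for w, v in zip(weights, tokens))
                    for ch in range(d)])
    return out

def hybrid_block(fmap, kernels):
    """Hypothetical hybrid block: conv branch, then attention branch, each with a residual."""
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    # 1) Local branch: depthwise convolution plus residual connection.
    local = depthwise_conv3x3(fmap, kernels)
    x = [[[fmap[c][i][j] + local[c][i][j] for j in range(W)]
          for i in range(H)] for c in range(C)]
    # 2) Flatten the C x H x W map into H*W tokens of dimension C.
    tokens = [[x[c][i][j] for c in range(C)] for i in range(H) for j in range(W)]
    # 3) Global branch: self-attention plus residual connection.
    attended = self_attention(tokens)
    tokens = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
    # 4) Reshape the tokens back into a C x H x W feature map.
    return [[[tokens[i * W + j][c] for j in range(W)]
             for i in range(H)] for c in range(C)]
```

Calling `hybrid_block(fmap, kernels)` with a `C x H x W` nested list and one 3x3 kernel per channel returns a feature map of the same shape, so blocks of this kind can be stacked.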
Findings
CMT-S achieves 83.5% top-1 accuracy on ImageNet.
CMT-S requires 14x and 2x fewer FLOPs than DeiT and EfficientNet, respectively.
CMT models generalize well across multiple datasets.
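FLOPs comparisons like the one above rest on counting multiply-accumulate operations (MACs) per layer. The sketch below shows the standard counting formulas for a regular convolution, a depthwise separable convolution, and a self-attention layer. The function names and any layer sizes passed to them are hypothetical, not CMT-S's actual configuration, and bias terms, normalization, and softmax costs are ignored.

```python
def conv_macs(h, w, c_in, c_out, k):
    """MACs of a standard k x k convolution (stride 1, 'same' padding)."""
    return h * w * c_in * c_out * k * k

def depthwise_separable_macs(h, w, c_in, c_out, k):
    """MACs of a depthwise k x k convolution followed by a 1x1 pointwise convolution."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 convolution mixes channels
    return depthwise + pointwise

def attention_macs(n_tokens, dim):
    """MACs of one self-attention layer, summed over heads."""
    projections = 4 * n_tokens * dim * dim   # Q, K, V and output projections
    mixing = 2 * n_tokens * n_tokens * dim   # Q @ K^T and weights @ V
    return projections + mixing

if __name__ == "__main__":
    # Hypothetical layer: 56 x 56 feature map, 64 -> 64 channels, 3 x 3 kernel.
    standard = conv_macs(56, 56, 64, 64, 3)
    separable = depthwise_separable_macs(56, 56, 64, 64, 3)
    print(f"standard conv:  {standard:,} MACs")
    print(f"separable conv: {separable:,} MACs")
    print(f"ratio:          {standard / separable:.2f}x")
    # Attention cost grows quadratically with the number of tokens.
    print(f"attention over 56*56 tokens: {attention_macs(56 * 56, 64):,} MACs")
```

The quadratic `n_tokens * n_tokens` term is why attention over high-resolution feature maps is expensive, and why efficiency claims for hybrid models are usually reported in FLOPs alongside accuracy.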
Abstract
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image. However, there are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs). In this paper, we aim to address this issue and develop a network that can outperform not only the canonical transformers, but also the high-performance convolutional models. We propose a new transformer based hybrid network by taking advantage of transformers to capture long-range dependencies, and of CNNs to model local features. Furthermore, we scale it to obtain a family of models, called CMTs, obtaining much better accuracy and efficiency than previous convolution and transformer based models. In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller…
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Brain Tumor Detection and Classification
Methods: Attention Is All You Need · Linear Layer · Pointwise Convolution · Depthwise Convolution · Softmax · Sigmoid Activation · Feedforward Network · Multi-Head Attention · Attention Dropout · Depthwise Separable Convolution
