ConTNet: Why not use convolution and transformer at the same time?
Haotian Yan, Zhe Li, Weijian Li, Changhu Wang, Ming Wu, Chuang Zhang

TL;DR
ConTNet combines convolutional networks with transformers to capture global information efficiently, achieving high accuracy and robustness in image classification and object detection at lower computational cost.
Contribution
This work introduces ConTNet, a novel architecture that integrates convolution and transformer components, offering improved performance and robustness over existing models like ViT and ResNet.
Findings
ConTNet achieves 81.8% top-1 accuracy on ImageNet.
ConTNet outperforms ResNet50 as a backbone in object detection tasks.
ConTNet has less than 40% of the computational complexity of DeiT-B.
Abstract
Although convolutional networks (ConvNets) have enjoyed great success in computer vision (CV), they struggle to capture the global information crucial to dense prediction tasks such as object detection and segmentation. In this work, we propose ConTNet (Convolution-Transformer Network), which combines transformers with ConvNet architectures to provide large receptive fields. Unlike recently proposed transformer-based models (e.g., ViT, DeiT), which are sensitive to hyper-parameters and depend heavily on extensive data augmentation when trained from scratch on a midsize dataset (e.g., ImageNet1k), ConTNet can be optimized like normal ConvNets (e.g., ResNet) and retains outstanding robustness. It is also worth pointing out that, given identical strong data augmentations, the performance improvement of ConTNet is more remarkable than that of ResNet. We present its superiority…
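The receptive-field argument in the abstract can be made concrete with a small calculation. The sketch below is not taken from the paper: it applies the standard receptive-field recurrence for stacked convolution and pooling layers, and the function names `receptive_field` and `layers_to_cover` are hypothetical helpers introduced only for illustration. It assumes no dilation and considers a single spatial axis.

```python
def receptive_field(layers):
    """Receptive field, in input pixels along one axis, of a stack of layers.

    `layers` is a list of (kernel_size, stride) pairs applied in order.
    Illustrative helper, not part of ConTNet; assumes no dilation.
    """
    rf = 1    # before any layer, a unit sees exactly one input pixel
    jump = 1  # spacing, in input pixels, between adjacent units at this depth
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


def layers_to_cover(input_size, kernel_size=3):
    """Count the stride-1 convolutions needed before one unit sees the full axis."""
    depth = 0
    while receptive_field([(kernel_size, 1)] * depth) < input_size:
        depth += 1
    return depth


if __name__ == "__main__":
    # Each stride-1 3x3 convolution widens the receptive field by only 2 pixels.
    print(receptive_field([(3, 1)] * 4))
    # Depth needed for a purely local stack to span a 224-pixel axis.
    print(layers_to_cover(224))
```

A self-attention layer, by contrast, relates every position to every other position in a single step, which is the sense in which adding transformer components can provide large receptive fields without a very deep stack of local filters.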
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
Methods: Residual Connection · Max Pooling · Convolution · 1x1 Convolution · Average Pooling · Residual Block · Batch Normalization · Kaiming Initialization · Bottleneck Residual Block
