Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets
Zhiying Lu, Hongtao Xie, Chuanbin Liu, Yongdong Zhang

TL;DR
This paper introduces a Dynamic Hybrid Vision Transformer (DHVT) that integrates convolutional structures and dynamic channel aggregation to improve ViT performance on small datasets, closing the gap with CNNs.
Contribution
The paper proposes DHVT, a novel hybrid model that enhances spatial relevance and channel diversity in ViTs, achieving state-of-the-art results on small datasets.
Findings
DHVT achieves 85.68% accuracy on CIFAR-100 with 22.8M parameters.
DHVT attains 82.3% accuracy on ImageNet-1K with 24.0M parameters.
DHVT outperforms previous ViT models on small datasets.
Abstract
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets, which is attributed to the lack of inductive bias. In this paper, we further consider this problem and point out two weaknesses of ViTs in inductive biases: spatial relevance and diverse channel representation. First, on the spatial aspect, objects are locally compact and relevant, so fine-grained features need to be extracted from a token and its neighbors, yet the lack of data hinders ViTs from attending to this spatial relevance. Second, on the channel aspect, representations exhibit diversity across channels, but scarce data cannot enable ViTs to learn representations strong enough for accurate recognition. To this end, we propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two…
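The two inductive biases named in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' DHVT implementation: the function names (`tokens_to_grid`, `depthwise_conv3x3`, `channel_gate`) are hypothetical, the 3x3 depth-wise convolution is only one common way to mix a token with its spatial neighbors, and the squeeze-and-excitation-style gate is a swapped-in stand-in for channel re-weighting, since this summary does not describe the paper's actual modules.

```python
import math


def tokens_to_grid(tokens, height, width):
    """Reshape a flat list of H*W token vectors into an H x W grid."""
    if len(tokens) != height * width:
        raise ValueError("token count must equal height * width")
    return [tokens[r * width:(r + 1) * width] for r in range(height)]


def depthwise_conv3x3(grid, kernels):
    """Mix each token with its 8 neighbors, one 3x3 kernel per channel.

    grid: H x W x C nested lists; kernels: C kernels, each 3x3.
    Positions outside the grid are treated as zeros (zero padding).
    """
    height, width, channels = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for r in range(height):
        for c in range(width):
            for ch in range(channels):
                acc = 0.0
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < height and 0 <= cc < width:
                            acc += kernels[ch][dr + 1][dc + 1] * grid[rr][cc][ch]
                out[r][c][ch] = acc
    return out


def channel_gate(grid, scale, shift):
    """Rescale each channel by sigmoid(scale * channel_mean + shift).

    A simplified, per-channel stand-in for squeeze-and-excitation gating
    (an assumption for illustration, not the paper's aggregation module).
    """
    height, width, channels = len(grid), len(grid[0]), len(grid[0][0])
    count = height * width
    means = [
        sum(grid[r][c][ch] for r in range(height) for c in range(width)) / count
        for ch in range(channels)
    ]
    gates = [
        1.0 / (1.0 + math.exp(-(scale[ch] * means[ch] + shift[ch])))
        for ch in range(channels)
    ]
    return [
        [[grid[r][c][ch] * gates[ch] for ch in range(channels)] for c in range(width)]
        for r in range(height)
    ]


# Four tokens with two channels each, laid out as a 2 x 2 patch grid.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
grid = tokens_to_grid(tokens, height=2, width=2)

identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # leaves channel 0 unchanged
box = [[1 / 9] * 3 for _ in range(3)]          # averages channel 1 over neighbors
mixed = depthwise_conv3x3(grid, [identity, box])
gated = channel_gate(mixed, scale=[0.0, 0.0], shift=[0.0, 0.0])
```

In a real model the kernels and gate parameters would be learned, and the loops would be replaced by a framework's grouped convolution. The point here is only that the convolution gives every token access to its neighbors regardless of how much data is available, while the gate lets the channels be weighted differently.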
Taxonomy
Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Infrared Target Detection Methodologies
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Position-Wise Feed-Forward Layer · Residual Connection · Dropout · Adam
