Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets

Zhiying Lu; Hongtao Xie; Chuanbin Liu; Yongdong Zhang

arXiv:2210.05958 · cs.CV · January 2, 2023 · 39 cites

TL;DR

This paper introduces a Dynamic Hybrid Vision Transformer (DHVT) that integrates convolutional structures and dynamic channel aggregation to improve ViT performance on small datasets, closing the gap with CNNs.

Contribution

The paper proposes DHVT, a novel hybrid model that enhances spatial relevance and channel diversity in ViTs, achieving state-of-the-art results on small datasets.

Findings

1. DHVT achieves 85.68% accuracy on CIFAR-100 with 22.8M parameters.

2. DHVT attains 82.3% accuracy on ImageNet-1K with 24.0M parameters.

3. DHVT outperforms previous ViT models on small datasets.

Abstract

There remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets, which is attributed to the lack of inductive bias. In this paper, we examine this problem further and point out two weaknesses of ViTs in inductive biases: spatial relevance and diverse channel representation. First, on the spatial aspect, objects are locally compact and relevant, so fine-grained features need to be extracted from a token and its neighbors, yet the lack of data hinders ViTs from attending to this spatial relevance. Second, on the channel aspect, representations exhibit diversity across channels, but scarce data cannot enable ViTs to learn representations strong enough for accurate recognition. To this end, we propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two…
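The two inductive biases named in the abstract can be illustrated with a toy sketch. This is not DHVT's actual design, which the truncated abstract does not describe; the function names `local_mix` and `channel_gate` are hypothetical. A fixed 3x3 neighborhood average stands in for a learned convolution over a token and its neighbors, and a sigmoid gate on per-channel means stands in for learned channel re-weighting.

```python
import math


def local_mix(tokens, height, width):
    """Average every token with its in-bounds 3x3 grid neighbours."""
    dim = len(tokens[0])
    mixed = []
    for row in range(height):
        for col in range(width):
            acc = [0.0] * dim
            count = 0
            for d_row in (-1, 0, 1):
                for d_col in (-1, 0, 1):
                    r, c = row + d_row, col + d_col
                    if 0 <= r < height and 0 <= c < width:
                        neighbour = tokens[r * width + c]  # row-major token order
                        acc = [a + v for a, v in zip(acc, neighbour)]
                        count += 1
            mixed.append([a / count for a in acc])
    return mixed


def channel_gate(tokens):
    """Scale each channel by the sigmoid of its mean over all tokens."""
    n_tokens, dim = len(tokens), len(tokens[0])
    means = [sum(t[k] for t in tokens) / n_tokens for k in range(dim)]
    gates = [1.0 / (1.0 + math.exp(-m)) for m in means]
    return [[v * g for v, g in zip(t, gates)] for t in tokens]


# Toy input (an assumption for illustration): a 3x3 grid of 2-channel tokens
# in which only the centre token is active, and only in channel 0.
tokens = [[0.0, 0.0] for _ in range(9)]
tokens[4] = [1.0, 0.0]

mixed = local_mix(tokens, height=3, width=3)  # spatial step: share with neighbours
gated = channel_gate(mixed)                   # channel step: re-weight channels
```

After `local_mix`, the centre token's activity has spread to all nine positions (each corner token holds 0.25 in channel 0), which is the kind of neighbor coupling that plain self-attention has to learn from data. `channel_gate` then scales channel 0 by a gate strictly between 0 and 1 and leaves the silent channel 1 at zero.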

Code & Models

Repositories

arieseirack/dhvt (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Infrared Target Detection Methodologies

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Position-Wise Feed-Forward Layer · Residual Connection · Dropout · Adam