Convolutional Embedding Makes Hierarchical Vision Transformer Stronger
Cong Wang, Hongmin Xu, Xiong Zhang, Li Wang, Zhitong Zheng, and Haifeng Liu

TL;DR
This paper demonstrates that integrating convolutional embedding layers into hierarchical Vision Transformers significantly enhances their performance across multiple vision tasks by providing beneficial inductive biases.
Contribution
The study systematically explores how macro architecture and convolutional embedding improve ViT performance, introducing CETNets as efficient hybrid CNN/ViT backbones.
Findings
CETNets achieve 84.9% Top-1 accuracy on ImageNet-1K
Effective boosting of state-of-the-art ViTs with convolutional embedding
Improved performance on COCO and ADE20K benchmarks
Abstract
Vision Transformers (ViTs) have recently dominated a range of computer vision tasks, yet they suffer from low training data efficiency and inferior local semantic representation capability without appropriate inductive bias. Convolutional neural networks (CNNs) inherently capture region-aware semantics, inspiring researchers to introduce CNNs back into the ViT architecture to provide desirable inductive bias. However, is the locality achieved by the micro-level CNNs embedded in ViTs good enough? In this paper, we investigate the problem by thoroughly exploring how the macro architecture of hybrid CNNs/ViTs enhances the performance of hierarchical ViTs. In particular, we study the role of token embedding layers, also called convolutional embedding (CE), and systematically reveal how CE injects desirable inductive bias in ViTs. Besides, we apply the optimal CE configuration…
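To make the idea of a convolutional token embedding concrete, the sketch below compares, by shape arithmetic only, a single non-overlapping patch projection with a small stack of overlapping strided convolutions. The layer configurations (`patchify`, `conv_embed`) and the 224-pixel input are illustrative assumptions, not the CE configuration reported in the paper; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def token_grid(size, layers):
    """Apply (kernel, stride, padding) layers in order; return the final side length."""
    for kernel, stride, padding in layers:
        size = conv_out(size, kernel, stride, padding)
    return size


def receptive_field(layers):
    """Number of input pixels (per side) that one output token can see."""
    rf, jump = 1, 1
    for kernel, stride, _ in layers:
        rf += (kernel - 1) * jump   # each layer widens the view by (k - 1) * current step
        jump *= stride              # step between neighbouring outputs, in input pixels
    return rf


# Hypothetical configurations, chosen for illustration only.
patchify = [(4, 4, 0)]                           # one non-overlapping 4x4 projection
conv_embed = [(3, 2, 1), (3, 1, 1), (3, 2, 1)]   # stacked overlapping 3x3 convolutions

for name, layers in [("patchify", patchify), ("conv_embed", conv_embed)]:
    side = token_grid(224, layers)
    print(f"{name}: {side}x{side} tokens, receptive field {receptive_field(layers)} px")
```

Under these assumed settings both embeddings produce the same 56x56 token grid, but each token of the stacked-convolution variant covers an 11-pixel window that overlaps its neighbours, whereas each patchify token sees an isolated 4-pixel patch. This overlap is one simple way to picture the locality that convolutional embedding is said to inject.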
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · CCD and CMOS Imaging Sensors
