Convolutional Embedding Makes Hierarchical Vision Transformer Stronger

Cong Wang, Hongmin Xu, Xiong Zhang, Li Wang, Zhitong Zheng, and Haifeng Liu

arXiv:2207.13317 · cs.CV · August 2, 2022 · 1 citation

Open Access

TL;DR

This paper demonstrates that integrating convolutional embedding layers into hierarchical Vision Transformers significantly enhances their performance across multiple vision tasks by providing beneficial inductive biases.

Contribution

The study systematically explores how macro architecture and convolutional embedding improve ViT performance, introducing CETNets as efficient hybrid CNN/ViT backbones.

Findings

01. CETNets achieve 84.9% Top-1 accuracy on ImageNet-1K

02. Effective boosting of state-of-the-art ViTs with convolutional embedding

03. Improved performance on COCO and ADE20K benchmarks

Abstract

Vision Transformers (ViTs) have recently dominated a range of computer vision tasks, yet they suffer from low training data efficiency and inferior local semantic representation capability without appropriate inductive bias. Convolutional neural networks (CNNs) inherently capture region-aware semantics, inspiring researchers to introduce CNNs back into the ViT architecture to provide the desirable inductive bias. However, is the locality achieved by the micro-level CNNs embedded in ViTs good enough? In this paper, we investigate the problem by thoroughly exploring how the macro architecture of hybrid CNNs/ViTs enhances the performance of hierarchical ViTs. In particular, we study the role of token embedding layers, also known as convolutional embedding (CE), and systematically reveal how CE injects desirable inductive bias into ViTs. Besides, we apply the optimal CE configuration…
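To make the idea of a token embedding layer concrete, here is a minimal sketch that compares a non-overlapping patch embedding with a small stack of overlapping convolutions. It uses only the standard convolution output-size and receptive-field formulas. The layer configurations (`patchify`, `conv_stem`) and the 224-pixel input are hypothetical values chosen for illustration; they are not the CE configuration studied in the paper.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def token_grid(size, layers):
    """Side length of the token grid after a stack of (kernel, stride, padding) layers."""
    for kernel, stride, padding in layers:
        size = conv_out(size, kernel, stride, padding)
    return size


def receptive_field(layers):
    """Input pixels (along one dimension) seen by a single output token."""
    rf, jump = 1, 1
    for kernel, stride, _ in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # spacing between adjacent outputs, in input pixels
    return rf


# Hypothetical embedding configurations, for illustration only.
patchify = [(4, 4, 0)]                         # one 4x4 conv, stride 4: disjoint patches
conv_stem = [(3, 2, 1), (3, 1, 1), (3, 2, 1)]  # stacked 3x3 convs, same total stride of 4

patch_tokens = token_grid(224, patchify)
conv_tokens = token_grid(224, conv_stem)
patch_rf = receptive_field(patchify)
conv_rf = receptive_field(conv_stem)

print(patch_tokens, patch_rf)  # 56 4
print(conv_tokens, conv_rf)    # 56 11
```

Both variants yield the same 56 x 56 token grid, but each token from the stacked convolutions covers 11 input pixels per side instead of 4, so neighboring tokens overlap. That overlap is one simple way to see how a convolutional embedding can carry a locality bias into the transformer stages.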

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · CCD and CMOS Imaging Sensors