Vision Transformer with Convolutions Architecture Search

Haichao Zhang; Kuangrong Hao; Witold Pedrycz; Lei Gao; Xuesong Tang; and Bing Wei

arXiv:2203.10435 · cs.CV · March 22, 2022 · 5 cites

Open Access

TL;DR

This paper introduces VTCAS, an architecture search method that combines convolutional features with Transformer models, resulting in a versatile backbone that improves performance on image classification and object detection tasks.

Contribution

It proposes a novel architecture search approach that integrates convolutional features into Transformer models, enhancing their robustness and multi-scale feature extraction capabilities.

Findings

01. Achieved 82.0% Top-1 accuracy on ImageNet-1K.

02. Obtained 50.4% mAP on COCO2017 for object detection.

03. Enhanced robustness in low-illumination indoor scenes.

Abstract

Transformers exhibit great advantages in handling computer vision tasks. They model image classification by using a multi-head attention mechanism to process a sequence of patches obtained by splitting the image. However, for complex tasks, a Transformer in computer vision needs not only to inherit dynamic attention and global context, but also to introduce features for noise reduction and for shift and scale invariance of objects. Therefore, we study the structural characteristics of the Transformer and of convolution and propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS). The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture while maintaining the benefits of the multi-head attention…

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Advanced Neural Network Applications · Visual Attention and Saliency Detection

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Label Smoothing · Dropout