Vision Transformer with Convolutions Architecture Search
Haichao Zhang, Kuangrong Hao, Witold Pedrycz, Lei Gao, Xuesong Tang, and Bing Wei

TL;DR
This paper introduces VTCAS, an architecture search method that combines convolutional features with Transformer models, resulting in a versatile backbone that improves performance on image classification and object detection tasks.
Contribution
It proposes a novel architecture search approach that integrates convolutional features into Transformer models, enhancing their robustness and multi-scale feature extraction capabilities.
Findings
Achieved 82.0% Top-1 accuracy on ImageNet-1K.
Obtained 50.4% mAP on COCO2017 for object detection.
Enhanced robustness in low-illumination indoor scenes.
Abstract
Transformers offer significant advantages in computer vision tasks. They model image classification by applying a multi-head attention mechanism to a sequence of patches obtained by splitting the image. For complex tasks, however, a vision Transformer must not only retain dynamic attention and global context but also gain properties such as noise reduction and invariance to shifting and scaling of objects. We therefore study the structural characteristics of Transformers and convolutions and propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS). The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture while maintaining the benefits of the multi-head attention…
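The patch-based multi-head attention that the abstract refers to can be illustrated with a minimal NumPy sketch. This is a generic Vision Transformer style attention step, not the VTCAS search procedure or the searched backbone. The function names, the 32x32 input, the patch size of 8, the embedding width of 64, and the random weights are all assumptions chosen for illustration.

```python
import numpy as np

def split_into_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over (N, D) tokens, split across heads."""
    n, d = tokens.shape
    head_dim = d // num_heads

    def to_heads(x):
        # (N, D) -> (heads, N, head_dim)
        return x.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = to_heads(tokens @ w_q), to_heads(tokens @ w_k), to_heads(tokens @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, N, N)
    weights = softmax(scores, axis=-1)
    merged = (weights @ v).transpose(1, 0, 2).reshape(n, d)  # concatenate heads
    return merged @ w_o, weights

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
patches = split_into_patches(image, patch_size=8)

embed_dim, num_heads = 64, 4
w_embed = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
tokens = patches @ w_embed  # linear patch embedding

w_q, w_k, w_v, w_o = (
    rng.standard_normal((embed_dim, embed_dim)) * 0.02 for _ in range(4)
)
out, weights = multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads)
```

Every patch attends to every other patch, which gives the global context mentioned in the abstract. Nothing in this step is inherently invariant to shifting or scaling, which is the gap the convolutional features are meant to fill.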
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Advanced Neural Network Applications · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Label Smoothing · Dropout
