ViTAS: Vision Transformer Architecture Search

Xiu Su; Shan You; Jiyang Xie; Mingkai Zheng; Fei Wang; Chen Qian; Changshui Zhang; Xiaogang Wang; Chang Xu

arXiv:2106.13700 · cs.CV · December 1, 2021 · 1 citation

TL;DR

This paper introduces ViTAS, a neural architecture search (NAS) method for vision transformers that stabilizes superformer training and improves performance by addressing the channel imbalance of token embeddings across candidate architectures.

Contribution

We propose a cyclic weight-sharing mechanism and identity shifting to enhance the stability and effectiveness of NAS for vision transformers.

Findings

01 Achieved 82.0% ImageNet accuracy with 3.0G FLOPs.

02 Outperformed existing ViTs by 2.4% mAP on COCO2017.

03 Demonstrated significant improvements over baseline architectures.

Abstract

Vision transformers (ViTs) inherited the success of NLP, but their structures have not been sufficiently investigated and optimized for visual tasks. One of the simplest solutions is to directly search for the optimal structure via neural architecture search (NAS), which is widely used for CNNs. However, we empirically find that this straightforward adaptation encounters catastrophic failures and makes the training of the superformer frustratingly unstable. In this paper, we argue that since ViTs mainly operate on token embeddings with little inductive bias, the imbalance of channels across different architectures worsens the weight-sharing assumption and causes training instability as a result. Therefore, we develop a new cyclic weight-sharing mechanism for the token embeddings of ViTs, which enables each channel to contribute more evenly to all candidate architectures. Besides, we also propose…
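
The channel imbalance described in the abstract can be illustrated with a small counting exercise. The sketch below is not the paper's algorithm: the offset-chaining rule in `cyclic_indices`, the channel count `TOTAL`, and the candidate widths `WIDTHS` are all hypothetical choices made for illustration. It only shows that when every candidate width reuses the first channels of a shared embedding (ordinal sharing), early channels are used far more often than late ones, whereas a wrap-around selection can spread usage more evenly.

```python
from collections import Counter

def ordinal_indices(width):
    """Ordinal sharing: a candidate with `width` channels always uses the first ones."""
    return list(range(width))

def cyclic_indices(width, total, offset):
    """Hypothetical cyclic sharing: start at `offset` and wrap around the channel axis."""
    return [(offset + i) % total for i in range(width)]

def usage_counts(index_sets, total):
    """Count how many candidate architectures use each shared channel."""
    counts = Counter()
    for indices in index_sets:
        counts.update(indices)
    return [counts[c] for c in range(total)]

TOTAL = 8              # assumed size of the shared embedding dimension
WIDTHS = [2, 4, 6, 8]  # assumed candidate widths in the search space

ordinal = usage_counts([ordinal_indices(w) for w in WIDTHS], TOTAL)

# Assumption: each candidate starts where the previous one ended.
cyclic_sets = []
offset = 0
for w in WIDTHS:
    cyclic_sets.append(cyclic_indices(w, TOTAL, offset))
    offset = (offset + w) % TOTAL
cyclic = usage_counts(cyclic_sets, TOTAL)

print("ordinal usage per channel:", ordinal)
print("cyclic usage per channel: ", cyclic)
```

With these assumed widths, ordinal sharing leaves a gap of 3 between the most and least used channels, while the wrap-around variant narrows it to 1. The total number of channel uses is the same in both cases, so only the distribution changes.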

Code & Models

Repositories

xiusu/ViTAS (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Feedforward Network · Dropout · Attention Dropout · Data-efficient Image Transformer