ViTAS: Vision Transformer Architecture Search
Xiu Su, Shan You, Jiyang Xie, Mingkai Zheng, Fei Wang, Chen Qian, Changshui Zhang, Xiaogang Wang, Chang Xu

TL;DR
This paper introduces ViTAS, a novel neural architecture search method for vision transformers that stabilizes training and improves performance by addressing token embedding imbalance and superformer issues.
Contribution
We propose a cyclic weight-sharing mechanism and identity shifting to enhance the stability and effectiveness of NAS for vision transformers.
Findings
Achieved 82.0% ImageNet accuracy with 3.0G FLOPs.
Outperformed existing ViTs by 2.4% mAP on COCO2017.
Demonstrated significant improvements over baseline architectures.
Abstract
Vision transformers (ViTs) inherited the success of NLP, but their structures have not been sufficiently investigated and optimized for visual tasks. One of the simplest solutions is to directly search for the optimal structure via neural architecture search (NAS), which is widely used for CNNs. However, we empirically find that this straightforward adaptation encounters catastrophic failures and is frustratingly unstable when training the superformer. In this paper, we argue that since ViTs mainly operate on token embeddings with little inductive bias, the imbalance of channels across different architectures worsens the weight-sharing assumption and causes training instability as a result. Therefore, we develop a new cyclic weight-sharing mechanism for the token embeddings of ViTs, which enables each channel to contribute more evenly to all candidate architectures. Besides, we also propose…
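The channel-imbalance argument in the abstract can be illustrated with a toy count. The sketch below is not the paper's exact index rule; it assumes a hypothetical scheme in which each candidate width takes a window of channels that starts where the previous window ended and wraps around, and compares it with the conventional scheme where a width-c candidate always uses the first c channels. The function names and the candidate widths are made up for illustration.

```python
from collections import Counter


def ordinal_indices(width, total):
    """Conventional sharing: a width-c candidate always uses the first c channels."""
    return list(range(width))


def cyclic_indices(start, width, total):
    """Illustrative cyclic sharing (an assumption, not the paper's rule):
    take `width` channels beginning at `start`, wrapping past the last channel."""
    return [(start + i) % total for i in range(width)]


def usage_counts(widths, total, cyclic):
    """Count how many candidate architectures use each of the `total` channels."""
    counts = Counter()
    start = 0
    for w in widths:
        if cyclic:
            idx = cyclic_indices(start, w, total)
        else:
            idx = ordinal_indices(w, total)
        counts.update(idx)
        # The next window begins where this one ended (only matters when cyclic).
        start = (start + w) % total
    return [counts[c] for c in range(total)]


# Hypothetical search space: three candidate widths over a 6-channel embedding.
widths, total = [2, 4, 6], 6
print(usage_counts(widths, total, cyclic=False))  # [3, 3, 2, 2, 1, 1]
print(usage_counts(widths, total, cyclic=True))   # [2, 2, 2, 2, 2, 2]
```

Under the ordinal scheme the first channels are shared by every candidate while the last ones are used by only the widest, which is the kind of imbalance the abstract describes. The wrap-around window spreads usage evenly in this toy case.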
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Feedforward Network · Dropout · Attention Dropout · Data-efficient Image Transformer
