Experts Weights Averaging: A New General Training Scheme for Vision Transformers
Yongqi Huang, Peng Ye, Xiaoshui Huang, Sheng Li, Tao Chen, Tong He, Wanli Ouyang

TL;DR
This paper introduces a novel training scheme for Vision Transformers that uses Experts Weights Averaging with MoEs during training, improving performance without increasing inference costs and enabling effective fine-tuning.
Contribution
The paper proposes a new training strategy for ViTs that decouples training and inference, utilizing MoEs and Experts Weights Averaging to enhance performance without inference overhead.
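To make the "no inference overhead" idea concrete, here is a minimal sketch of collapsing several experts into a single feed-forward layer by uniformly averaging their weights. This summary does not spell out the paper's exact procedure, so the uniform average, the function name `average_expert_weights`, and the parameter names `w` and `b` are all illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List

Matrix = List[List[float]]


def average_expert_weights(experts: List[Dict[str, Matrix]]) -> Dict[str, Matrix]:
    """Average same-shaped parameter matrices element-wise across experts.

    Assumption: every expert holds the same parameter names and shapes,
    and a plain uniform mean is used (hypothetical choice for illustration).
    """
    if not experts:
        raise ValueError("need at least one expert")
    n = len(experts)
    merged: Dict[str, Matrix] = {}
    for name in experts[0]:
        rows = len(experts[0][name])
        cols = len(experts[0][name][0])
        merged[name] = [
            [sum(e[name][r][c] for e in experts) / n for c in range(cols)]
            for r in range(rows)
        ]
    return merged


# Two hypothetical experts, each a tiny layer with one weight matrix and one bias row.
experts = [
    {"w": [[1.0, 2.0], [3.0, 4.0]], "b": [[0.0, 1.0]]},
    {"w": [[3.0, 4.0], [5.0, 6.0]], "b": [[2.0, 3.0]]},
]
ffn = average_expert_weights(experts)
print(ffn["w"])  # [[2.0, 3.0], [4.0, 5.0]]
print(ffn["b"])  # [[1.0, 2.0]]
```

The merged dictionary has the same shape as a single expert, which is why a model converted this way would cost no more at inference than an ordinary feed-forward layer.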
Findings
The training scheme improves performance across multiple visual tasks and datasets.
EWA significantly boosts naive MoE performance on small datasets.
Theoretical analysis explains the underlying mechanism of the training scheme.
Abstract
Structural re-parameterization is a general training scheme for Convolutional Neural Networks (CNNs) that improves performance without increasing inference cost. As Vision Transformers (ViTs) are gradually surpassing CNNs in various visual tasks, one may ask whether there is a training scheme specifically for ViTs that can also improve performance without increasing inference cost. Recently, Mixture-of-Experts (MoE) has attracted increasing attention, as it can efficiently scale up the capacity of Transformers at a fixed cost through sparsely activated experts. Considering that MoE can also be viewed as a multi-branch structure, can we utilize MoE to implement a ViT training scheme similar to structural re-parameterization? In this paper, we answer these questions affirmatively with a new general training strategy for ViTs. Specifically, we decouple the…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Image Processing Techniques and Applications · Advanced Image and Video Retrieval Techniques
