ViMoE: An Empirical Study of Designing Vision Mixture-of-Experts
Xumeng Han, Longhui Wei, Zhiyang Dou, Zipeng Wang, Chenhui Qiang, Xin He, Yingfei Sun, Zhenjun Han, Qi Tian

TL;DR
This paper presents ViMoE, a vision Mixture-of-Experts model built on the Vision Transformer. It analyzes MoE layer design and routing behavior, and proposes a shared expert to improve stability and efficiency in image classification and semantic segmentation.
Contribution
The paper introduces ViMoE and reports an empirical study of vision MoE models, highlighting the importance of MoE layer configuration, shared experts, and expert routing analysis for better performance.
Findings
Shared expert improves model stability.
Optimal MoE layer configuration is crucial.
Routing analysis reveals layer specialization.
Abstract
Mixture-of-Experts (MoE) models embody the divide-and-conquer concept and are a promising approach for increasing model capacity, demonstrating excellent scalability across multiple domains. In this paper, we integrate the MoE structure into the classic Vision Transformer (ViT), naming it ViMoE, and explore the potential of applying MoE to vision through a comprehensive study on image classification and semantic segmentation. However, we observe that the performance is sensitive to the configuration of MoE layers, making it challenging to obtain optimal results without careful design. The underlying cause is that inappropriate MoE layers lead to unreliable routing and hinder experts from effectively acquiring helpful information. To address this, we introduce a shared expert to learn and capture common knowledge, serving as an effective way to construct stable ViMoE. Furthermore, we…
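The shared-expert idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class name `SharedExpertMoE`, the top-1 softmax routing, the single linear map standing in for each expert (in a ViT the experts would normally replace the feed-forward block), and the way the shared and routed outputs are summed. It is not the authors' implementation.

```python
import math
import random


def linear(weights, x):
    """Apply a bias-free dense layer; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols, scale=0.1):
    """Small random weights, used only to make the sketch runnable."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class SharedExpertMoE:
    """Hypothetical MoE layer: one always-on shared expert plus top-1 routed experts."""

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)
        self.router = random_matrix(rng, num_experts, dim)
        # Assumption: each expert is a single linear map, not a full MLP.
        self.experts = [random_matrix(rng, dim, dim) for _ in range(num_experts)]
        self.shared = random_matrix(rng, dim, dim)

    def route(self, token):
        """Return the index and gate value of the highest-scoring expert."""
        gates = softmax(linear(self.router, token))
        best = max(range(len(gates)), key=gates.__getitem__)
        return best, gates[best]

    def forward(self, token):
        """Shared expert output plus the gate-weighted output of the routed expert."""
        best, gate = self.route(token)
        routed = linear(self.experts[best], token)
        shared = linear(self.shared, token)
        return [s + gate * r for s, r in zip(shared, routed)], best


layer = SharedExpertMoE(dim=4, num_experts=3)
tokens = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.9, 0.1, -0.3]]
outputs = [layer.forward(t) for t in tokens]
```

In this sketch the shared expert processes every token regardless of the routing decision, so a poor routing choice changes only the gated term rather than the whole output. That is one plausible reading of how a shared expert could stabilize training.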
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
This article studies the Vision MoE architecture. The article is easy to read and has a clear structure.
My doubts regarding this paper primarily stem from two major aspects, namely the experimental setup and the theoretical foundation. 1. At present, the most prominent application domains of MoE are Vision-Language Models (VLMs) and Large Language Models (LLMs). Although this paper focuses on investigating the role of MoE in visual representation learning, it has a significant shortcoming: it fails to provide validation within practical application scenarios. Practical scena
- The paper shows that a simple method (adding sparse MoE layers, then fine-tuning) can improve the classification performance of pre-trained ViTs. It also presents a series of ablation studies on various design choices, which could be helpful for future work. - The main experimental results suggest that the proposed method is overall competitive. - The presentation is clear and easy to understand.
- Unfair comparison is a serious issue in this paper. 1) The method itself is a plug-and-play module and should not be presented as a new method in Fig. 1 and Tab. 2. Instead, it should be marked as DINOv2+ViMoE to avoid confusion. 2) It should be ensured that all other reported methods have gone through 200-epoch fine-tuning on ImageNet, and results of ViMoE+X (other backbones) should be reported. 3) ViMoE should be compared with other fine-tuning (adaptation) techniques, e.g., SupCon, VPT, LoRA, etc
1. The paper is clearly written and easy to understand. 2. The authors propose a new MoE for vision tasks, which is easy to implement. 3. The analysis is sufficient.
1. The authors only explore the image classification task. 2. The dataset they use is common and corrupted. For example, ImageNet-1k has a lot of samples with bad annotations. You should also try some better and harder datasets. 3. The proposed network is not novel; similar designs can be seen in many NLP papers.
- The study of optimal MoE configuration is an important issue. More and more models are turning to MoEs to increase either network capacity or for specialized subcomputations, and the authors’ object of study has potential to influence a lot of future model design in significant ways. - Independent of my reservations outlined below in the weaknesses section, the experimental section is relatively thorough, and I appreciate the multiple ways in which the authors attempt to visualize and evaluate
## [W1] Per-image expert assignment confounds conclusions
The major results in the paper are derived from experiments exploring the impact of varying which layers use MoEs for ImageNet classification in ViTs. My main concern about the paper is that the authors’ design choice to share the expert assignment across *all* tokens in an image is problematic, and complicates their conclusions. I suspect this is underestimating the performance of networks trained with MoEs at earlier layers with a mor
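The distinction this reviewer draws between per-image and per-token expert assignment can be made concrete with a small sketch. The function names, the example router logits, and the mean-pooling rule for the per-image choice are all hypothetical. The excerpt does not say how the paper derives its per-image assignment, so the pooling here is only one possible choice.

```python
def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def route_per_token(token_logits):
    """Each token independently picks its highest-scoring expert."""
    return [argmax(logits) for logits in token_logits]


def route_per_image(token_logits):
    """All tokens share one expert, chosen from the mean of the token logits.

    Assumption: mean pooling is used only for illustration.
    """
    num_experts = len(token_logits[0])
    mean_logits = [
        sum(logits[e] for logits in token_logits) / len(token_logits)
        for e in range(num_experts)
    ]
    shared_choice = argmax(mean_logits)
    return [shared_choice] * len(token_logits)


# Hypothetical router logits for 4 patch tokens and 3 experts.
token_logits = [
    [2.0, 0.1, 0.0],
    [0.0, 1.5, 0.2],
    [1.8, 0.3, 0.1],
    [0.1, 0.2, 1.0],
]
per_token = route_per_token(token_logits)   # [0, 1, 0, 2]
per_image = route_per_image(token_logits)   # [0, 0, 0, 0]
```

With these made-up logits, per-token routing sends the four patches to three different experts, while per-image routing forces all of them through expert 0. This shows how the two schemes can diverge when the patches of one image have different routing preferences.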
Taxonomy
Topics: Spatial Cognition and Navigation
Methods: Attention Is All You Need · Dense Connections · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Mixture of Experts · Adam · Linear Layer · Softmax · Multi-Head Attention
