AnyExperts: On-Demand Expert Allocation for Multimodal Language Models with Mixture of Expert
Yuting Gao, Wang Lan, Hengyuan Zhao, Linjiang Huang, Si Liu, Qingpei Guo

TL;DR
AnyExperts introduces a dynamic, importance-aware expert routing framework for multimodal MoE models, optimizing resource allocation and maintaining high performance across vision, audio, and NLP tasks.
Contribution
It proposes a novel on-demand, budget-aware routing strategy that adaptively allocates real and virtual experts based on semantic importance, improving efficiency.
Findings
Achieves 40% fewer real expert activations on image/video tasks.
Maintains performance while reducing real expert usage by 10% on text-dense tasks.
Enhances efficiency and effectiveness of multimodal MoE models.
Abstract
Multimodal Mixture-of-Experts (MoE) models offer a promising path toward scalable and efficient large vision-language systems. However, existing approaches rely on rigid routing strategies (typically activating a fixed number of experts per token), ignoring the inherent heterogeneity in semantic importance across modalities. This leads to suboptimal compute allocation, where redundant tokens consume as many resources as critical ones. To address this, we propose AnyExperts, a novel on-demand, budget-aware dynamic routing framework that allocates a variable total number of expert slots per token based on its semantic importance. Crucially, to prevent uncontrolled compute growth, the total number of slots per token is constrained within a fixed range, and each slot is filled by either a real expert or a virtual expert, with the virtual share capped at a small maximum (e.g., 20%). The model then…
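The budget constraints described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `allocate_slots`, the linear mapping from importance to slot count, the slot range `k_min`/`k_max`, and the `requested_virtual` input are all assumptions made for illustration. Only the bounded slot range and the capped virtual share (20% as the example value) come from the text.

```python
def allocate_slots(importance, requested_virtual,
                   k_min=5, k_max=10, max_virtual_share=0.2):
    """Hypothetical per-token slot allocation under a fixed budget.

    importance: assumed to be a score in [0, 1]; the source does not
        define its scale.
    requested_virtual: number of virtual-expert slots a router would
        like to use for this token (an assumed input).
    Returns (total_slots, real_slots, virtual_slots).
    """
    if not 0.0 <= importance <= 1.0:
        raise ValueError("importance must lie in [0, 1]")
    if requested_virtual < 0:
        raise ValueError("requested_virtual must be non-negative")

    # Assumption: total slots grow linearly with importance.
    total = k_min + round(importance * (k_max - k_min))
    # Constrain the total to the fixed range [k_min, k_max].
    total = max(k_min, min(k_max, total))

    # Cap the virtual share of this token's slots (e.g. 20%).
    # The small epsilon guards against floating-point rounding.
    max_virtual = int(max_virtual_share * total + 1e-9)
    virtual = min(requested_virtual, max_virtual)

    # Every remaining slot is filled by a real expert.
    real = total - virtual
    return total, real, virtual


if __name__ == "__main__":
    for score, wanted in [(0.0, 3), (0.4, 1), (1.0, 5)]:
        print(score, wanted, allocate_slots(score, wanted))
```

In this sketch a low-importance token receives the minimum number of slots and a high-importance token the maximum, and no token can fill more than the capped fraction of its slots with virtual experts, no matter how many the router requests.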
Taxonomy
Topics: Multimodal Machine Learning Applications · Mobile Crowdsensing and Crowdsourcing · Domain Adaptation and Few-Shot Learning
