SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs
Zi-Hao Bo, Yaqian Li, Anzhou Hou, Rinyoichi Takezoe, Ertao Zhao, Tianxiang Pan, Jiale Yan, Mo Guang, Kaiwen Long

TL;DR
This paper introduces SMoES, a new modality-guided expert routing method for MoE-based vision-language models that improves task performance and deployment efficiency by leveraging layer-dependent modality fusion patterns.
Contribution
SMoES combines dynamic soft modality scores, an expert binning mechanism, and an inter-bin mutual information regularization to enhance expert specialization in MoE-VLMs.
Findings
Achieves average gains of 0.9% and 4.2% on multimodal and language tasks, respectively.
Reduces expert-parallel (EP) communication overhead by 56.1%.
Improves deployment throughput by 12.3%.
Abstract
Mixture-of-Experts (MoE) has become a prevalent backbone for large vision-language models (VLMs), yet how modality-specific signals should guide expert routing remains under-explored. Existing routing strategies are either hand-crafted or modality-agnostic, relying on idealized priors that ignore the layer-dependent modality fusion patterns in MoE-VLMs and provide little guidance for expert specialization. We propose Soft Modality-guided Expert Specialization (SMoES), which consists of dynamic soft modality scores that capture layer-dependent fusion patterns, an expert binning mechanism aligned with expert-parallel deployment, and an inter-bin mutual information regularization that encourages coherent modality specialization. Our method leverages attention-based or Gaussian-statistics modality scores to optimize mutual information regularization. Experiments across four MoE-based VLMs…
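The abstract names an inter-bin mutual information regularizer but the excerpt does not give its formula. The sketch below is one plausible reading, not the authors' implementation: it assumes each token carries a soft score in [0, 1] for "how visual" it is, that experts are split into equal contiguous bins (for example, one bin per expert-parallel rank), and that the regularizer measures mutual information between the modality variable and the bin a token is routed to. The function name, argument names, and the binning scheme are all hypothetical.

```python
import numpy as np

def inter_bin_mutual_information(router_probs, modality_scores, num_bins, eps=1e-12):
    """Mutual information (in nats) between soft modality and expert bin.

    Assumptions (not from the paper):
      router_probs    : (T, E) array, each row a softmax over E experts
      modality_scores : (T,) array, soft probability that a token is visual
      experts are grouped into `num_bins` equal, contiguous bins
    """
    num_tokens, num_experts = router_probs.shape
    if num_experts % num_bins != 0:
        raise ValueError("num_experts must be divisible by num_bins")

    # Routing mass per bin: sum expert probabilities inside each bin -> (T, B)
    bin_probs = router_probs.reshape(
        num_tokens, num_bins, num_experts // num_bins
    ).sum(axis=2)

    # Two-way soft modality distribution per token: [visual, text] -> (T, 2)
    modality = np.stack([modality_scores, 1.0 - modality_scores], axis=1)

    # Empirical joint P(modality, bin), averaged over tokens -> (2, B)
    joint = modality.T @ bin_probs / num_tokens
    p_modality = joint.sum(axis=1, keepdims=True)  # (2, 1)
    p_bin = joint.sum(axis=0, keepdims=True)       # (1, B)

    # I(M; B) = sum_{m,b} P(m,b) * log( P(m,b) / (P(m) P(b)) )
    mi = np.sum(joint * (np.log(joint + eps) - np.log(p_modality * p_bin + eps)))
    return float(mi)

# Toy check: 4 tokens, 4 experts, 2 bins.
scores = np.array([1.0, 1.0, 0.0, 0.0])  # two visual tokens, two text tokens
specialized = np.array([
    [0.5, 0.5, 0.0, 0.0],   # visual tokens use bin 0 only
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],   # text tokens use bin 1 only
    [0.0, 0.0, 0.5, 0.5],
])
uniform = np.full((4, 4), 0.25)          # routing ignores modality

mi_specialized = inter_bin_mutual_information(specialized, scores, num_bins=2)
mi_uniform = inter_bin_mutual_information(uniform, scores, num_bins=2)
```

In this toy setup the modality-separated routing reaches the maximum for two balanced modalities (log 2 nats), while modality-agnostic routing gives zero. Under the same assumptions, a training objective could subtract a weighted copy of this quantity from the task loss so that higher mutual information is rewarded; whether SMoES uses this exact form, sign, or weighting is not stated in the excerpt.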
