Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model
Chaoxiang Cai, Longrong Yang, Minghe Weng, Xuewei Li, Zequn Qin, Xi Li

TL;DR
This paper introduces a novel routing mechanism for mixture-of-experts in large vision-language models that accounts for modality-specific distribution differences, especially long-tailed vision token distributions.
Contribution
The proposed Long-Tailed Distribution-aware Router (LTDR) addresses modality-specific routing and dynamic expert activation, improving performance on vision-language and vision benchmarks.
Findings
Achieves 1.2%/2.1% improvement on vision-language benchmarks.
Achieves 1.6% improvement on vision benchmarks.
Effectively handles long-tailed vision token distributions.
Abstract
The mixture-of-experts (MoE) architecture, which replaces dense networks with sparse ones, has attracted significant attention in large vision-language models (LVLMs) for achieving comparable performance while activating far fewer parameters. Existing MoE architectures for LVLMs primarily focus on token-to-expert routing (TER), encouraging different experts to specialize in processing specific tokens. However, these methods typically rely on the load balancing mechanism, neglecting the inherent distributional differences between vision and language modalities. To address this limitation, we propose the Long-Tailed Distribution-aware Router (LTDR) for vision-language TER, which tackles two key challenges: (1) Modality-specific distribution-aware routing. We observe that language TER generally follows a relatively uniform distribution, whereas vision TER exhibits a long-tailed…
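For orientation, the conventional setup the abstract contrasts against (top-k token-to-expert routing trained with a load-balancing auxiliary loss) can be sketched as below. This is a generic NumPy illustration in the style of the Switch Transformer auxiliary loss, not the LTDR method itself. The function names `top_k_route` and `load_balancing_loss`, the default `k=2`, and the normalization of expert load by total assignments are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def top_k_route(logits, k=2):
    """Generic top-k token-to-expert routing (hypothetical helper, not LTDR).

    logits: (num_tokens, num_experts) router scores.
    Returns (idx, gates, probs):
      idx   -- (num_tokens, k) indices of the selected experts
      gates -- (num_tokens, k) selected probabilities, renormalized to sum to 1
      probs -- (num_tokens, num_experts) full softmax router probabilities
    """
    probs = softmax(logits, axis=-1)
    idx = np.argsort(-probs, axis=-1)[:, :k]
    gates = np.take_along_axis(probs, idx, axis=-1)
    gates = gates / gates.sum(axis=-1, keepdims=True)
    return idx, gates, probs


def load_balancing_loss(probs, idx):
    """Switch-style auxiliary loss: num_experts * sum_i f_i * P_i.

    f_i is the fraction of routing assignments sent to expert i and P_i is
    the mean router probability for expert i. The scaling coefficient that
    normally multiplies this term is omitted here.
    """
    num_experts = probs.shape[1]
    counts = np.bincount(idx.ravel(), minlength=num_experts)
    f = counts / idx.size
    p = probs.mean(axis=0)
    return num_experts * float(np.sum(f * p)), f


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(16, 4))  # 16 tokens, 4 experts (toy sizes)
    idx, gates, probs = top_k_route(logits, k=2)
    loss, load = load_balancing_loss(probs, idx)
    print("per-expert load:", load)
    print("auxiliary loss:", loss)
```

The loss equals 1 when the router probabilities are uniform across experts and grows as both assignments and probability mass concentrate on a few experts, so minimizing it pushes every modality toward the same uniform expert usage. That uniform pressure is what the abstract argues is a poor fit for vision tokens, whose routing is described as long-tailed.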
