MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
Bo Li, Chuan Wu, Shaolin Zhu

TL;DR
MACS is a training-free inference framework that improves the efficiency of multimodal MoE large language models by addressing load balancing challenges through modality-aware capacity scaling.
Contribution
It introduces a novel, training-free approach with entropy-weighted load and dynamic capacity mechanisms to better balance expert resources in multimodal inference.
Findings
MACS significantly reduces inference bottlenecks in multimodal MoE models.
It outperforms existing load balancing methods on multiple benchmarks.
MACS effectively adapts to varying modal compositions during inference.
Abstract
Mixture-of-Experts Multimodal Large Language Models (MoE MLLMs) suffer from a significant efficiency bottleneck during Expert Parallelism (EP) inference due to the straggler effect. This issue is worsened in the multimodal context, as existing token-count-based load balancing methods fail to address two unique challenges: (1) Information Heterogeneity, where numerous redundant visual tokens are treated the same as semantically critical ones, and (2) Modality Dynamics, where varying visual-to-text ratios across tasks lead to resource misallocation. To address these challenges, we propose MACS (Modality-Aware Capacity Scaling), a training-free inference framework. Specifically, MACS introduces an Entropy-Weighted Load mechanism to quantify the semantic value of visual tokens, addressing information heterogeneity. Additionally, the Dynamic Modality-Adaptive Capacity mechanism allocates…
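The contrast between token-count load and an entropy-weighted load can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which distribution the entropy is taken over or how it is normalized, so the per-token probability vector, the helper names `normalized_entropy` and `expert_loads`, and the choice to give text tokens a weight of 1 are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def normalized_entropy(probs):
    """Shannon entropy of a probability vector, scaled to [0, 1]."""
    n = len(probs)
    if n <= 1:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(n)


def expert_loads(tokens, entropy_weighted=True):
    """Sum the load routed to each expert.

    Each token is (expert_id, is_visual, probs), where probs is a
    hypothetical per-token distribution (an assumption of this sketch).
    """
    loads = defaultdict(float)
    for expert_id, is_visual, probs in tokens:
        if entropy_weighted and is_visual:
            # Assumption: entropy stands in for the token's semantic value.
            weight = normalized_entropy(probs)
        else:
            # Plain token counting, also used for text tokens here.
            weight = 1.0
        loads[expert_id] += weight
    return dict(loads)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

tokens = [
    (0, True, peaked),    # low-entropy visual token, treated as redundant
    (0, True, peaked),
    (0, True, uniform),   # high-entropy visual token
    (1, False, uniform),  # text tokens
    (1, False, uniform),
]

count_loads = expert_loads(tokens, entropy_weighted=False)
weighted_loads = expert_loads(tokens, entropy_weighted=True)
print(count_loads)
print(weighted_loads)
```

Under token counting, expert 0 appears to carry more load than expert 1 because it receives three tokens. With the assumed weighting, its two low-entropy visual tokens contribute little, so its estimated load drops below that of expert 1.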
