VisMMoE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading
Cheng Xu, Xiaofeng Hou, Jiacheng Liu, Chao Li

TL;DR
VisMMoE is an offloading system for visual-language MoE models that prunes redundant visual tokens to exploit visual-expert affinity, making deployment on memory-limited platforms more efficient and achieving up to 2.68x inference speedup.
Contribution
The paper proposes a novel token pruning technique based on visual-expert affinity to improve expert locality and prefetching in VL-MoE models, enabling more efficient inference.
Findings
VisMMoE achieves up to 2.68x inference speedup.
Token pruning reduces expert working set size and stabilizes expert access patterns.
The system maintains competitive accuracy while improving efficiency.
Abstract
Large-scale vision-language mixture-of-experts (VL-MoE) models provide strong multimodal capability, but efficient deployment on memory-constrained platforms remains difficult. Existing MoE offloading systems are largely designed for text-centric workloads and become much less effective for visual-heavy inputs, where large numbers of visual tokens induce broader and less predictable expert accesses. We present VisMMoE, a VL-MoE offloading system built on a single systems insight: pruning redundant visual tokens can improve offloading not only by reducing computation, but also by reshaping expert demand. We refer to this effect as visual-expert affinity: token pruning makes expert accesses more concentrated within layers and more stable across layers, producing a smaller and more predictable expert working set. Guided by this insight, VisMMoE combines affinity-aware token…
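To make the notion of an expert working set concrete, the toy sketch below counts the distinct experts activated in one MoE layer before and after dropping low-importance tokens. It is an illustration only, not the VisMMoE implementation: the helper names (`top_k_experts`, `expert_working_set`, `prune_tokens`), the router logits, and the per-token importance scores are all hypothetical, and the pruning criterion the paper actually uses is not specified in this summary.

```python
from typing import List, Optional, Sequence, Set

def top_k_experts(logits: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest router logits for one token."""
    ranked = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
    return ranked[:k]

def expert_working_set(
    router_logits: Sequence[Sequence[float]],
    k: int,
    token_ids: Optional[Sequence[int]] = None,
) -> Set[int]:
    """Distinct experts that must be resident to serve the given tokens."""
    if token_ids is None:
        token_ids = range(len(router_logits))
    active: Set[int] = set()
    for t in token_ids:
        active.update(top_k_experts(router_logits[t], k))
    return active

def prune_tokens(importance: Sequence[float], keep_ratio: float) -> List[int]:
    """Keep the highest-scoring tokens (hypothetical importance score)."""
    n_keep = max(1, int(len(importance) * keep_ratio))
    ranked = sorted(range(len(importance)), key=lambda t: importance[t], reverse=True)
    return sorted(ranked[:n_keep])

# Toy data (assumed): 6 visual tokens routed over 4 experts, top-1 routing.
router_logits = [
    [2.0, 0.1, 0.0, 0.0],
    [1.5, 0.2, 0.0, 0.1],
    [0.1, 1.8, 0.0, 0.0],
    [0.0, 0.1, 1.2, 0.0],
    [0.0, 0.0, 0.1, 1.1],
    [1.9, 0.0, 0.2, 0.0],
]
importance = [0.9, 0.8, 0.7, 0.1, 0.05, 0.6]

full_set = expert_working_set(router_logits, k=1)
kept = prune_tokens(importance, keep_ratio=0.5)
pruned_set = expert_working_set(router_logits, k=1, token_ids=kept)

print(len(full_set), len(pruned_set))  # prints "4 2"
```

In this toy case, pruning half of the tokens shrinks the working set from four experts to two. Because the pruned tokens are a subset of the original tokens, the pruned working set can never be larger than the full one; how much it shrinks in practice depends on the model's routing behaviour and the pruning criterion.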
