VisMMoE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading

Cheng Xu; Xiaofeng Hou; Jiacheng Liu; Chao Li

arXiv:2605.05899·cs.LG·May 8, 2026

TL;DR

VisMMoE is an offloading system for vision-language MoE models that prunes redundant visual tokens to exploit visual-expert affinity, making deployment on memory-constrained platforms more efficient and achieving up to 2.68x inference speedup.

Contribution

The paper proposes a novel token pruning technique based on visual-expert affinity to improve expert locality and prefetching in VL-MoE models, enabling more efficient inference.

Findings

01. VisMMoE achieves up to 2.68x inference speedup.
02. Token pruning reduces expert working set size and stabilizes expert access patterns.
03. The system maintains competitive accuracy while improving efficiency.

Abstract

Large-scale vision-language mixture-of-experts (VL-MoE) models provide strong multimodal capability, but efficient deployment on memory-constrained platforms remains difficult. Existing MoE offloading systems are largely designed for text-centric workloads and become much less effective for visual-heavy inputs, where large numbers of visual tokens induce broader and less predictable expert accesses. We present VisMMoE, a VL-MoE offloading system built on a single systems insight: pruning redundant visual tokens can improve offloading not only by reducing computation, but also by reshaping expert demand. We refer to this effect as visual-expert affinity: token pruning makes expert accesses more concentrated within layers and more stable across layers, producing a smaller and more predictable expert working set. Guided by this insight, VisMMoE combines affinity-aware token…
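As a rough illustration of the "expert working set" idea in the abstract, the sketch below counts how many distinct experts a single MoE layer touches before and after dropping visual tokens. Everything here is an assumption for illustration and not taken from the paper: the function names, the random router logits, the top-2 routing, and especially the random importance scores, which stand in for whatever affinity-aware criterion VisMMoE actually uses. It only shows the baseline effect that fewer tokens cannot activate more experts; the concentration and cross-layer stability the abstract describes are stronger claims that this toy does not reproduce.

```python
import random

def top_k_experts(logits, k):
    """Return the indices of the k largest router logits for one token."""
    return sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]

def expert_working_set(token_logits, k):
    """Union of the experts selected by any token in one MoE layer."""
    working_set = set()
    for logits in token_logits:
        working_set.update(top_k_experts(logits, k))
    return working_set

def prune_tokens(token_logits, scores, keep):
    """Keep the `keep` highest-scoring tokens (hypothetical pruning criterion)."""
    kept = sorted(range(len(token_logits)), key=lambda t: scores[t], reverse=True)[:keep]
    return [token_logits[t] for t in sorted(kept)]  # preserve original token order

# Assumed toy sizes: 64 experts, top-2 routing, 576 visual tokens pruned to 64.
rng = random.Random(0)
NUM_EXPERTS, TOP_K, NUM_TOKENS, KEEP = 64, 2, 576, 64
token_logits = [[rng.gauss(0, 1) for _ in range(NUM_EXPERTS)] for _ in range(NUM_TOKENS)]
scores = [rng.random() for _ in range(NUM_TOKENS)]  # placeholder importance scores

full_ws = expert_working_set(token_logits, TOP_K)
pruned_ws = expert_working_set(prune_tokens(token_logits, scores, KEEP), TOP_K)
print(f"experts touched: full={len(full_ws)}, pruned={len(pruned_ws)}")
```

In an offloading setting, the size of this set is a proxy for how many expert weights must be resident in fast memory (or fetched) for the layer, which is why a smaller, more predictable set would help caching and prefetching.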
