M3LLM: Model Context Protocol-aided Mixture of Vision Experts For Multimodal LLMs in Networks
Yongjie Zeng, Hongyang Du

TL;DR
M3LLM is a distributed multimodal large language model framework that uses the Model Context Protocol (MCP) to dynamically coordinate edge-hosted vision experts, aiming to improve inference accuracy and efficiency in wireless network environments.
Contribution
The paper proposes M3LLM, which uses the Model Context Protocol (MCP), an open protocol, to structure the input task context and coordinate a mixture of edge-hosted vision experts with a central model backbone, enabling distributed MLLM inference in wireless networks.
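As a rough illustration of network-aware expert coordination, the sketch below filters edge-hosted experts by task type and link latency, then picks the fastest match. It is a simple greedy rule written for this summary, not the routing formulation used in the paper; the names `TaskContext`, `VisionExpert`, `route`, and all field names and values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TaskContext:
    """Hypothetical structured description of an input task."""
    task_type: str         # e.g. "ocr" or "detection" (illustrative labels)
    max_latency_ms: float  # latency budget for calling an expert


@dataclass(frozen=True)
class VisionExpert:
    """Hypothetical edge-hosted vision expert with a measured link latency."""
    name: str
    skills: frozenset
    link_latency_ms: float


def route(context: TaskContext, experts: Sequence[VisionExpert]) -> Optional[VisionExpert]:
    """Return the lowest-latency expert that supports the task within budget.

    Returns None when no expert qualifies. This greedy rule is an assumption
    made for illustration only.
    """
    candidates = [
        e for e in experts
        if context.task_type in e.skills
        and e.link_latency_ms <= context.max_latency_ms
    ]
    return min(candidates, key=lambda e: e.link_latency_ms, default=None)


# Illustrative expert pool; names and latencies are made up.
experts = [
    VisionExpert("ocr-edge-a", frozenset({"ocr"}), 40.0),
    VisionExpert("ocr-edge-b", frozenset({"ocr"}), 15.0),
    VisionExpert("det-edge", frozenset({"detection"}), 5.0),
]

chosen = route(TaskContext("ocr", 50.0), experts)
print(chosen.name if chosen else "no expert available")
```

Keeping the task description and the link statistics in separate records means the same task context can be re-routed as network conditions change, without re-describing the task.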
Findings
Improves task accuracy in multimodal inference.
Reduces communication costs in wireless networks.
Enhances expert routing adaptability under dynamic conditions.
Abstract
Current Multimodal Large Language Models (MLLMs) rely on centralized architectures and often suffer from poor alignment between the input task and their fixed visual encoding modules, which limits performance on diverse and dynamic visual tasks. With the increasing deployment of resource-efficient models on edge devices in wireless networks, a new opportunity emerges to dynamically use distributed vision experts for improved MLLM inference quality. To enable this, we propose M3LLM, where the Model Context Protocol (MCP) coordinates a mixture of vision experts to achieve distributed MLLMs. Specifically, MCP is an open protocol that structures the input task context into interpretable representations, enabling wireless network-aware coordination between the central model backbone and edge-hosted vision experts. Based on the MCP representation, M3LLM formulates vision expert routing as a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Data and IoT Technologies
