MoVA: Adapting Mixture of Vision Experts to Multimodal Context
Zhuofan Zong, Bingqi Ma, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li, Yu Liu

TL;DR
MoVA introduces a novel multimodal large language model that adaptively routes and fuses multiple vision experts, significantly improving understanding across diverse image content types through a coarse-to-fine mechanism.
Contribution
The paper proposes MoVA, a new approach that adaptively combines multiple vision experts with a coarse-to-fine strategy, enhancing multimodal understanding in large language models.
Findings
MoVA outperforms state-of-the-art methods on various multimodal benchmarks.
The coarse-to-fine expert routing improves task-specific visual understanding.
Adaptive fusion of vision experts enhances generalization across diverse image types.
Abstract
As the key component in multimodal large language models (MLLMs), the ability of the visual encoder greatly affects an MLLM's understanding of diverse image content. Although some large-scale pretrained vision encoders, such as the vision encoders in CLIP and DINOv2, have brought promising performance, we found that there is still no single vision encoder that can dominate various image content understanding, e.g., the CLIP vision encoder leads to outstanding results on general image understanding but poor performance on document or chart content. To alleviate the bias of the CLIP vision encoder, we first delve into the inherent behavior of different pre-trained vision encoders and then propose MoVA, a powerful and novel MLLM that adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design a context-aware expert routing strategy…
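The abstract is truncated here, so the details of the routing and fusion stages are not available on this page. As a rough illustration of the general coarse-to-fine idea only (choose a few relevant experts first, then blend their features), the toy sketch below may help. It is not the authors' implementation: the function names, the expert names `doc_expert` and `chart_expert`, the relevance scores, and the softmax gate are all assumptions made for this example.

```python
import math


def route_experts(scores, k=2):
    """Coarse stage (toy): keep the k experts with the highest relevance score.

    `scores` is a hypothetical dict mapping expert name to relevance score.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(base, expert_features, gate_logits):
    """Fine stage (toy): add a softmax-weighted sum of expert features to `base`.

    All vectors are plain lists of equal length. The gate logits are an
    assumption of this sketch, not something described in the abstract.
    """
    names = list(expert_features)
    weights = softmax([gate_logits[name] for name in names])
    fused = list(base)
    for weight, name in zip(weights, names):
        for i, value in enumerate(expert_features[name]):
            fused[i] += weight * value
    return fused


# Hypothetical relevance scores for a document-like image.
scores = {"dinov2": 0.1, "doc_expert": 0.9, "chart_expert": 0.7}
selected = route_experts(scores, k=2)

# Hypothetical 2-dimensional features: a base feature plus one per selected expert.
base_feature = [1.0, 0.0]
expert_features = {"doc_expert": [2.0, 2.0], "chart_expert": [0.0, 4.0]}
gate_logits = {"doc_expert": 0.0, "chart_expert": 0.0}  # equal logits give equal weights

fused = fuse_features(base_feature, expert_features, gate_logits)
print(selected, fused)
```

With equal gate logits, each selected expert contributes half of its feature vector on top of the base feature. A real system would operate on feature maps and learn both the routing and the gating rather than use hand-set scores.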
Taxonomy
Topics: Geographic Information Systems Studies · Spatial Cognition and Navigation
Methods: Adapter · Contrastive Language-Image Pre-training
