MoVA: Adapting Mixture of Vision Experts to Multimodal Context

Zhuofan Zong; Bingqi Ma; Dazhong Shen; Guanglu Song; Hao Shao; Dongzhi Jiang; Hongsheng Li; Yu Liu

arXiv:2404.13046 · cs.CV · November 1, 2024 · 5 cites

TL;DR

MoVA is a multimodal large language model that adaptively routes and fuses multiple vision experts through a coarse-to-fine mechanism, improving understanding across diverse image content types.

Contribution

The paper proposes MoVA, a new approach that adaptively combines multiple vision experts with a coarse-to-fine strategy, enhancing multimodal understanding in large language models.

Findings

01

MoVA outperforms state-of-the-art methods on various multimodal benchmarks.

02

The coarse-to-fine expert routing improves task-specific visual understanding.

03

Adaptive fusion of vision experts enhances generalization across diverse image types.

Abstract

As the key component of multimodal large language models (MLLMs), the visual encoder largely determines how well an MLLM understands diverse image content. Although large-scale pretrained vision encoders such as those in CLIP and DINOv2 have brought promising performance, we found that no single vision encoder dominates across all kinds of image content; e.g., the CLIP vision encoder leads to outstanding results on general image understanding but poor performance on document or chart content. To alleviate the bias of the CLIP vision encoder, we first delve into the inherent behavior of different pre-trained vision encoders and then propose MoVA, a powerful and novel MLLM that adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design a context-aware expert routing strategy…
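The coarse-to-fine idea described in the abstract can be illustrated with a toy sketch: a coarse step keeps only the most relevant experts, and a fine step fuses their features with the base encoder's features. This is not the paper's actual router or fusion module. The function names, the expert names, the relevance scores, and the softmax-weighted sum are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_experts(relevance, k):
    """Coarse stage (assumed): keep the k experts with the highest relevance."""
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    return ranked[:k]


def fuse_features(base, expert_features, relevance, selected):
    """Fine stage (assumed): add a softmax-weighted sum of the selected
    experts' features to the base encoder's features."""
    weights = softmax([relevance[name] for name in selected])
    fused = list(base)
    for weight, name in zip(weights, selected):
        for i, value in enumerate(expert_features[name]):
            fused[i] += weight * value
    return fused


# Hypothetical relevance scores, e.g. for a question about a chart in a document.
relevance = {"document": 2.0, "chart": 2.0, "segmentation": -1.0}
# Hypothetical 2-dimensional features produced by each expert.
expert_features = {
    "document": [1.0, 0.0],
    "chart": [0.0, 1.0],
    "segmentation": [5.0, 5.0],
}
base = [0.5, 0.5]  # stand-in for the base vision encoder's features

selected = route_experts(relevance, k=2)
fused = fuse_features(base, expert_features, relevance, selected)
print(selected, fused)
```

In this toy setup the low-scoring expert is dropped before fusion, so its features never influence the result. That is the point of routing first and fusing second.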

Code & Models

Repositories

templex98/mova
PyTorch · Official

Models

🤗
zongzhuofan/llama3-mova-8b
model · 22 dl · ♡ 3

Taxonomy

Topics: Geographic Information Systems Studies · Spatial Cognition and Navigation

Methods: Adapter · Contrastive Language-Image Pre-training