MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing

Matvey Skripkin; Elizaveta Goncharova; Dmitrii Tarasov; Andrey Kuznetsov

arXiv:2502.15381·cs.CV·February 24, 2025

TL;DR

MOVE is a mixture-of-vision-encoders approach that automatically routes each input to the most appropriate pre-trained encoder for domain-specific multimodal tasks, reaching competitive accuracy without slicing high-resolution images.

Contribution

The paper proposes a simple method for leveraging multiple pre-trained vision encoders, rather than a single one, to improve multimodal performance on specialized domains.

Findings

01

Achieves competitive accuracy on benchmarks including ChartQA, MMBench, and MMMU.

02

Effectively routes inputs to suitable encoders for domain-specific tasks.

03

Reduces complexity by avoiding image slicing for high-resolution images.
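To make the cost of image slicing concrete, the sketch below counts visual tokens for a tiled high-resolution image versus a single encoder pass. The tile size, patch size, image resolution, and the extra thumbnail tile are illustrative assumptions, not values reported in the paper, and the function names are hypothetical.

```python
import math


def vit_tokens(height: int, width: int, patch: int) -> int:
    """Number of patch tokens a ViT-style encoder produces for one image."""
    return (height // patch) * (width // patch)


def sliced_tokens(height: int, width: int, tile: int, patch: int,
                  with_thumbnail: bool = True) -> int:
    """Token count when the image is cut into tile x tile crops.

    Assumption: every crop is encoded separately at the tile resolution,
    optionally together with one downscaled thumbnail of the full image.
    """
    tiles = math.ceil(height / tile) * math.ceil(width / tile)
    per_tile = vit_tokens(tile, tile, patch)
    return (tiles + (1 if with_thumbnail else 0)) * per_tile


# Hypothetical setting: 1344x1344 input, 448x448 tiles, 14x14 patches.
single_pass = vit_tokens(448, 448, 14)         # 1024 tokens
with_slicing = sliced_tokens(1344, 1344, 448, 14)  # 10240 tokens
print(single_pass, with_slicing)
```

Under these assumed numbers, slicing yields ten times as many visual tokens as one encoder pass, which is the kind of overhead a slicing-free design sidesteps.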

Abstract

Multimodal language models (MLMs) integrate visual and textual information by coupling a vision encoder with a large language model through a dedicated adapter. While existing approaches commonly rely on a single pre-trained vision encoder, there is a wide variety of specialized encoders that can boost a model's performance in distinct domains. In this work, we propose MOVE (Mixture of Vision Encoders), a simple yet effective approach to leveraging multiple pre-trained encoders for specialized multimodal tasks. MOVE automatically routes inputs to the most appropriate encoder among candidates such as Unichat, InternViT, and Texify, thereby enhancing performance across a diverse set of benchmarks, including ChartQA, MMBench, and MMMU. Experimental results demonstrate that MOVE achieves competitive accuracy without incurring the complexities of image slicing for high-resolution images.
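The routing idea in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the router here is a hand-written heuristic standing in for whatever learned classifier the paper uses, the encoders are stubs that return dummy features, and the mapping from domain labels to encoder names is assumed for the example.

```python
from typing import Callable, Dict, List, Tuple

# Toy grayscale image: rows of pixel intensities in [0, 1].
Image = List[List[float]]
Encoder = Callable[[Image], List[float]]


def white_fraction(image: Image) -> float:
    """Share of near-white pixels, used only by the toy router below."""
    pixels = [p for row in image for p in row]
    return sum(1 for p in pixels if p > 0.9) / len(pixels)


def toy_router(image: Image) -> str:
    """Hypothetical stand-in for a learned router: returns a domain label."""
    w = white_fraction(image)
    if w > 0.8:
        return "formula"
    if w > 0.5:
        return "chart"
    return "general"


def make_stub_encoder(dim: int) -> Encoder:
    """Stub encoder: returns `dim` copies of the mean intensity as features."""
    def encode(image: Image) -> List[float]:
        pixels = [p for row in image for p in row]
        return [sum(pixels) / len(pixels)] * dim
    return encode


# Assumed domain-to-encoder mapping; the names come from the abstract,
# the pairing with domains and the feature sizes are illustrative only.
ENCODERS: Dict[str, Tuple[str, Encoder]] = {
    "chart": ("Unichat", make_stub_encoder(4)),
    "formula": ("Texify", make_stub_encoder(4)),
    "general": ("InternViT", make_stub_encoder(4)),
}


def route_and_encode(image: Image) -> Tuple[str, List[float]]:
    """Pick exactly one encoder per input, then encode with it."""
    name, encoder = ENCODERS[toy_router(image)]
    return name, encoder(image)


photo = [[0.1] * 4 for _ in range(4)]
chart = [[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 0, 0]]
page = [[1.0] * 4 for _ in range(4)]
page[0][0] = 0.0

for img in (photo, chart, page):
    print(route_and_encode(img)[0])
```

Only the selected encoder runs for a given input, so the per-image cost stays close to that of a single-encoder model; in a full system the chosen encoder's features would then pass through the adapter to the language model.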

Taxonomy

Methods: Sparse Evolutionary Training