Rethinking the Mixture of Vision Encoders Paradigm for Enhanced Visual Understanding in Multimodal LLMs

Mozhgan Nasr Azadani; James Riddell; Sean Sedwards; Krzysztof Czarnecki

arXiv:2501.06986 · cs.CV · March 9, 2026

TL;DR

This paper systematically studies fusion strategies for Mixture of Vision Encoders (MoVE) in multimodal large language models (MLLMs) and proposes LEO, a lightweight architecture that improves performance and generalizes well across diverse vision-language tasks.

Contribution

The paper introduces a set of principles for effective token-level fusion in MoVE-based MLLMs and presents LEO, a simple architecture that outperforms existing methods on multiple benchmarks.

Findings

01. LEO outperforms existing methods on most vision-language benchmarks.

02. LEO generalizes well to the autonomous driving domain without architectural changes.

03. The proposed fusion principles enhance multimodal understanding in MLLMs.

Abstract

Mixture of Vision Encoders (MoVE) has emerged as a powerful approach to enhance the fine-grained visual understanding of multimodal large language models (MLLMs), improving their ability to handle tasks such as complex optical character recognition and scene understanding. Despite these advances, effectively combining diverse encoders and their visual tokens, while also scaling to high-resolution inputs, remains an open challenge. In this work, we conduct a systematic study of fusion designs for MoVE-based MLLMs, highlighting principles for token-level integration across complementary encoders. Our study shows that a lightweight recipe consisting of post-adaptation fusion with independent projectors, tile-level sequence interleaving, and dynamic tiling with global context delivers strong performance on diverse benchmarks. We integrate these principles into a simple and effective…
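
As a rough illustration of two ingredients named in the abstract, the sketch below shows one plausible way to pick a tile grid for dynamic tiling and to interleave two encoders' tokens tile by tile. It is not taken from the LEO code: the function names `choose_grid` and `interleave_tiles`, the `max_tiles` budget, the tie-breaking rule, and the encoder A before encoder B ordering within a tile are all assumptions made for this example. Tokens are represented as plain strings and are assumed to have already passed through each encoder's own projector (the post-adaptation step).

```python
from typing import List, Tuple


def choose_grid(width: int, height: int, max_tiles: int = 6) -> Tuple[int, int]:
    """Pick a (cols, rows) tile grid for an image (hypothetical helper).

    Assumption: the grid whose aspect ratio is closest to the image's
    is chosen, subject to cols * rows <= max_tiles. Ties are broken in
    favour of more tiles so that more resolution is kept.
    """
    target = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    return min(
        candidates,
        key=lambda g: (abs(g[0] / g[1] - target), -(g[0] * g[1])),
    )


def interleave_tiles(
    enc_a_tiles: List[List[str]], enc_b_tiles: List[List[str]]
) -> List[str]:
    """Build one token sequence by alternating encoders per tile.

    Each argument holds one token list per tile, already projected by
    that encoder's own projector. The output is
    [A(tile 0), B(tile 0), A(tile 1), B(tile 1), ...].
    The A-then-B order is an assumption of this sketch.
    """
    if len(enc_a_tiles) != len(enc_b_tiles):
        raise ValueError("both encoders must cover the same tiles")
    sequence: List[str] = []
    for tokens_a, tokens_b in zip(enc_a_tiles, enc_b_tiles):
        sequence.extend(tokens_a)
        sequence.extend(tokens_b)
    return sequence


if __name__ == "__main__":
    cols, rows = choose_grid(800, 400)
    print(f"grid: {cols} x {rows}")
    # Two local tiles plus one global thumbnail treated as an extra tile.
    a = [["a_t0"], ["a_t1"], ["a_global"]]
    b = [["b_t0"], ["b_t1"], ["b_global"]]
    print(interleave_tiles(a, b))
```

Treating the downscaled global view as one more tile is also an assumption here; the abstract only states that dynamic tiling is combined with global context, not where those tokens sit in the sequence.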

Code & Models

Repositories

mozhgan91/leo (PyTorch, official)
