TL;DR
This paper systematically studies fusion strategies for Mixture of Vision Encoders in multimodal large language models, proposing a lightweight architecture called LEO that improves performance and generalizes well across diverse vision-language tasks.
Contribution
It introduces a set of principles for effective token-level fusion in MoVE-based MLLMs and presents LEO, a simple architecture that outperforms existing methods on multiple benchmarks.
Findings
LEO achieves superior results on most vision-language benchmarks.
LEO generalizes well to the autonomous driving domain without architectural changes.
The proposed fusion principles enhance multimodal understanding in MLLMs.
Abstract
Mixture of Vision Encoders (MoVE) has emerged as a powerful approach to enhance the fine-grained visual understanding of multimodal large language models (MLLMs), improving their ability to handle tasks such as complex optical character recognition and scene understanding. Despite these advances, effectively combining diverse encoders and their visual tokens, while also scaling to high-resolution inputs, remains an open challenge. In this work, we conduct a systematic study of fusion designs for MoVE-based MLLMs, highlighting principles for token-level integration across complementary encoders. Our study shows that a lightweight recipe consisting of post-adaptation fusion with independent projectors, tile-level sequence interleaving, and dynamic tiling with global context delivers strong performance on diverse benchmarks. We integrate these principles into a simple and effective…
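To make the three-part recipe in the abstract concrete, here is a minimal NumPy sketch of one plausible reading of it. It is not the LEO implementation. The following are all assumptions made for illustration: the two "encoders" are random linear patch embeddings, the tile and patch sizes are toy values, the input is already resized to a multiple of the tile size, the global thumbnail is appended last, and the names `split_tiles`, `toy_encoder`, and `fuse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: two "encoders" with different output widths (8 and 5),
# each followed by its own projector into a shared 16-dim LLM embedding space.
ENC_A, ENC_B = rng.normal(size=(12, 8)), rng.normal(size=(12, 5))
PROJ_A, PROJ_B = rng.normal(size=(8, 16)), rng.normal(size=(5, 16))

def split_tiles(image, tile):
    """Cut an (H, W, C) image into row-major tile x tile crops and append an
    average-pooled global thumbnail of the same size (assumed placement: last)."""
    h, w, c = image.shape
    assert h % tile == 0 and w % tile == 0, "sketch assumes a pre-resized input"
    rows, cols = h // tile, w // tile
    crops = [image[r * tile:(r + 1) * tile, q * tile:(q + 1) * tile]
             for r in range(rows) for q in range(cols)]
    # Average each rows x cols block of pixels down to one thumbnail pixel.
    thumb = image.reshape(tile, rows, tile, cols, c).mean(axis=(1, 3))
    return crops + [thumb]

def toy_encoder(crop, weight, patch):
    """Flatten non-overlapping patch x patch patches and apply a linear map."""
    t, _, c = crop.shape
    n = t // patch
    patches = (crop.reshape(n, patch, n, patch, c)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(n * n, patch * patch * c))
    return patches @ weight

def fuse(image, tile=4, patch=2):
    """Post-adaptation fusion: project each encoder's tokens independently,
    then interleave the two token blocks tile by tile along the sequence axis."""
    seq = []
    for crop in split_tiles(image, tile):
        tok_a = toy_encoder(crop, ENC_A, patch) @ PROJ_A  # encoder A + its projector
        tok_b = toy_encoder(crop, ENC_B, patch) @ PROJ_B  # encoder B + its projector
        seq.extend([tok_a, tok_b])                        # A_i, B_i for tile i
    return np.concatenate(seq, axis=0)

tokens = fuse(rng.random((8, 12, 3)))
print(tokens.shape)  # (56, 16): 7 crops x 2 encoders x 4 tokens each
```

In this sketch the fusion happens only after each encoder's tokens have been mapped into the shared embedding space, so the encoders may have different output widths. Interleaving at tile granularity keeps both views of the same image region adjacent in the sequence.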
