Enhancing Perception Capabilities of Multimodal LLMs with Training-Free Fusion
Zhuokun Chen, Jinwu Hu, Zeshuai Deng, Yufeng Wang, Bohan Zhuang, Mingkui Tan

TL;DR
This paper introduces VisionFuse, a training-free framework that combines the vision encoders of multiple off-the-shelf multimodal large language models (MLLMs) to enhance visual perception.
Contribution
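VisionFuse exploits the observation that vision-encoder feature distributions are highly aligned within an MLLM family (a group of MLLMs sharing the same pretrained LLM). It concatenates the visual tokens from multiple encoders to improve multimodal task performance without retraining.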
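The token concatenation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `fuse_visual_tokens`, the list-of-floats token representation, and the toy values are all hypothetical. It assumes each encoder's output has already been projected into the shared LLM embedding width.

```python
from typing import List, Sequence

Token = List[float]  # one visual token embedding (hypothetical representation)


def fuse_visual_tokens(per_encoder_tokens: Sequence[Sequence[Token]]) -> List[Token]:
    """Concatenate token sequences from several encoders along the sequence axis.

    Assumption: every encoder's projector maps into the same embedding width,
    as would hold when the source MLLMs share one pretrained LLM.
    """
    if not per_encoder_tokens:
        raise ValueError("need at least one encoder output")
    fused: List[Token] = []
    width = None
    for tokens in per_encoder_tokens:
        for tok in tokens:
            if width is None:
                width = len(tok)
            elif len(tok) != width:
                raise ValueError("all tokens must share the LLM embedding width")
            fused.append(list(tok))  # copy so the inputs stay untouched
    return fused


# Toy example: encoder A yields 2 tokens, encoder B yields 3, width 4.
tokens_a = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
tokens_b = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
fused = fuse_visual_tokens([tokens_a, tokens_b])
print(len(fused), len(fused[0]))  # sequence length grows, width does not
```

The fused sequence is longer than either input, so the language model sees more visual tokens per image. That is the price of combining encoders this way.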
Findings
Achieves an average performance boost of over 4% on multimodal benchmarks.
Effectively utilizes multiple vision encoders without additional training.
Reduces deployment overhead by merging language model parameters.
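The summary does not say how the language model parameters are merged. As an illustration only, the sketch below uses one common recipe: add each fine-tuned model's difference from the shared base LLM back onto the base weights. The function `merge_language_models`, the `scale` parameter, and the dict-of-lists parameter format are hypothetical, and VisionFuse may use a different rule.

```python
from typing import Dict, List, Sequence

Params = Dict[str, List[float]]  # parameter name -> flat weights (hypothetical format)


def merge_language_models(
    base: Params, finetuned: Sequence[Params], scale: float = 1.0
) -> Params:
    """Merge fine-tuned LLMs by summing their deltas from a shared base.

    Assumption: all models descend from the same pretrained LLM, so their
    parameters have identical names and shapes.
    """
    merged: Params = {}
    for name, base_w in base.items():
        merged_w = list(base_w)
        for model in finetuned:
            w = model[name]
            if len(w) != len(base_w):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            for i in range(len(base_w)):
                merged_w[i] += scale * (w[i] - base_w[i])
        merged[name] = merged_w
    return merged


# Toy example: two fine-tuned variants of the same base.
base = {"w": [1.0, 2.0]}
model_a = {"w": [1.5, 2.0]}  # changed only the first weight
model_b = {"w": [1.0, 1.0]}  # changed only the second weight
merged = merge_language_models(base, [model_a, model_b])
print(merged)
```

After merging, a single language model is deployed instead of one per source MLLM, which is where the reduced deployment overhead would come from.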
Abstract
Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision encoders with language models. Existing methods to enhance the visual perception of MLLMs often involve designing more powerful vision encoders, which requires exploring a vast design space and re-aligning each potential encoder with the language model, resulting in prohibitively high training costs. In this paper, we introduce VisionFuse, a novel integration framework that efficiently utilizes multiple vision encoders from off-the-shelf MLLMs to enhance visual perception without requiring additional training. Our approach is motivated by the observation that different MLLMs tend to focus on distinct regions given the same query and image. Moreover, we find that the feature distributions of vision encoders within an MLLM family, a group of MLLMs sharing the same pretrained LLM, are highly aligned.…
Taxonomy
Topics: Semantic Web and Ontologies · Speech and dialogue systems · Natural Language Processing Techniques
Methods: ALIGN · Focus
