MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model
Chaoya Jiang, Jia Hongrui, Haiyang Xu, Wei Ye, Mengfan Dong, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang

TL;DR
MaVEn is a novel multi-granularity visual encoding framework that improves multimodal large language models' multi-image reasoning by combining semantic symbols with detailed features, enhancing understanding and efficiency.
Contribution
MaVEn introduces a dual encoding approach with a dynamic reduction mechanism, advancing multi-image reasoning capabilities in MLLMs beyond existing single-image focused models.
Findings
Significantly improves multi-image reasoning accuracy.
Enhances single-image understanding performance.
Increases processing efficiency for long visual sequences.
Abstract
This paper presents MaVEn, an innovative Multi-granularity Visual Encoding framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning. Current MLLMs primarily focus on single-image visual understanding, limiting their ability to interpret and integrate information across multiple images. MaVEn addresses this limitation by combining discrete visual symbol sequences, which abstract coarse-grained semantic concepts, with traditional continuous representation sequences that model fine-grained features. This dual approach bridges the semantic gap between visual and textual data, thereby improving the model's ability to process and interpret information from multiple images effectively. Additionally, we design a dynamic reduction mechanism for long-sequence continuous features to enhance multi-image processing efficiency.…
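The dual encoding described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not say how the discrete symbols are produced or how the reduction chooses which features to keep. The sketch assumes nearest-neighbor lookup in a small codebook for the discrete symbols and top-k selection by dot-product score against a query vector for the reduction. The names `nearest_code`, `reduce_patches`, and `hybrid_encode`, and the parameters `codebook`, `query`, and `keep`, are all hypothetical.

```python
import math


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, math.inf
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def reduce_patches(patches, query, keep):
    """Keep the `keep` patches scoring highest against `query`, in their original order.

    Assumption: relevance is a plain dot product; the paper's actual scoring may differ.
    """
    scores = [sum(a * b for a, b in zip(p, query)) for p in patches]
    top = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)[:keep]
    return [patches[i] for i in sorted(top)]


def hybrid_encode(patches, codebook, query, keep):
    """Return (discrete symbol ids, reduced continuous features) for one image."""
    discrete = [nearest_code(p, codebook) for p in patches]  # coarse-grained symbols
    continuous = reduce_patches(patches, query, keep)        # fine-grained, shortened
    return discrete, continuous


# Toy data: four 2-D "patch features", a two-entry codebook, and a query vector.
patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.8]]
codebook = [[1.0, 0.0], [0.0, 1.0]]
discrete, continuous = hybrid_encode(patches, codebook, query=[1.0, 0.0], keep=2)
print(discrete)    # [0, 1, 0, 1]
print(continuous)  # [[1.0, 0.0], [0.9, 0.1]]
```

In this toy setup the discrete sequence stays short and symbolic while the continuous sequence shrinks from four patches to two, which is the kind of efficiency gain the abstract attributes to the reduction mechanism for long visual sequences.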
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Subtitles and Audiovisual Media
MethodsFocus
