MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding

Yue Cao; Yangzhou Liu; Zhe Chen; Guangchen Shi; Wenhai Wang; Danhuai Zhao; Tong Lu

arXiv:2410.11829·cs.CV·October 16, 2024·3 cites

TL;DR

MMFuser enhances vision-language understanding by efficiently integrating multi-layer features from Vision Transformers, capturing fine-grained details without added redundancy, leading to improved benchmark performance.

Contribution

Introduces a multi-layer feature fuser that dynamically combines deep and shallow features from ViTs, improving visual detail representation in MLLMs.

Findings

1. Significant performance improvements on benchmarks.
2. Efficient integration of multi-layer features.
3. Lightweight alternative to multi-encoder methods.

Abstract

Despite significant advancements in Multimodal Large Language Models (MLLMs) for understanding complex human intentions through cross-modal interactions, capturing intricate image details remains challenging. Previous methods integrating multiple vision encoders to enhance visual detail introduce redundancy and computational overhead. We observe that most MLLMs utilize only the last-layer feature map of the vision encoder for visual representation, neglecting the rich fine-grained information in shallow feature maps. To address this issue, we propose MMFuser, a simple yet effective multi-layer feature fuser that efficiently integrates deep and shallow features from Vision Transformers (ViTs). Specifically, it leverages semantically aligned deep features as queries to dynamically extract missing details from shallow features, thus preserving semantic alignment while enriching the…
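
The query-based fusion described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes plain dense cross-attention, a residual addition of the extracted details onto the deep features, and random projection matrices. The names `fuse_layers`, `w_q`, `w_k`, and `w_v`, and all sizes, are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_layers(deep, shallow_layers, w_q, w_k, w_v):
    """Hypothetical fuser: deep tokens query tokens from shallow layers.

    deep:           (N, D) last-layer ViT tokens, used as queries.
    shallow_layers: list of (N, D) token maps from earlier layers.
    w_q, w_k, w_v:  (D, D) projection matrices (assumed, not from the paper).
    """
    # Stack all shallow-layer tokens into one key/value set: (L * N, D).
    shallow = np.concatenate(shallow_layers, axis=0)

    q = deep @ w_q      # queries come from the deep features
    k = shallow @ w_k   # keys come from the shallow features
    v = shallow @ w_v   # values come from the shallow features

    # Scaled dot-product attention: each deep token weights every shallow token.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)

    # Residual addition (an assumption) keeps the deep features as the base
    # and adds the detail gathered from the shallow layers.
    detail = attn @ v
    return deep + detail, attn


rng = np.random.default_rng(0)
num_tokens, dim = 16, 32  # illustrative sizes only
deep = rng.standard_normal((num_tokens, dim))
shallow_layers = [rng.standard_normal((num_tokens, dim)) for _ in range(3)]
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

fused, attn = fuse_layers(deep, shallow_layers, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

Because the output has the same shape as the deep feature map, a fuser of this kind would leave the number of visual tokens passed to the language model unchanged, which is one reading of the "without added redundancy" claim.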

Code & Models

Repositories

yuecao0119/MMFuser (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques