MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Guanqun Wang; Xinyu Wei; Jiaming Liu; Ray Zhang; Yichi Zhang; Kevin Zhang; Maurice Chong; Shanghang Zhang

arXiv:2406.15768 · cs.CV · June 25, 2024

Open Access

TL;DR

MR-MLLM is a framework in which multimodal comprehension and visual perception reinforce each other: detailed outputs from vision perception models are integrated into the language model, improving performance on fine-grained multimodal tasks and on corner-case visual scenarios.

Contribution

The paper proposes a mutual reinforcement framework built on three components: shared query fusion, perception-enhanced cross-modal integration, and perception-embedded prompts. Together these aim to improve both multimodal comprehension and vision perception.

Findings

01. Superior performance on fine-grained multimodal tasks
02. Effective integration of perception outputs into language models
03. Enhanced understanding of corner case visual scenarios

Abstract

In recent years, multimodal large language models (MLLMs) have shown remarkable capabilities in tasks like visual question answering and common sense reasoning, while visual perception models have made significant strides in perception tasks, such as detection and segmentation. However, MLLMs mainly focus on high-level image-text interpretations and struggle with fine-grained visual understanding, and vision perception models usually suffer from open-world distribution shifts due to their limited model capacity. To overcome these challenges, we propose the Mutually Reinforced Multimodal Large Language Model (MR-MLLM), a novel framework that synergistically enhances visual perception and multimodal comprehension. First, a shared query fusion mechanism is proposed to harmonize detailed visual inputs from vision models with the linguistic depth of language models, enhancing multimodal…

Taxonomy

Topics: Robotics and Automated Systems

Methods: Focus