MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception
Guanqun Wang, Xinyu Wei, Jiaming Liu, Ray Zhang, Yichi Zhang, Kevin Zhang, Maurice Chong, Shanghang Zhang

TL;DR
MR-MLLM introduces a framework that mutually enhances multimodal understanding and visual perception by integrating detailed visual inputs with language models, improving performance on complex tasks.
Contribution
The paper proposes a novel mutual reinforcement framework with shared query fusion, perception-enhanced cross-modal integration, and perception-embedded prompts, advancing multimodal comprehension and vision perception.
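As a rough illustration of what a perception-embedded prompt could look like, the sketch below serializes detector outputs into text that is prepended to the user question. This is not the paper's implementation: the `Detection` type, the `build_perception_prompt` function, the prompt wording, and the `min_score` threshold are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """Hypothetical container for one output of a vision perception model."""
    label: str
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized to [0, 1]
    score: float  # detector confidence in [0, 1]


def build_perception_prompt(
    question: str,
    detections: List[Detection],
    min_score: float = 0.5,  # assumed threshold, not taken from the paper
) -> str:
    """Embed detector outputs as text ahead of the question.

    Low-confidence detections are dropped, and the rest are listed
    from most to least confident.
    """
    kept = sorted(
        (d for d in detections if d.score >= min_score),
        key=lambda d: d.score,
        reverse=True,
    )
    if not kept:
        # Nothing reliable to add, so fall back to the plain question.
        return question

    lines = [
        f"- {d.label} at [{d.box[0]:.2f}, {d.box[1]:.2f}, "
        f"{d.box[2]:.2f}, {d.box[3]:.2f}] (confidence {d.score:.2f})"
        for d in kept
    ]
    return "Detected objects:\n" + "\n".join(lines) + "\n\nQuestion: " + question


if __name__ == "__main__":
    example = [
        Detection("dog", (0.1, 0.2, 0.5, 0.9), 0.92),
        Detection("cat", (0.6, 0.1, 0.9, 0.4), 0.30),
    ]
    print(build_perception_prompt("What animal is on the left?", example))
```

Keeping the confidence score in the text is one possible design choice: it lets the language model weigh uncertain perception outputs rather than treating every box as ground truth.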
Findings
Superior performance on fine-grained multimodal tasks
Effective integration of perception outputs into language models
Enhanced understanding of corner-case visual scenarios
Abstract
In recent years, multimodal large language models (MLLMs) have shown remarkable capabilities in tasks like visual question answering and common sense reasoning, while visual perception models have made significant strides in perception tasks, such as detection and segmentation. However, MLLMs mainly focus on high-level image-text interpretations and struggle with fine-grained visual understanding, and vision perception models usually suffer from open-world distribution shifts due to their limited model capacity. To overcome these challenges, we propose the Mutually Reinforced Multimodal Large Language Model (MR-MLLM), a novel framework that synergistically enhances visual perception and multimodal comprehension. First, a shared query fusion mechanism is proposed to harmonize detailed visual inputs from vision models with the linguistic depth of language models, enhancing multimodal…
Taxonomy
Topics: Robotics and Automated Systems
Methods: Focus
