Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models

Jiaxing Chen; Yuxuan Liu; Dehu Li; Xiang An; Weimo Deng; Ziyong Feng; Yongle Zhao; Yin Xie

arXiv:2403.19322 · cs.CV · June 19, 2024 · 1 citation

TL;DR

This paper introduces P2G, a plug-and-play framework that enhances multimodal large language models' visual reasoning by employing expert agents for on-the-fly grounding, especially in high-resolution images, achieving performance comparable to GPT-4V.

Contribution

The paper presents P2G, a novel framework enabling external grounding in MLLMs, and introduces P2GB, a benchmark for evaluating reasoning in high-resolution images, demonstrating improved performance.

Findings

01  P2G achieves performance comparable to GPT-4V on P2GB.

02  Grounding with external agents improves reasoning in high-resolution images.

03  P2G outperforms baseline methods in visual reasoning tasks.

Abstract

The rise of Multimodal Large Language Models (MLLMs), renowned for their advanced instruction-following and reasoning capabilities, has significantly propelled the field of visual reasoning. However, due to limitations in their image tokenization processes, most MLLMs struggle to capture fine details of text and objects in images, especially in high-resolution samples. To overcome this limitation, we introduce P2G, a novel framework for plug-and-play grounding in MLLMs. P2G utilizes the tool-usage potential of MLLMs to employ expert agents for on-the-fly grounding of reasoning into critical visual and textual elements in images, thereby enabling deliberate reasoning through multimodal prompting. Additionally, we develop P2GB, a benchmark designed to evaluate MLLMs' proficiency in understanding inter-object relationships and textual content in challenging high-resolution images.…
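
The grounding loop described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the names `Step`, `answer_with_grounding`, `toy_model`, and `toy_ocr` are hypothetical, the "model" is a hard-coded stub, and the only idea taken from the abstract is that the model may request an expert agent, whose output is fed back as extra context before the final answer is produced.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One model turn: either a final answer or a request for an agent (hypothetical)."""
    answer: Optional[str] = None
    tool: Optional[str] = None  # name of the expert agent the model asks for


Model = Callable[[str, object, List[str]], Step]
Agent = Callable[[object, str], str]


def answer_with_grounding(
    question: str,
    image: object,
    model: Model,
    agents: Dict[str, Agent],
    max_calls: int = 2,
) -> Tuple[str, List[str]]:
    """Let the model call expert agents, then answer with the gathered evidence."""
    evidence: List[str] = []
    for _ in range(max_calls):
        step = model(question, image, evidence)
        if step.tool is None:
            # The model is confident enough to answer without more grounding.
            return step.answer or "", evidence
        agent = agents.get(step.tool)
        if agent is None:
            break  # unknown agent requested: stop calling tools
        # Feed the agent's output back to the model as additional prompt context.
        evidence.append(f"[{step.tool}] {agent(image, question)}")
    final = model(question, image, evidence)
    return final.answer or "", evidence


def toy_model(question: str, image: object, evidence: List[str]) -> Step:
    """Stub standing in for an MLLM: asks for OCR once, then answers from it."""
    if not evidence:
        return Step(tool="ocr")
    text = evidence[-1].split("] ", 1)[1]
    return Step(answer=f"The sign reads {text}")


def toy_ocr(image: object, question: str) -> str:
    """Stub standing in for a text-grounding agent; returns a fixed string."""
    return "EXIT 12"
```

Calling `answer_with_grounding("What does the sign say?", "img", toy_model, {"ocr": toy_ocr})` runs one agent call and then a final answer turn. Because agents are passed in as a plain dictionary, one can be added or swapped without changing the loop, which is the sense of "plug-and-play" assumed here.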

Code & Models

Datasets

Valorix/P2GB
dataset · 11 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems