Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models
Jiaxing Chen, Yuxuan Liu, Dehu Li, Xiang An, Weimo Deng, Ziyong Feng, Yongle Zhao, Yin Xie

TL;DR
This paper introduces P2G, a plug-and-play framework that enhances multimodal large language models' visual reasoning by employing expert agents for on-the-fly grounding, especially in high-resolution images, achieving performance comparable to GPT-4V.
Contribution
The paper presents P2G, a novel framework that enables external grounding in MLLMs, and introduces P2GB, a benchmark for evaluating reasoning in high-resolution images. The authors report improved performance with P2G.
Findings
P2G achieves performance comparable to GPT-4V on P2GB.
Grounding with external agents improves reasoning in high-resolution images.
P2G outperforms baseline methods in visual reasoning tasks.
Abstract
The rise of Multimodal Large Language Models (MLLMs), renowned for their advanced instruction-following and reasoning capabilities, has significantly propelled the field of visual reasoning. However, due to limitations in their image tokenization processes, most MLLMs struggle to capture fine details of text and objects in images, especially in high-resolution samples. To overcome this limitation, we introduce P2G, a novel framework for plug-and-play grounding in MLLMs. P2G utilizes the tool-usage potential of MLLMs to employ expert agents for on-the-fly grounding of reasoning into critical visual and textual elements in images, thereby enabling deliberate reasoning through multimodal prompting. Additionally, we develop P2GB, a benchmark designed to evaluate MLLMs' proficiency in understanding inter-object relationships and textual content in challenging high-resolution images.…
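The control flow described in the abstract (the model decides whether it needs extra grounding, expert agents supply it, and the model then reasons over the enriched prompt) can be illustrated with a minimal sketch. This is not the authors' implementation. Every name here (Evidence, ModelReply, grounded_answer, toy_model, toy_ocr_agent) is hypothetical, and the model and OCR agent are stand-in stubs rather than real MLLM or OCR calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Evidence:
    """One piece of grounding returned by an expert agent (hypothetical type)."""
    agent: str
    content: str


@dataclass
class ModelReply:
    """Either a direct answer or a request for named agents (hypothetical type)."""
    answer: Optional[str] = None
    requested_agents: Optional[List[str]] = None


# Assumed call signatures, chosen only for this illustration.
Agent = Callable[[str, str], List[Evidence]]
Model = Callable[[str, str, List[Evidence]], ModelReply]


def grounded_answer(image: str, question: str, model: Model,
                    agents: Dict[str, Agent]) -> str:
    # First pass: the model may answer directly or ask for grounding.
    reply = model(image, question, [])
    if reply.answer is not None:
        return reply.answer

    # Collect evidence from each requested agent that is actually plugged in.
    evidence: List[Evidence] = []
    for name in reply.requested_agents or []:
        agent = agents.get(name)
        if agent is not None:
            evidence.extend(agent(image, question))

    # Second pass: the model reasons again with the evidence in its prompt.
    final = model(image, question, evidence)
    return final.answer if final.answer is not None else ""


def toy_ocr_agent(image: str, question: str) -> List[Evidence]:
    # Stub: a real agent would run text detection and recognition on the image.
    return [Evidence(agent="ocr", content="EXIT 12")]


def toy_model(image: str, question: str, evidence: List[Evidence]) -> ModelReply:
    # Stub: requests OCR when it has no evidence, otherwise answers from it.
    if not evidence:
        return ModelReply(requested_agents=["ocr"])
    texts = "; ".join(e.content for e in evidence)
    return ModelReply(answer=f"Based on {texts}")


result = grounded_answer("street.jpg", "What does the sign say?",
                         toy_model, {"ocr": toy_ocr_agent})
```

Because agents are passed in as a dictionary, an agent can be added or swapped without changing the model-side logic, which is the sense of "plug-and-play" this sketch assumes.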
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
