GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models
Shurong Zheng, Yousong Zhu, Hongyin Zhao, Fan Yang, Yufei Zhan, Ming Tang, Jinqiao Wang

TL;DR
This paper introduces GeM-VG, a multimodal large language model designed for generalized multi-image visual grounding, supported by a new dataset and a hybrid finetuning strategy, achieving superior performance across various tasks.
Contribution
It presents a unified model for diverse multi-image grounding tasks, introduces the MG-Data-240K dataset, and proposes a hybrid reinforcement finetuning method to enhance reasoning and perception.
Findings
Outperforms previous MLLMs on MIG-Bench and MC-Bench by 2.0% and 9.7%, respectively.
Achieves a 9.1% improvement on ODINW for single-image grounding.
Demonstrates strong multi-image understanding capabilities.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grounding. However, they are constrained to single-target localization and a limited range of practical tasks, owing to the lack of unified modeling for generalized grounding tasks. We therefore propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding. To support this, we systematically categorize and organize existing multi-image grounding tasks according to their reliance on cross-image cues and reasoning, and introduce the MG-Data-240K dataset, which addresses the limitations of existing datasets regarding target quantity and image relation. To tackle the challenge of robustly handling diverse multi-image grounding tasks, we further propose a hybrid reinforcement finetuning…
Peer Reviews
No public reviews on file for this paper yet.
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
