GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models

Shurong Zheng; Yousong Zhu; Hongyin Zhao; Fan Yang; Yufei Zhan; Ming Tang; Jinqiao Wang

arXiv:2601.04777 · cs.CV · January 9, 2026

TL;DR

This paper introduces GeM-VG, a multimodal large language model for generalized multi-image visual grounding. Supported by the new MG-Data-240K dataset and a hybrid reinforcement finetuning strategy, it outperforms previous MLLMs on MIG-Bench and MC-Bench and improves single-image grounding on ODINW.

Contribution

It presents a unified model for diverse multi-image grounding tasks, introduces the MG-Data-240K dataset, and proposes a hybrid reinforcement finetuning method to enhance reasoning and perception.

Findings

01

Outperforms previous MLLMs on MIG-Bench and MC-Bench by 2.0% and 9.7%, respectively.

02

Achieves a 9.1% improvement on ODINW for single-image grounding.

03

Demonstrates strong multi-image understanding capabilities.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grounding. However, they are constrained to single-target localization and a limited range of practical tasks, owing to the lack of unified modeling for generalized grounding tasks. Therefore, we propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding. To support this, we systematically categorize and organize existing multi-image grounding tasks according to their reliance on cross-image cues and reasoning, and introduce the MG-Data-240K dataset, which addresses the limitations of existing datasets regarding target quantity and image relation. To tackle the challenges of robustly handling diverse multi-image grounding tasks, we further propose a hybrid reinforcement finetuning…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning