Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
You Li, Heyu Huang, Chi Chen, Kaiyu Huang, Chao Huang, Zonghao Guo,, Zhiyuan Liu, Jinan Xu, Yuhua Li, Ruixuan Li, Maosong Sun

TL;DR
Migician introduces a novel multi-image grounding model capable of free-form, accurate grounding across multiple images, supported by a new dataset and benchmark, significantly outperforming existing models in multi-image comprehension tasks.
Contribution
The paper presents Migician, the first multi-image grounding model for free-form, precise grounding, along with the MGrounding-630k dataset and MIG-Bench for evaluation.
Findings
Outperforms existing MLLMs by 24.94% in multi-image grounding.
Surpasses larger 70B models in multi-image grounding tasks.
Achieves state-of-the-art results in multi-image comprehension.
Abstract
The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods
