CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
Lihao Zheng, Zhenwei Shao, Yu Zhou, Yan Yang, Xintian Shen, Jiawei Chen, Hao Ma, Tao Wei

TL;DR
This paper introduces CGC, a low-cost framework that improves fine-grained multi-image understanding in multimodal large language models through compositional contrast and spatial reasoning, achieving state-of-the-art results on MIG-Bench and VLM2-Bench.
Contribution
CGC builds multi-image training instances from existing single-image grounding annotations, combining Inter-Image Contrast and Intra-Image Contrast with rule-based spatial rewards to improve fine-grained understanding without extensive new annotation.
Findings
CGC achieves state-of-the-art results on MIG-Bench and VLM2-Bench.
It improves model performance on multiple multimodal reasoning tasks.
The framework enhances object constancy and spatial alignment in multi-image understanding.
Abstract
Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (CGC), a low-cost framework for boosting the fine-grained multi-image understanding of MLLMs. Built on existing single-image grounding annotations, CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast: the former introduces semantically decoupled distractor contexts for cross-image discrimination, and the latter introduces correlated cross-view samples for object constancy. CGC further introduces a Rule-Based Spatial…
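To make the Inter-Image Contrast idea concrete, the following is a minimal sketch of how one multi-image training instance might be assembled from single-image grounding annotations. It is not the authors' implementation. The names `GroundedImage` and `build_inter_image_instance`, the record fields, and the rule that a distractor is any image whose annotated category differs from the target's are all assumptions made for illustration; the abstract only says that distractor contexts are "semantically decoupled".

```python
import random
from dataclasses import dataclass


@dataclass
class GroundedImage:
    """Hypothetical single-image grounding annotation (fields are assumed)."""
    image_id: str
    category: str   # category of the annotated object
    phrase: str     # referring expression for the object
    bbox: tuple     # (x1, y1, x2, y2) in pixel coordinates


def build_inter_image_instance(target, pool, num_distractors=3, rng=None):
    """Compose one multi-image sample: the target plus unrelated distractors.

    Assumption: "semantically decoupled" is approximated by requiring that a
    distractor's annotated category differs from the target's category.
    """
    rng = rng or random.Random()
    candidates = [
        g for g in pool
        if g.image_id != target.image_id and g.category != target.category
    ]
    if len(candidates) < num_distractors:
        raise ValueError("not enough decoupled distractors in the pool")

    # Sample distractors, then shuffle so the target position is not fixed.
    images = rng.sample(candidates, num_distractors) + [target]
    rng.shuffle(images)
    target_index = next(
        i for i, g in enumerate(images) if g.image_id == target.image_id
    )

    # The model must pick the right image and localize the object in it.
    return {
        "image_ids": [g.image_id for g in images],
        "query": target.phrase,
        "answer": {"image_index": target_index, "bbox": target.bbox},
    }
```

In this sketch the answer carries both the index of the target image and the original bounding box, so an existing single-image annotation is reused unchanged while the task becomes a cross-image discrimination problem.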
