CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

Lihao Zheng; Zhenwei Shao; Yu Zhou; Yan Yang; Xintian Shen; Jiawei Chen; Hao Ma; Tao Wei

arXiv:2604.22498 · cs.CV · April 27, 2026

TL;DR

This paper introduces CGC, a low-cost framework that improves fine-grained multi-image understanding in multimodal large language models through compositional contrast and a rule-based spatial reward, achieving state-of-the-art results on MIG-Bench and VLM2-Bench.

Contribution

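CGC builds compositional multi-image training instances from existing single-image grounding annotations using Inter-Image Contrast and Intra-Image Contrast, and adds rule-based spatial rewards, improving fine-grained understanding without expensive human annotation or large-scale chain-of-thought data generation.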

Findings

01. CGC achieves state-of-the-art results on MIG-Bench and VLM2-Bench.

02. It improves model performance on multiple multimodal reasoning tasks.

03. The framework strengthens object constancy and spatial alignment in multi-image understanding.

Abstract

Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still struggle with fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (CGC), a low-cost framework for boosting the fine-grained multi-image understanding of MLLMs. Built on existing single-image grounding annotations, CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast, which introduce semantically decoupled distractor contexts for cross-image discrimination and correlated cross-view samples for object constancy, respectively. CGC further introduces a Rule-Based Spatial…
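
To make the Inter-Image Contrast idea concrete, here is a minimal sketch of how a multi-image instance might be assembled from single-image grounding annotations. It is not the authors' implementation. The `GroundingSample` schema, the `build_inter_image_instance` helper, the use of a coarse `category` label to pick "semantically decoupled" distractors, and the question template are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundingSample:
    """One single-image grounding annotation (hypothetical schema)."""
    image_id: str
    phrase: str                      # referring expression, e.g. "red mug"
    box: tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels
    category: str                    # coarse label, assumed available


def build_inter_image_instance(target, pool, num_distractors=2, rng=None):
    """Pair a target image with distractor images that do not contain
    the target's category, then shuffle the image order."""
    rng = rng or random.Random()

    # Exclude the target image and any image annotated with the same category,
    # so the distractors are semantically unrelated to the query (an assumption
    # about what "semantically decoupled" could mean in practice).
    blocked = {s.image_id for s in pool if s.category == target.category}
    blocked.add(target.image_id)
    candidate_ids = sorted({s.image_id for s in pool} - blocked)

    if len(candidate_ids) < num_distractors:
        raise ValueError("not enough distractor images in the pool")

    images = [target.image_id] + rng.sample(candidate_ids, num_distractors)
    rng.shuffle(images)

    # The label records both which image holds the object and where it is.
    return {
        "images": images,
        "question": f"Which image contains {target.phrase}? Give its bounding box.",
        "answer": {"image_index": images.index(target.image_id), "box": target.box},
    }


if __name__ == "__main__":
    pool = [
        GroundingSample("a.jpg", "red mug", (1, 2, 3, 4), "mug"),
        GroundingSample("b.jpg", "dog", (0, 0, 5, 5), "dog"),
        GroundingSample("c.jpg", "cat", (0, 0, 5, 5), "cat"),
        GroundingSample("d.jpg", "blue mug", (0, 0, 1, 1), "mug"),
    ]
    print(build_inter_image_instance(pool[0], pool, rng=random.Random(0)))
```

Because the answer carries both an image index and a box, a model trained on such instances has to discriminate across images before localizing, which is the kind of cross-image discrimination the abstract describes. An Intra-Image Contrast counterpart would instead pair correlated views of the same image; it is not sketched here because the abstract does not say how those views are produced.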
