DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model

Zhou Tao; Shida Wang; Yongxiang Hua; Haoyu Cao; Linli Xu

arXiv:2512.12633·cs.CV·March 18, 2026

Open Access

TL;DR

This paper introduces DiG, a novel training framework for multimodal large language models that enhances their fine-grained visual perception and spatial reasoning by learning to identify differences between image pairs, supported by a scalable data generation pipeline.

Contribution

We propose Differential Grounding (DiG), a new proxy task framework with an automated 3D rendering pipeline and curriculum learning to improve fine-grained perception in MLLMs.

Findings

01. Significant performance improvements on visual perception benchmarks.

02. Effective transfer of fine-grained perception skills to downstream tasks.

03. Robustness of the approach across various multimodal perception benchmarks.

Abstract

Multimodal Large Language Models (MLLMs) have achieved impressive performance on a variety of vision-language tasks, yet their fine-grained visual perception and precise spatial reasoning remain limited. In this work, we introduce DiG (Differential Grounding), a novel proxy task framework in which MLLMs learn fine-grained perception by identifying and localizing all differences between similar image pairs without prior knowledge of how many differences there are. To support scalable training, we develop an automated 3D rendering-based data generation pipeline that produces high-quality paired images with fully controllable discrepancies. To address the sparsity of difference signals, we further employ curriculum learning that progressively increases complexity from single to multiple differences, enabling stable optimization. Extensive experiments demonstrate that DiG significantly improves model performance across…
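The abstract excerpt does not state the curriculum schedule or how localized differences are scored, so the following is only a minimal sketch under assumptions. It pairs a hypothetical linear schedule (`max_differences`, with an assumed cap of 4) that grows the allowed number of differences per image pair during training, with a hypothetical count-agnostic score (`f1_score`) that greedily matches predicted boxes to ground-truth difference boxes by IoU. None of these names, thresholds, or formulas come from the paper.

```python
from typing import List, Tuple

# Axis-aligned box as (x1, y1, x2, y2); this representation is an assumption.
Box = Tuple[float, float, float, float]


def max_differences(step: int, total_steps: int, cap: int = 4) -> int:
    """Hypothetical linear curriculum: 1 difference at the start, `cap` at the end."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1 + int(progress * (cap - 1))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_score(pred: List[Box], truth: List[Box], thr: float = 0.5) -> float:
    """Greedy one-to-one matching of predictions to ground truth, scored as F1.

    Because the number of differences is unknown to the model, both missed
    differences (recall) and spurious boxes (precision) are penalized.
    """
    if not pred and not truth:
        return 1.0  # nothing to find and nothing predicted
    unmatched = list(truth)
    true_positives = 0
    for p in pred:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Halfway through training, the assumed schedule allows up to 2 differences.
    print(max_differences(50, 100))
    # One correct box and one spurious box against two true differences.
    predicted = [(0, 0, 10, 10), (50, 50, 60, 60)]
    actual = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(f1_score(predicted, actual))
```

A score of this shape rewards a model only when it both finds every difference and refrains from predicting extras, which is one plausible way to reflect the "unknown number of differences" setting described above.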

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis