Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision

Shengcao Cao; Liang-Yan Gui; Yu-Xiong Wang

arXiv:2410.08209 · cs.CV · October 17, 2025

TL;DR

This paper shows that grounding ability can emerge in large multimodal models (LMMs) trained without explicit grounding supervision. It reveals this ability with an attention-based segmentation method ("attend-and-segment") and strengthens it with a diffusion-based visual encoder (DIFFLMM), reaching 44.2% grounding mask recall.

Contribution

The paper introduces an attention-based segmentation method and a diffusion-based visual encoder to enable emergent grounding in LMMs trained without explicit grounding supervision.

Findings

01. Achieved 44.2% grounding mask recall without grounding supervision.

02. Outperformed supervised models like GLaMM on grounding tasks.

03. Demonstrated generalization and scalability of emergent grounding abilities.

Abstract

Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an "attend-and-segment" method which leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive…
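The abstract describes "attend-and-segment" only as using attention maps from a standard LMM for pixel-level segmentation. A minimal sketch of that general idea follows. It is not the authors' implementation: the function name `attention_to_mask`, the input shape `(layers, heads, num_visual_tokens)`, the averaging over layers and heads, the min-max normalization, and the fixed threshold are all assumptions chosen for illustration.

```python
import numpy as np


def attention_to_mask(attn, grid_hw, image_hw, threshold=0.5):
    """Turn one generated token's attention over visual tokens into a coarse mask.

    attn:      array of shape (layers, heads, num_visual_tokens) (assumed layout)
    grid_hw:   (rows, cols) of the visual token grid; rows * cols == num_visual_tokens
    image_hw:  (height, width) of the image; assumed to be multiples of the grid size
    threshold: illustrative cutoff on the normalized attention, not a value from the paper
    """
    gh, gw = grid_hw
    ih, iw = image_hw

    # Pool attention across layers and heads (a simplifying assumption).
    pooled = attn.mean(axis=(0, 1))
    grid = pooled.reshape(gh, gw)

    # Min-max normalize to [0, 1] so the threshold is scale-independent.
    lo, hi = grid.min(), grid.max()
    norm = (grid - lo) / (hi - lo) if hi > lo else np.zeros_like(grid)

    # Nearest-neighbor upsampling from the token grid to pixel resolution.
    upsampled = np.kron(norm, np.ones((ih // gh, iw // gw)))
    mask = upsampled >= threshold

    # Pixel coordinates (row, col) of the center of the most-attended patch.
    r, c = np.unravel_index(np.argmax(grid), grid.shape)
    point = ((r + 0.5) * ih / gh, (c + 0.5) * iw / gw)
    return mask, point


# Toy example: 2 layers, 2 heads, a 2x2 token grid, and a 4x4 image.
attn = np.zeros((2, 2, 4))
attn[..., 0] = 0.2
attn[..., 3] = 1.0
mask, point = attention_to_mask(attn, grid_hw=(2, 2), image_hw=(4, 4))
```

In this toy case the mask covers only the bottom-right quadrant, where the attention is concentrated. A peak location like `point` could also serve as a prompt for a separate promptable segmentation model to obtain a finer mask; whether and how the paper does this is not stated in the abstract above.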

Code & Models

Models

🤗
Shengcao1006/difflmm-llava-v1.5-7b-lora
model · 46 dl

Taxonomy

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training