Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision
Shengcao Cao, Liang-Yan Gui, Yu-Xiong Wang

TL;DR
Large multimodal models (LMMs) can develop visual grounding abilities without explicit grounding supervision. The paper reveals this ability with an attention-based segmentation method and strengthens it with a diffusion-based visual encoder, reporting strong performance on grounding benchmarks.
Contribution
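The paper introduces attend-and-segment, a method that turns attention maps from standard LMMs into pixel-level segmentation and so reveals grounding that emerges without explicit grounding supervision. It also proposes DIFFLMM, an LMM that replaces the standard CLIP visual encoder with a diffusion-based one to strengthen that grounding under the same weak supervision.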
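The idea of reading a segmentation signal out of attention can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `attention_to_mask`, the min-max normalization, the nearest-neighbor upsampling, and the fixed threshold are all assumptions chosen for illustration. It takes one generated token's attention weights over the visual tokens, reshapes them to the visual token grid, and produces a coarse binary mask plus the peak location.

```python
import numpy as np

def attention_to_mask(attn, grid_hw, image_hw, threshold=0.5):
    """Hypothetical helper: map one token's attention over visual tokens
    to a coarse binary mask and a peak point in image coordinates."""
    gh, gw = grid_hw
    ih, iw = image_hw
    # Reshape the flat attention vector to the 2D visual token grid.
    attn_map = np.asarray(attn, dtype=np.float64).reshape(gh, gw)
    # Min-max normalize to [0, 1] (an assumed normalization).
    lo, hi = attn_map.min(), attn_map.max()
    if hi > lo:
        norm = (attn_map - lo) / (hi - lo)
    else:
        norm = np.zeros_like(attn_map)
    # Nearest-neighbor upsample from the token grid to image resolution.
    rows = np.arange(ih) * gh // ih
    cols = np.arange(iw) * gw // iw
    upsampled = norm[rows[:, None], cols[None, :]]
    # Threshold into a binary mask and locate the strongest response.
    mask = upsampled >= threshold
    peak = np.unravel_index(np.argmax(upsampled), upsampled.shape)
    return mask, (int(peak[0]), int(peak[1]))

# Toy usage: a 2x2 token grid where only the bottom-right token is attended.
mask, peak = attention_to_mask([0.0, 0.0, 0.0, 1.0], (2, 2), (4, 4))
```

In practice the peak point or the coarse mask would more likely serve as a prompt for a dedicated segmentation model than as the final mask; that choice is outside what this summary states.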
Findings
Achieved 44.2% grounding mask recall without grounding supervision.
Outperformed supervised models like GLaMM on grounding tasks.
Demonstrated generalization and scalability of emergent grounding abilities.
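For readers unfamiliar with a mask recall metric, the sketch below shows one common way such a number can be computed. The exact matching protocol behind the 44.2% figure is not given in this summary, so the IoU threshold of 0.5, the many-to-one matching, and the names `mask_iou` and `mask_recall` are assumptions for illustration only.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two boolean masks of the same shape."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)

def mask_recall(pred_masks, gt_masks, iou_threshold=0.5):
    """Fraction of ground-truth masks covered by at least one prediction
    with IoU at or above the threshold (assumed matching rule)."""
    if len(gt_masks) == 0:
        return 0.0
    hits = sum(
        any(mask_iou(p, g) >= iou_threshold for p in pred_masks)
        for g in gt_masks
    )
    return hits / len(gt_masks)

# Toy usage: two ground-truth regions, one of which is predicted exactly.
gt1 = np.zeros((4, 4), dtype=bool)
gt1[:2, :2] = True
gt2 = np.zeros((4, 4), dtype=bool)
gt2[2:, 2:] = True
recall = mask_recall([gt1.copy()], [gt1, gt2])
```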
Abstract
Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an "attend-and-segment" method which leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive…
Taxonomy
Topics: Seismology and Earthquake Studies
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training
