ExpVG: Investigating the Design Space of Visual Grounding in Multimodal Large Language Model

Weitai Kang; Weiming Zhuang; Zhizhong Li; Yan Yan; Lingjuan Lyu

arXiv:2508.08066·cs.CV·August 21, 2025

Open Access

TL;DR

This paper systematically investigates design choices for visual grounding in multimodal large language models, yielding insights and optimizations that improve benchmark performance, with gains of up to +7.0% on RefCOCOg.

Contribution

It offers a comprehensive analysis of visual grounding paradigms and data design choices, guiding better fine-tuning strategies for MLLMs in VG tasks.

Findings

01 Identifies the most effective visual grounding paradigm.

02 Provides ablation results on grounding data design.

03 Achieves up to +7.0% improvement on RefCOCOg.

Abstract

Fine-grained multimodal capability in Multimodal Large Language Models (MLLMs) has emerged as a critical research direction, particularly for tackling the visual grounding (VG) problem. Despite the strong performance achieved by existing approaches, they often employ disparate design choices when fine-tuning MLLMs for VG, lacking systematic verification to support these designs. To bridge this gap, this paper presents a comprehensive study of various design choices that impact the VG performance of MLLMs. We conduct our analysis using LLaVA-1.5, which has been widely adopted in prior empirical studies of MLLMs. While more recent models exist, we follow this convention to ensure our findings remain broadly applicable and extendable to other architectures. We cover two key aspects: (1) exploring different visual grounding paradigms in MLLMs, identifying the most effective design, and…

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization