VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction

Weitai Kang; Jason Kuen; Mengwei Ren; Zijun Wei; Yan Yan; Kangning Liu

arXiv:2512.11099·cs.CV·December 15, 2025

TL;DR

VGent introduces a modular visual grounding model that disentangles reasoning and prediction, leveraging a frozen MLLM and detector-based queries for fast, accurate multi-target grounding, achieving state-of-the-art results.

Contribution

The paper presents VGent, a novel modular encoder-decoder architecture that separates reasoning from prediction, enabling fast inference and modular upgrades for improved visual grounding performance.

Findings

01. Achieves a +20.6% F1 improvement over prior methods.

02. Boosts gIoU by +8.2% and cIoU by +5.8%.

03. Maintains constant, fast inference latency.

Abstract

Current visual grounding models either rely on a Multimodal Large Language Model (MLLM) that performs auto-regressive decoding, which is slow and risks hallucinations, or re-align an LLM with vision features to learn new special or object tokens for grounding, which may undermine the LLM's pretrained reasoning ability. In contrast, we propose VGent, a modular encoder-decoder architecture that explicitly disentangles high-level reasoning from low-level bounding box prediction. Specifically, a frozen MLLM serves as the encoder, keeping its reasoning capabilities untouched, while a decoder takes high-quality boxes proposed by detectors as queries and selects the target box(es) by cross-attending to the encoder's hidden states. This design fully leverages advances in both object detection and MLLMs, avoids the pitfalls of auto-regressive decoding, and enables fast inference.…
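
The selection step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `select_boxes`, the single-head attention, the residual connection, the per-box sigmoid score, the 0.5 threshold, and the random weights are all assumptions made for illustration. It only shows the general idea that detector proposals act as queries, cross-attend to hidden states from a frozen encoder, and are all scored in one parallel pass instead of being decoded token by token.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def select_boxes(box_queries, encoder_states, params, threshold=0.5):
    """Score detector proposals against frozen-encoder hidden states.

    box_queries:    (num_boxes, d) embeddings of detector-proposed boxes
    encoder_states: (seq_len, d) hidden states from the frozen encoder
    params:         hypothetical projection weights (illustrative only)
    """
    q = box_queries @ params["w_q"]      # queries come from the box proposals
    k = encoder_states @ params["w_k"]   # keys come from the encoder states
    v = encoder_states @ params["w_v"]   # values come from the encoder states

    # Single-head scaled dot-product cross-attention: (num_boxes, seq_len).
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)

    # Residual update of each box query with the attended encoder context.
    attended = box_queries + attn @ v

    # Independent sigmoid score per box, so zero, one, or many boxes can pass.
    logits = attended @ params["w_score"]
    probs = 1.0 / (1.0 + np.exp(-logits))
    keep = probs > threshold
    return keep, probs, attn


# Toy usage with random stand-ins for real detector and encoder outputs.
rng = np.random.default_rng(0)
d, num_boxes, seq_len = 16, 5, 12
params = {
    "w_q": rng.normal(size=(d, d)) / np.sqrt(d),
    "w_k": rng.normal(size=(d, d)) / np.sqrt(d),
    "w_v": rng.normal(size=(d, d)) / np.sqrt(d),
    "w_score": rng.normal(size=(d,)) / np.sqrt(d),
}
box_queries = rng.normal(size=(num_boxes, d))
encoder_states = rng.normal(size=(seq_len, d))

keep, probs, attn = select_boxes(box_queries, encoder_states, params)
print(keep.shape, probs.shape, attn.shape)
```

Because every proposal is scored in the same forward pass, the cost of this step in the sketch depends on the number of proposals and the encoder sequence length, not on how many targets end up selected, which is one plausible reading of the constant-latency claim.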

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning