Mono3DVG-EnSD: Enhanced Spatial-aware and Dimension-decoupled Text Encoding for Monocular 3D Visual Grounding

Yuzhen Li; Min Liu; Zhaoyang Li; Yuan Bian; Xueping Wang; Erbo Zhai; Yaonan Wang

arXiv:2511.06908 · cs.CV · November 11, 2025

TL;DR

This paper introduces Mono3DVG-EnSD, a novel framework for monocular 3D visual grounding that enhances spatial understanding and reduces cross-dimensional interference, achieving state-of-the-art results on the Mono3DRefer dataset.

Contribution

The paper proposes a new method with CLIP-LCA and D2M modules to improve spatial reasoning and dimension-specific feature extraction in monocular 3D visual grounding.

Findings

01

Achieves state-of-the-art performance on the Mono3DRefer dataset.

02

Significantly improves Far(Acc@0.5) by +13.54%.

03

Effectively reduces cross-dimensional interference in textual and visual features.

Abstract

Monocular 3D Visual Grounding (Mono3DVG) is an emerging task that locates 3D objects in RGB images using text descriptions with geometric cues. However, existing methods face two key limitations. Firstly, they often over-rely on high-certainty keywords that explicitly identify the target object while neglecting critical spatial descriptions. Secondly, generalized textual features contain both 2D and 3D descriptive information, thereby capturing an additional dimension of details compared to singular 2D or 3D visual features. This characteristic leads to cross-dimensional interference when refining visual features under text guidance. To overcome these challenges, we propose Mono3DVG-EnSD, a novel framework that integrates two key components: the CLIP-Guided Lexical Certainty Adapter (CLIP-LCA) and the Dimension-Decoupled Module (D2M). The CLIP-LCA dynamically masks high-certainty…
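The keyword-masking idea named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes each token already carries a certainty score in [0, 1], and it replaces tokens at or above a threshold with a placeholder so that the remaining spatial words stay visible to the text encoder. The function name `mask_high_certainty`, the scores, the threshold, and the `[MASK]` placeholder are all hypothetical; how CLIP-LCA actually computes certainty is not described in the text above.

```python
def mask_high_certainty(tokens, scores, threshold=0.8, mask_token="[MASK]"):
    """Replace tokens whose certainty score is >= threshold with mask_token.

    `scores` is assumed to be given (hypothetical); this sketch does not
    compute certainty itself.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    return [mask_token if s >= threshold else t for t, s in zip(tokens, scores)]


# Hypothetical description and made-up certainty scores for illustration.
tokens = ["the", "black", "car", "on", "the", "far", "left"]
scores = [0.10, 0.85, 0.95, 0.20, 0.10, 0.30, 0.35]

masked = mask_high_certainty(tokens, scores)
print(" ".join(masked))
```

With these made-up scores, the object words "black" and "car" are masked while the spatial words "far" and "left" are kept, which is the behaviour the abstract motivates: less reliance on keywords that name the target, more on spatial descriptions.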

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques