Mono3DVG-EnSD: Enhanced Spatial-aware and Dimension-decoupled Text Encoding for Monocular 3D Visual Grounding
Yuzhen Li, Min Liu, Zhaoyang Li, Yuan Bian, Xueping Wang, Erbo Zhai, Yaonan Wang

TL;DR
This paper introduces Mono3DVG-EnSD, a novel framework for monocular 3D visual grounding that enhances spatial understanding and reduces cross-dimensional interference, achieving state-of-the-art results on the Mono3DRefer dataset.
Contribution
The paper proposes a new method with CLIP-LCA and D2M modules to improve spatial reasoning and dimension-specific feature extraction in monocular 3D visual grounding.
Findings
Achieves state-of-the-art performance on Mono3DRefer dataset.
Significantly improves Far([email protected]) by +13.54%.
Effectively reduces cross-dimensional interference in textual and visual features.
Abstract
Monocular 3D Visual Grounding (Mono3DVG) is an emerging task that locates 3D objects in RGB images using text descriptions with geometric cues. However, existing methods face two key limitations. Firstly, they often over-rely on high-certainty keywords that explicitly identify the target object while neglecting critical spatial descriptions. Secondly, generalized textual features contain both 2D and 3D descriptive information, thereby capturing an additional dimension of details compared to singular 2D or 3D visual features. This characteristic leads to cross-dimensional interference when refining visual features under text guidance. To overcome these challenges, we propose Mono3DVG-EnSD, a novel framework that integrates two key components: the CLIP-Guided Lexical Certainty Adapter (CLIP-LCA) and the Dimension-Decoupled Module (D2M). The CLIP-LCA dynamically masks high-certainty…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
