Dual Enhancement on 3D Vision-Language Perception for Monocular 3D Visual Grounding

Yuzhen Li; Min Liu; Yuan Bian; Xueping Wang; Zhaoyang Li; Gen Li; Yaonan Wang

arXiv:2508.19165·cs.CV·August 27, 2025


TL;DR

This paper introduces two enhancement methods that improve 3D perception in monocular 3D visual grounding by addressing the weak understanding of measurement units in text embeddings, with a reported accuracy gain of 11.94% in the 'Far' scenario.

Contribution

The paper proposes 3D-text Enhancement and Text-Guided Geometry Enhancement modules to boost the comprehension of text and geometry features in 3D visual grounding tasks.

Findings

01

Achieved an 11.94% accuracy gain in the 'Far' scenario.

02

Outperformed previous methods on the Mono3DRefer dataset.

03

Demonstrated the effectiveness of the proposed enhancements through extensive experiments.

Abstract

Monocular 3D visual grounding is a novel task that aims to locate 3D objects in RGB images using text descriptions with explicit geometry information. Despite the inclusion of geometry details in the text, we observe that the text embeddings are sensitive to the magnitude of numerical values but largely ignore the associated measurement units. For example, simply converting a length given in "meters" to the equivalent value in "decimeters" or "centimeters" leads to severe performance degradation, even though the physical length remains the same. This observation indicates the weak 3D comprehension of the pre-trained language model, which generates misleading text features that hinder 3D perception. Therefore, we propose to enhance the model's 3D perception of text embeddings and geometry features with two simple and effective methods. Firstly, we introduce a pre-processing method named 3D-text…
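The unit remapping described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `convert_units`, the regular expression, and the example description format are all assumptions made for illustration. It only shows how a length stated in meters can be rewritten in decimeters or centimeters while keeping the physical length unchanged.

```python
import re

# Scale factors relative to meters. The unit spellings are illustrative
# assumptions, not taken from the paper or the dataset.
FACTORS = {"meters": 1, "decimeters": 10, "centimeters": 100}

# Matches a number followed by "meter" or "meters" (hypothetical text format).
_METER = re.compile(r"(\d+(?:\.\d+)?)\s*meters?\b")


def convert_units(text: str, target: str) -> str:
    """Rewrite every '<number> meter(s)' in text using the target unit."""
    factor = FACTORS[target]

    def repl(match: re.Match) -> str:
        value = float(match.group(1)) * factor
        # ':g' drops trailing zeros, so 150.0 is written as 150.
        return f"{value:g} {target}"

    return _METER.sub(repl, text)


description = "The car is 1.5 meters high and 12 meters away."
print(convert_units(description, "decimeters"))
print(convert_units(description, "centimeters"))
```

Each rewritten description refers to the same physical geometry as the original; only the numbers and unit words differ. Descriptions of this kind are what the abstract reports as causing severe performance degradation.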
