Unified Representation Space for 3D Visual Grounding

Yinuo Zheng, Lipeng Gu, Honghua Chen, Liangliang Nan, and Mingqiang Wei

arXiv:2506.14238 · cs.CV · June 18, 2025

TL;DR

The paper introduces UniSpace-3D, a unified representation space for 3D visual grounding that effectively bridges the gap between visual and textual features, leading to improved accuracy in object identification.

Contribution

UniSpace-3D combines a unified representation encoder, multi-modal contrastive learning, and language-guided query selection to improve 3D visual grounding performance.

Findings

01. Outperforms baseline models by at least 2.24% on key datasets

02. Effectively reduces the modality gap between visual and textual features

03. Demonstrates significant improvements in object positioning and classification accuracy

Abstract

3D visual grounding (3DVG) is a critical task in scene understanding that aims to identify objects in 3D scenes based on text descriptions. However, existing methods rely on separately pre-trained vision and text encoders, resulting in a significant gap between the two modalities in terms of spatial geometry and semantic categories. This discrepancy often causes errors in object positioning and classification. The paper proposes UniSpace-3D, which introduces a unified representation space for 3DVG to bridge the gap between visual and textual features. Specifically, UniSpace-3D incorporates three designs: i) a unified representation encoder that leverages the pre-trained CLIP model to map visual and textual features into a unified representation space; ii) a multi-modal contrastive learning…
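
To make the idea of pulling visual and textual features into one space more concrete, the following is a minimal sketch of a generic CLIP-style symmetric contrastive loss. It is not the loss defined in the paper, whose details are not given in the excerpt above. The function names, the temperature value of 0.07, and the toy 2D features are all assumptions chosen for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(visual_feats, text_feats, temperature=0.07):
    """Generic symmetric InfoNCE loss over matched (visual, text) pairs.

    Hypothetical helper: pair i of visual_feats is assumed to match pair i
    of text_feats; every other combination is treated as a negative.
    """
    v = [l2_normalize(x) for x in visual_feats]
    t = [l2_normalize(x) for x in text_feats]
    # Similarity matrix: rows index visual features, columns index text features.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    logits_transposed = [list(col) for col in zip(*logits)]
    visual_to_text = diagonal_cross_entropy(logits)
    text_to_visual = diagonal_cross_entropy(logits_transposed)
    return 0.5 * (visual_to_text + text_to_visual)

# Toy features (made up): aligned pairs should give a much lower loss
# than the same features with the text order swapped.
aligned_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch, minimizing the loss rewards matched visual and text features for being closer to each other than to any mismatched pairing, which is one common way to reduce a modality gap.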

Taxonomy

Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · 3D Surveying and Cultural Heritage

Methods: Contrastive Learning · Contrastive Language-Image Pre-training