Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding

Yue Xu; Kaizhi Yang; Jiebo Luo; Xuejin Chen

arXiv:2406.08907 · cs.CV · June 14, 2024

TL;DR

This paper introduces DASANet, a network that separately models and aligns object attributes and spatial relations between language and 3D point clouds, reaching 65.1% grounding accuracy on Nr3D, 1.3% above the best competitor.

Contribution

The paper proposes a dual-branch attention model that decomposes language and 3D inputs for better attribute and spatial relation alignment, enhancing interpretability and performance.

Findings

01. Achieves 65.1% grounding accuracy on the Nr3D dataset, 1.3% higher than the best competing method.

02. Demonstrates interpretability through visualization of the two branches.

03. Validates the effectiveness of modeling attributes and spatial relations separately.

Abstract

3D visual grounding is an emerging research area dedicated to connecting the 3D physical world with natural language, which is crucial for achieving embodied intelligence. In this paper, we propose DASANet, a Dual Attribute-Spatial relation Alignment Network that separately models and aligns object attribute and spatial relation features between the language and 3D vision modalities. We decompose both the language and the 3D point cloud input into two separate parts and design a dual-branch attention module that models the decomposed inputs separately while preserving global context in attribute-spatial feature fusion through cross attention. DASANet achieves the highest grounding accuracy of 65.1% on the Nr3D dataset, 1.3% higher than the best competitor. In addition, the visualization of the two branches shows that our method is efficient and highly interpretable.
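To make the dual-branch idea concrete, here is a minimal pure-Python sketch of one way such a block could be wired. It is not the authors' implementation. The function names (`dual_branch_block`, `ground`), the residual additions, the shared feature dimension, and the dot-product scoring rule are all assumptions made for illustration. Only the overall pattern comes from the abstract: each branch attends to its own part of the decomposed language input, and the two branches then exchange context through cross attention.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

def add(xs, ys):
    """Element-wise sum of two lists of vectors (a residual connection)."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def dual_branch_block(obj_attr, obj_spatial, txt_attr, txt_spatial):
    """Hypothetical dual-branch block; all inputs share one feature size."""
    # Each branch aligns its object features with its own language part.
    attr = add(obj_attr, attention(obj_attr, txt_attr, txt_attr))
    spat = add(obj_spatial, attention(obj_spatial, txt_spatial, txt_spatial))
    # Cross attention between branches lets each one see global context.
    attr_fused = add(attr, attention(attr, spat, spat))
    spat_fused = add(spat, attention(spat, attr, attr))
    return attr_fused, spat_fused

def mean_vec(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def ground(obj_attr, obj_spatial, txt_attr, txt_spatial):
    """Assumed scoring rule: sum of per-branch similarities to the text."""
    attr, spat = dual_branch_block(obj_attr, obj_spatial, txt_attr, txt_spatial)
    ta, ts = mean_vec(txt_attr), mean_vec(txt_spatial)
    scores = [dot(a, ta) + dot(s, ts) for a, s in zip(attr, spat)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

Calling `ground` with one feature vector per candidate object and one per language token returns the index of the highest-scoring object together with all scores. Because the two branch scores are kept as separate terms, each can be inspected on its own, which is the kind of per-branch visualization the abstract refers to.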

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques