3DWG: 3D Weakly Supervised Visual Grounding via Category and Instance-Level Alignment
Xiaoqi Li, Jiaming Liu, Nuowei Han, Liang Heng, Yandong Guo, Hao Dong, Yang Liu

TL;DR
This paper introduces a weakly-supervised method for 3D visual grounding that leverages category and instance-level alignment to improve localization accuracy in point clouds without requiring detailed annotations.
Contribution
The paper proposes a weakly-supervised approach that combines category knowledge from a pre-trained external detector with spatial relationships to address category-level ambiguity and instance-level complexity in 3D grounding.
Findings
Achieves state-of-the-art results on Nr3D, Sr3D, and ScanRef benchmarks.
Effectively differentiates object categories and instances without detailed supervision.
Enhances 3D grounding accuracy using external detectors and language cues.
Abstract
The 3D weakly-supervised visual grounding task aims to localize oriented 3D boxes in point clouds based on natural language descriptions without requiring annotations to guide model learning. This setting presents two primary challenges: category-level ambiguity and instance-level complexity. Category-level ambiguity arises from representing objects of fine-grained categories in a highly sparse point cloud format, making category distinction challenging. Instance-level complexity stems from multiple instances of the same category coexisting in a scene, leading to distractions during grounding. To address these challenges, we propose a novel weakly-supervised grounding approach that explicitly differentiates between categories and instances. In the category-level branch, we utilize extensive category knowledge from a pre-trained external detector to align object proposal features with…
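The two-level idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`cosine`, `ground`), the dictionary layout of a proposal, and the use of cosine similarity are all assumptions made for illustration. The sketch first narrows proposals to those whose detector-predicted category matches the category named in the description (category level), then picks the remaining proposal whose feature is most similar to the description's feature (instance level).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ground(proposals, target_category, query_feature):
    """Return the index of the proposal selected for the description.

    proposals: list of dicts with hypothetical keys
        "category" - label assumed to come from an external detector
        "feature"  - feature vector assumed to come from a proposal encoder
    """
    # Category level: keep proposals whose detector label matches the target.
    candidates = [
        i for i, p in enumerate(proposals) if p["category"] == target_category
    ]
    # Fall back to all proposals if the detector found no matching category.
    if not candidates:
        candidates = list(range(len(proposals)))
    # Instance level: among same-category candidates, pick the best match.
    return max(
        candidates,
        key=lambda i: cosine(proposals[i]["feature"], query_feature),
    )


# Toy scene: two chairs and one table, with made-up 2D features.
proposals = [
    {"category": "chair", "feature": [1.0, 0.0]},
    {"category": "chair", "feature": [0.0, 1.0]},
    {"category": "table", "feature": [0.0, 1.0]},
]
selected = ground(proposals, "chair", [0.1, 0.9])
```

In this toy scene the table is discarded at the category level even though its feature matches the query as well as the second chair's does, and the instance level then separates the two chairs. The sketch assumes a non-empty proposal list.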
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Methods: ALIGN
