Data-Efficient 3D Visual Grounding via Order-Aware Referring
Tung-Yu Wu, Sheng-Yu Huang, Yu-Chiang Frank Wang

TL;DR
Vigor is a data-efficient 3D visual grounding framework that uses order-aware referring and large language models to improve accuracy in low-resource scenarios without requiring detailed supervision.
Contribution
The paper introduces Vigor, a novel framework leveraging LLMs and order-aware training for efficient 3D visual grounding with minimal supervision.
Findings
Vigor outperforms state-of-the-art methods by 9.3% and 7.6% in low-resource settings.
The approach effectively captures complex verbo-visual relations.
Vigor demonstrates superior performance on NR3D and ScanRefer datasets.
Abstract
3D visual grounding aims to identify the target object within a 3D point cloud scene referred to by a natural language description. Previous works usually require large amounts of data on point colors and their descriptions to exploit the corresponding complicated verbo-visual relations. In our work, we introduce Vigor, a novel Data-Efficient 3D Visual Grounding framework via Order-aware Referring. Vigor leverages an LLM to produce a desirable referential order from the input description for 3D visual grounding. With the proposed stacked object-referring blocks, the predicted anchor objects in this order allow one to locate the target object progressively, without supervision on the identities of anchor objects or the exact relations between anchor and target objects. In addition, we present an order-aware warm-up training strategy, which augments referential orders for pre-training the…
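To make the idea of locating a target progressively along a referential order concrete, here is a minimal toy sketch. It is not the paper's method: Vigor's object-referring blocks are learned and need no anchor labels, whereas this sketch swaps in a hand-written nearest-neighbour heuristic over labelled objects. The names `SceneObject` and `progressive_refer`, the example scene, and the hard-coded order (which Vigor would obtain from an LLM) are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class SceneObject:
    label: str     # hypothetical class label, e.g. "chair"
    center: tuple  # hypothetical (x, y, z) centre of the object


def progressive_refer(objects, referential_order):
    """Pick one object per entry of the referential order.

    Assumption: each step keeps the candidate with the matching label
    that lies closest to the object picked in the previous step. This
    heuristic stands in for a learned referring block.
    Returns the picked indices; the last index is the predicted target.
    """
    picked = []
    prev_center = None
    for label in referential_order:
        candidates = [i for i, o in enumerate(objects) if o.label == label]
        if not candidates:
            raise ValueError(f"no object labelled {label!r} in the scene")
        if prev_center is None:
            # First anchor: nothing to relate to yet, take the first match.
            choice = candidates[0]
        else:
            choice = min(
                candidates,
                key=lambda i: math.dist(objects[i].center, prev_center),
            )
        picked.append(choice)
        prev_center = objects[choice].center
    return picked


# Toy scene for "the chair next to the table by the door".
scene = [
    SceneObject("chair", (0.0, 0.0, 0.0)),
    SceneObject("chair", (5.0, 0.0, 0.0)),
    SceneObject("table", (4.5, 0.0, 0.0)),
    SceneObject("door", (6.0, 0.0, 0.0)),
]
# Assumed order, anchors first and target last.
order = ["door", "table", "chair"]
result = progressive_refer(scene, order)
```

With this scene the walk goes door, then table, then the chair nearest the table, so the second chair (index 1) is selected instead of the first one that matches the label. The point of the sketch is only that resolving anchors in order narrows down an otherwise ambiguous target.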
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging · Multimodal Machine Learning Applications
