Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding
Yang Liu, Daizong Liu, Wei Hu

TL;DR
This paper introduces a joint top-down and bottom-up framework for 3D visual grounding that combines efficient proposal generation with effective proposal refinement, achieving state-of-the-art results.
Contribution
It proposes a novel two-stage framework that integrates bottom-up proposal generation with top-down proposal refinement for improved 3D visual grounding.
Findings
Achieves state-of-the-art performance on the ScanRefer benchmark.
Efficiently combines bottom-up and top-down methods.
Outperforms existing approaches in accuracy and speed.
Abstract
This paper tackles the challenging task of 3D visual grounding: locating a specific object in a 3D point cloud scene based on text descriptions. Existing methods fall into two categories: top-down and bottom-up methods. Top-down methods rely on a pre-trained 3D detector to generate and select the best bounding box, resulting in time-consuming processes. Bottom-up methods directly regress object bounding boxes with coarse-grained features, producing worse results. To combine their strengths while addressing their limitations, we propose a joint top-down and bottom-up framework, aiming to enhance the performance while improving the efficiency. Specifically, in the first stage, we propose a bottom-up-based proposal generation module, which utilizes lightweight neural layers to efficiently regress and cluster several coarse object proposals instead of using a complex 3D detector. Then, in…
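The abstract's first stage, regressing and clustering coarse proposals without a full 3D detector, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes per-point center offsets have already been predicted by some lightweight network (here they are passed in as an array), swaps in a simple greedy radius clustering, and emits axis-aligned boxes. The names `cluster_votes`, `coarse_proposals`, `radius`, and `min_points` are hypothetical and chosen for illustration only.

```python
import numpy as np


def cluster_votes(votes, radius):
    """Greedy radius clustering (an illustrative stand-in, not the paper's method).

    Each still-unassigned vote seeds a new cluster that absorbs every other
    unassigned vote within `radius` of it.
    """
    labels = -np.ones(len(votes), dtype=int)
    next_label = 0
    for i in range(len(votes)):
        if labels[i] != -1:
            continue  # already absorbed by an earlier cluster
        dist = np.linalg.norm(votes - votes[i], axis=1)
        members = (dist <= radius) & (labels == -1)
        labels[members] = next_label
        next_label += 1
    return labels


def coarse_proposals(points, offsets, radius=0.3, min_points=5):
    """Turn per-point center offsets into coarse axis-aligned box proposals.

    points:  (N, 3) point coordinates.
    offsets: (N, 3) assumed per-point predictions of (object center - point).
    Returns a (K, 6) array of boxes as [xmin, ymin, zmin, xmax, ymax, zmax].
    """
    if len(points) == 0:
        return np.zeros((0, 6))
    votes = points + offsets  # each point votes for its object's center
    labels = cluster_votes(votes, radius)
    boxes = []
    for k in range(labels.max() + 1):
        member_points = points[labels == k]
        if len(member_points) < min_points:
            continue  # drop tiny clusters as noise
        boxes.append(np.concatenate([member_points.min(axis=0),
                                     member_points.max(axis=0)]))
    return np.array(boxes).reshape(-1, 6)
```

In a two-stage pipeline of the kind the abstract describes, boxes like these would only be coarse candidates handed to a later refinement stage; the thresholds above are arbitrary placeholders rather than values from the paper.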
Taxonomy
Topics: Advanced Vision and Imaging · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
