Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding

Yang Liu; Daizong Liu; Wei Hu

arXiv:2410.15615·cs.CV·October 22, 2024

TL;DR

This paper introduces a joint top-down and bottom-up framework for 3D visual grounding that combines efficient proposal generation with effective proposal refinement, achieving state-of-the-art results.

Contribution

It proposes a novel two-stage framework that integrates bottom-up proposal generation with top-down proposal refinement for improved 3D visual grounding.

Findings

01

Achieves state-of-the-art performance on ScanRefer benchmark.

02

Efficiently combines bottom-up and top-down methods.

03

Outperforms existing approaches in accuracy and speed.

Abstract

This paper tackles the challenging task of 3D visual grounding: locating a specific object in a 3D point cloud scene based on text descriptions. Existing methods fall into two categories: top-down and bottom-up methods. Top-down methods rely on a pre-trained 3D detector to generate and select the best bounding box, resulting in time-consuming processes. Bottom-up methods directly regress object bounding boxes with coarse-grained features, producing worse results. To combine their strengths while addressing their limitations, we propose a joint top-down and bottom-up framework, aiming to enhance the performance while improving the efficiency. Specifically, in the first stage, we propose a bottom-up based proposal generation module, which utilizes lightweight neural layers to efficiently regress and cluster several coarse object proposals instead of using a complex 3D detector. Then, in…
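The first-stage idea in the abstract (regress, then cluster, to obtain coarse proposals) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the layers or the clustering rule, so the sketch assumes each point comes with a regressed offset toward its object center (hand-written here in place of a network output), groups shifted points with a simple greedy radius rule, and takes an axis-aligned box around each group. The names `shift_points`, `cluster_by_radius`, `coarse_proposals`, and the parameters `radius` and `min_points` are all hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
Box = Tuple[Point, Point]  # (min corner, max corner), axis-aligned


def shift_points(points: List[Point], offsets: List[Point]) -> List[Point]:
    """Move every point by its (assumed, pre-computed) offset toward an object center."""
    return [(p[0] + o[0], p[1] + o[1], p[2] + o[2]) for p, o in zip(points, offsets)]


def cluster_by_radius(centers: List[Point], radius: float) -> List[List[int]]:
    """Greedy grouping: a point joins the first cluster whose seed lies within `radius`,
    otherwise it starts a new cluster. A stand-in for whatever clustering the paper uses."""
    seeds: List[Point] = []
    groups: List[List[int]] = []
    for idx, c in enumerate(centers):
        for k, s in enumerate(seeds):
            if sum((a - b) ** 2 for a, b in zip(c, s)) <= radius ** 2:
                groups[k].append(idx)
                break
        else:
            seeds.append(c)
            groups.append([idx])
    return groups


def coarse_proposals(points: List[Point], offsets: List[Point],
                     radius: float = 0.5, min_points: int = 2) -> List[Box]:
    """Return one coarse axis-aligned box per cluster of shifted points."""
    shifted = shift_points(points, offsets)
    boxes: List[Box] = []
    for group in cluster_by_radius(shifted, radius):
        if len(group) < min_points:
            continue  # drop tiny clusters as likely noise
        members = [points[i] for i in group]  # box is built from the original points
        lo = tuple(min(p[d] for p in members) for d in range(3))
        hi = tuple(max(p[d] for p in members) for d in range(3))
        boxes.append((lo, hi))
    return boxes
```

In a full system, the offsets would come from learned layers and the resulting coarse boxes would be passed to a second stage; both parts are outside what this sketch covers.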

Taxonomy

Topics: Advanced Vision and Imaging · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques