Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

Ozan Unal; Christos Sakaridis; Suman Saha; Luc Van Gool

arXiv:2309.04561 · cs.CV · July 17, 2024

Open Access

TL;DR

This paper introduces ConcreteNet, a dense 3D visual grounding network with four stand-alone modules that improve referral-based 3D instance segmentation for objects that have distractors of the same semantic class. It ranks 1st on the ScanRefer benchmark and won the ICCV 3rd Workshop challenge.

Contribution

ConcreteNet contributes four stand-alone modules for dense 3D visual grounding that improve performance on challenging instances: those with distractors of the same semantic class and those described by view-dependent utterances.

Findings

01 Ranks 1st on ScanRefer benchmark

02 Wins ICCV 3rd Workshop challenge

03 Improves segmentation quality in complex scenes

Abstract

3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural-language description. With applications ranging from autonomous indoor robotics to AR/VR, the task has recently risen in popularity. A common formulation for 3D visual grounding is grounding-by-detection, where localization is done via bounding boxes. However, for real-life applications that require physical interaction, a bounding box describes the geometry of an object insufficiently. We therefore tackle the problem of dense 3D visual grounding, i.e. referral-based 3D instance segmentation. We propose a dense 3D grounding network, ConcreteNet, featuring four novel stand-alone modules that aim to improve grounding performance for challenging repetitive instances, i.e. instances with distractors of the same semantic class. First, we introduce a bottom-up…
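
The abstract's point that a bounding box describes geometry insufficiently can be illustrated with a toy point cloud. The sketch below is not taken from the paper or from ConcreteNet: the scene, the helper names `box_from_mask` and `points_in_box`, and the use of NumPy are all assumptions made for illustration. It compares a per-point instance mask (the dense output) with the axis-aligned box that encloses it (the grounding-by-detection output).

```python
import numpy as np

def box_from_mask(points, mask):
    """Axis-aligned bounding box (min corner, max corner) of the masked points."""
    obj = points[mask]
    return obj.min(axis=0), obj.max(axis=0)

def points_in_box(points, lo, hi):
    """Boolean array: True for every point that lies inside the box [lo, hi]."""
    return np.all((points >= lo) & (points <= hi), axis=1)

# Hypothetical toy scene: five points forming an L-shaped object,
# one point of another object sitting in the L's empty corner,
# and one point far away from both.
points = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0],
    [1.5, 1.5, 0.0],  # belongs to a different object
    [5.0, 5.0, 0.0],  # far from the referred object
])

# Dense grounding output: a per-point instance mask for the referred object.
mask = np.array([True, True, True, True, True, False, False])

# Grounding-by-detection output: only the enclosing box.
lo, hi = box_from_mask(points, mask)
in_box = points_in_box(points, lo, hi)

# Points the box claims but the mask does not.
false_hits = in_box & ~mask
print("points in mask:", int(mask.sum()))
print("points in box:", int(in_box.sum()))
print("non-object points inside box:", int(false_hits.sum()))
```

In this toy scene the box encloses one point that is not part of the referred object, while the mask excludes it. That gap is what matters for a system that has to physically interact with the object rather than only locate it.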

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques