Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding

Zehan Wang; Haifeng Huang; Yang Zhao; Linjun Li; Xize Cheng; Yichen Zhu; Aoxiong Yin; Zhou Zhao

arXiv:2307.09267·cs.CV·July 19, 2023·1 cites

TL;DR

This paper introduces a weakly supervised 3D visual grounding approach that leverages coarse scene-sentence annotations and a novel semantic matching model to improve accuracy and reduce inference costs.

Contribution

It proposes a coarse-to-fine semantic matching model for weakly supervised 3D visual grounding and distills this knowledge into existing models to enhance performance.

Findings

01 Effective on the ScanRefer, Nr3D, and Sr3D datasets

02 Reduces inference costs while improving accuracy

03 Leverages coarse annotations for training

Abstract

3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query. Although many approaches have been proposed and achieved impressive performance, they all require dense object-sentence pair annotations in 3D point clouds, which are both time-consuming and expensive. To address the problem that fine-grained annotated data is difficult to obtain, we propose to leverage weakly supervised annotations to learn the 3D visual grounding model, i.e., only coarse scene-sentence correspondences are used to learn object-sentence links. To accomplish this, we design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner. Specifically, we first extract object proposals and coarsely select the top-K candidates based on feature and class similarity matrices. Next, we…
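The coarse selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `coarse_top_k`), the use of cosine similarity for the feature term, and the simple sum of feature and class similarity are all assumptions made for illustration. The abstract states only that top-K candidates are chosen from feature and class similarity matrices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def coarse_top_k(proposal_feats, sentence_feat, class_sims, k):
    """Return indices of the k proposals with the highest combined score.

    proposal_feats: one feature vector per object proposal (hypothetical input)
    sentence_feat:  feature vector of the sentence query (hypothetical input)
    class_sims:     per-proposal similarity between the proposal's predicted
                    class and the object category named in the sentence
    The combined score is feature similarity plus class similarity, which is
    an assumed combination rule, not one taken from the paper.
    """
    scores = [
        cosine(feat, sentence_feat) + cls
        for feat, cls in zip(proposal_feats, class_sims)
    ]
    # Sort proposal indices by descending score and keep the first k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    proposals = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    sentence = [1.0, 0.0]
    class_similarity = [0.0, 0.9, 0.5]
    print(coarse_top_k(proposals, sentence, class_similarity, k=2))
```

In a weakly supervised setting, only these K candidates would be passed to the finer matching stage, which keeps the more expensive step from scoring every proposal in the scene.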

Code & Models

Repositories

ZzZZCHS/WS-3DVG (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques