Multi-View Transformer for 3D Visual Grounding

Shijia Huang; Yilun Chen; Jiaya Jia; Liwei Wang

arXiv:2204.02174 · cs.CV · April 6, 2022 · 1 citation

Open Access 1 Repo

TL;DR

This paper introduces a Multi-View Transformer that enhances 3D visual grounding by modeling multiple views simultaneously, leading to more robust and view-independent object localization in 3D scenes.

Contribution

The paper proposes a novel Multi-View Transformer approach that projects 3D scenes into a multi-view space, improving robustness and outperforming state-of-the-art methods in 3D visual grounding.

Findings

01 Outperforms prior state-of-the-art methods on the Nr3D and Sr3D datasets.

02 Improves on the best competitor by 11.2% on Nr3D and 7.1% on Sr3D.

03 Surpasses recent 2D-assisted methods by 5.9% and 6.6% on the same two datasets.

Abstract

The 3D visual grounding task aims to ground a natural language description to the targeted object in a 3D scene, which is usually represented as a 3D point cloud. Previous works studied visual grounding under specific views. The vision-language correspondence learned in this way can easily fail once the view changes. In this paper, we propose a Multi-View Transformer (MVT) for 3D visual grounding. We project the 3D scene to a multi-view space, in which the position information of the 3D scene under different views is modeled simultaneously and aggregated together. The multi-view space enables the network to learn a more robust multi-modal representation for 3D visual grounding and eliminates the dependence on specific views. Extensive experiments show that our approach significantly outperforms all state-of-the-art methods. Specifically, on Nr3D and Sr3D datasets, our method outperforms…
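
The multi-view idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that views are generated by rotating object centers about the vertical axis at evenly spaced angles, and that per-view features are aggregated by a plain average. The function names (`rotate_about_z`, `multi_view_positions`, `aggregate_views`) and the default of four views are hypothetical choices made for illustration only.

```python
import math

def rotate_about_z(point, angle):
    """Rotate a 3D point (x, y, z) about the vertical z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def multi_view_positions(centers, num_views=4):
    """Return one copy of the object centers per view.

    View k is the scene rotated by 2*pi*k/num_views about the z-axis.
    Evenly spaced rotations are an assumption of this sketch.
    """
    views = []
    for k in range(num_views):
        angle = 2.0 * math.pi * k / num_views
        views.append([rotate_about_z(p, angle) for p in centers])
    return views

def aggregate_views(per_view_features):
    """Average per-object feature vectors across views (assumed aggregation)."""
    num_views = len(per_view_features)
    num_objects = len(per_view_features[0])
    dim = len(per_view_features[0][0])
    return [
        [sum(per_view_features[v][i][d] for v in range(num_views)) / num_views
         for d in range(dim)]
        for i in range(num_objects)
    ]

# Two hypothetical object centers in scene coordinates.
centers = [(1.0, 0.0, 0.5), (0.0, 2.0, 1.0)]
views = multi_view_positions(centers, num_views=4)

# Stand-in for a learned encoder: the "feature" here is just the rotated
# position. A real model would apply a network to each view before averaging.
fused = aggregate_views(views)
```

With evenly spaced views, rotating the input scene by one view step only reorders the set of views, so the averaged output does not change. That is the sense in which aggregation over views removes the dependence on a specific viewpoint in this toy setting. With the identity stand-in used here the horizontal coordinates average out entirely, which is why a learned per-view encoder is needed in practice to keep useful position information.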

Code & Models

Repositories

sega-hsj/mvt-3dvg
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Neural Network Applications

Methods: Attention Is All You Need · Linear Layer · Dropout · Layer Normalization · Label Smoothing · Softmax · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections · Multi-Head Attention