LanguageRefer: Spatial-Language Model for 3D Visual Grounding

Junha Roh; Karthik Desingh; Ali Farhadi; Dieter Fox

arXiv:2107.03438 · cs.RO · November 8, 2021 · 22 cites

TL;DR

LanguageRefer is a transformer-based model that combines spatial embeddings from 3D bounding boxes with language embeddings to identify the object that a natural-language utterance refers to in a 3D scene, a step toward robots that understand human instructions.

Contribution

It introduces a spatial-language model for 3D visual grounding: a transformer architecture that fuses spatial embeddings with language embeddings and performs competitively on the ReferIt3D datasets.

Findings

01. Performs well on ReferIt3D datasets

02. Effective spatial reasoning decoupled from perception noise

03. Accurate in view-dependent utterance scenarios

Abstract

For robots to understand human instructions and perform meaningful tasks in the near future, it is important to develop learned models that comprehend referential language to identify common objects in real-world 3D scenes. In this paper, we introduce a spatial-language model for a 3D visual grounding problem. Specifically, given a reconstructed 3D scene in the form of point clouds with 3D bounding boxes of potential object candidates, and a language utterance referring to a target object in the scene, our model successfully identifies the target object from a set of potential candidates. Specifically, LanguageRefer uses a transformer-based architecture that combines spatial embedding from bounding boxes with fine-tuned language embeddings from DistilBert to predict the target object. We show that it performs competitively on visio-linguistic datasets proposed by ReferIt3D. Further, we…
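The candidate-selection step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the transformer encoder and DistilBert are replaced by a single dot-product score, the fusion by addition is an assumption, and every name and number below (`DIM`, `ground`, `w_spatial`, the toy boxes and vectors) is hypothetical. It only shows the general idea of turning each bounding box into a spatial embedding, fusing it with a language embedding of the candidate, and picking the best-scoring candidate with a softmax.

```python
import math
import random

DIM = 4  # toy embedding width (hypothetical; real models are far wider)


def linear(x, weight):
    """Project vector x (length n) with a weight matrix of n rows, m columns."""
    return [sum(xi * row[j] for xi, row in zip(x, weight))
            for j in range(len(weight[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ground(utterance_vec, label_vecs, boxes, w_spatial):
    """Return (index of best candidate, probabilities over candidates).

    Assumption: each candidate is represented as the sum of a language
    embedding of its class label and a spatial embedding of its box, and
    is scored by a dot product with the utterance embedding. A real model
    would use transformer layers here instead of one dot product.
    """
    scores = []
    for label_vec, box in zip(label_vecs, boxes):
        spatial = linear(list(box), w_spatial)           # box -> DIM
        fused = [a + b for a, b in zip(label_vec, spatial)]
        scores.append(sum(f * u for f, u in zip(fused, utterance_vec)))
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy scene: boxes are (cx, cy, cz, width, depth, height), all made up.
rng = random.Random(0)
w_spatial = [[rng.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(6)]
boxes = [
    (0.0, 0.0, 0.5, 1.0, 1.0, 1.0),   # candidate 0: a chair
    (2.0, 1.0, 0.4, 0.5, 0.5, 0.8),   # candidate 1: another chair
    (4.0, 3.0, 1.0, 0.6, 2.0, 2.0),   # candidate 2: a cabinet
]
label_vecs = [
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
utterance_vec = [0.9, 0.1, 0.3, -0.2]  # stand-in for an encoded utterance

best, probs = ground(utterance_vec, label_vecs, boxes, w_spatial)
print(best, [round(p, 3) for p in probs])
```

Because the two chairs share a label embedding, only the spatial term can separate them in this sketch, which mirrors why a grounding model needs spatial input at all for utterances that distinguish same-class objects.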

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Linear Warmup With Linear Decay · Residual Connection · Attention Dropout · Softmax · Dense Connections · WordPiece