LanguageRefer: Spatial-Language Model for 3D Visual Grounding
Junha Roh, Karthik Desingh, Ali Farhadi, Dieter Fox

TL;DR
LanguageRefer is a transformer-based model that combines spatial and language embeddings to accurately identify objects in 3D scenes from natural language references, advancing robotic understanding of human instructions.
Contribution
Introduces a spatial-language model for 3D visual grounding: a transformer architecture that combines spatial embeddings from 3D bounding boxes with fine-tuned DistilBERT language embeddings, and performs competitively on the ReferIt3D datasets.
Findings
Performs competitively on the visio-linguistic datasets proposed by ReferIt3D
Effective spatial reasoning decoupled from perception noise
Accurate in view-dependent utterance scenarios
Abstract
For robots to understand human instructions and perform meaningful tasks in the near future, it is important to develop learned models that comprehend referential language to identify common objects in real-world 3D scenes. In this paper, we introduce a spatial-language model for a 3D visual grounding problem. Specifically, given a reconstructed 3D scene in the form of point clouds with 3D bounding boxes of potential object candidates, and a language utterance referring to a target object in the scene, our model successfully identifies the target object from a set of potential candidates. LanguageRefer uses a transformer-based architecture that combines spatial embedding from bounding boxes with fine-tuned language embeddings from DistilBERT to predict the target object. We show that it performs competitively on visio-linguistic datasets proposed by ReferIt3D. Further, we…
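The grounding task described in the abstract (candidate 3D bounding boxes plus an utterance in, one target object out) can be illustrated with a deliberately simplified sketch. This is not the LanguageRefer architecture: the transformer and the DistilBERT embeddings are replaced by a hand-written query vector and a dot product, and the names `Candidate`, `spatial_embedding`, and `ground` are hypothetical, chosen only for this example.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Candidate:
    label: str                # class name of the candidate object
    center: Sequence[float]   # (x, y, z) of the 3D bounding box center
    size: Sequence[float]     # (width, depth, height) of the 3D bounding box


def spatial_embedding(c: Candidate) -> List[float]:
    """Toy spatial embedding (assumption): box center and size concatenated."""
    return [*c.center, *c.size]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ground(candidates: Sequence[Candidate],
           query: Sequence[float]) -> Tuple[int, List[float]]:
    """Score each candidate against a query vector and pick the most likely one.

    `query` stands in for a learned language embedding; in this sketch it is
    written by hand and has one weight per spatial feature.
    """
    scores = [
        sum(a * b for a, b in zip(spatial_embedding(c), query))
        for c in candidates
    ]
    probs = softmax(scores)
    best = max(range(len(candidates)), key=probs.__getitem__)
    return best, probs


# Hand-built stand-in for an utterance such as "the tallest object":
# only the height feature (last entry) is weighted.
candidates = [
    Candidate("chair", (0.0, 0.0, 0.45), (0.5, 0.5, 0.9)),
    Candidate("chair", (1.0, 0.0, 0.60), (0.5, 0.5, 1.2)),
    Candidate("table", (2.0, 1.0, 0.35), (1.2, 0.8, 0.7)),
]
best, probs = ground(candidates, [0.0, 0.0, 0.0, 0.0, 0.0, 1.0])
print(candidates[best].label, best)
```

The point of the sketch is the interface: the prediction is a distribution over a fixed set of candidate boxes, so the model selects among given candidates and does not detect objects itself.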
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Linear Warmup With Linear Decay · Residual Connection · Attention Dropout · Softmax · Dense Connections · WordPiece
