# Grounding Language Attributes to Objects using Bayesian Eigenobjects

**Authors:** Vanya Cohen, Benjamin Burchfiel, Thao Nguyen, Nakul Gopalan, Stefanie Tellex, George Konidaris

arXiv: 1905.13153 · 2019-08-05

## TL;DR

This paper presents a system that uses a natural language phrase and a depth image to disambiguate object instances within the same class. It generalizes to novel objects and unseen viewpoints from a small amount of language-annotated data, and it is demonstrated on a Baxter robot.

## Contribution

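The paper decouples 3D shape representation from language representation. This lets the system ground language to novel objects using a small amount of language-annotated depth data plus a larger corpus of unlabeled 3D object meshes, and it lets attributes learned from frontal views transfer to rear views.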

## Key findings

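- Disambiguates novel objects of the same class, observed via depth images, from natural language descriptions.
- Transfers across viewpoints: trained on human annotations for frontal depth images only, it predicted object attributes from rear views.
- Enables a Baxter robot to pick specific objects based on human-provided natural language descriptions.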

## Abstract

We develop a system to disambiguate object instances within the same class based on simple physical descriptions. The system takes as input a natural language phrase and a depth image containing a segmented object and predicts how similar the observed object is to the object described by the phrase. Our system is designed to learn from only a small amount of human-labeled language data and generalize to viewpoints not represented in the language-annotated depth image training set. By decoupling 3D shape representation from language representation, this method is able to ground language to novel objects using a small amount of language-annotated depth-data and a larger corpus of unlabeled 3D object meshes, even when these objects are partially observed from unusual viewpoints. Our system is able to disambiguate between novel objects, observed via depth images, based on natural language descriptions. Our method also enables view-point transfer; trained on human-annotated data on a small set of depth images captured from frontal viewpoints, our system successfully predicted object attributes from rear views despite having no such depth images in its training set. Finally, we demonstrate our approach on a Baxter robot, enabling it to pick specific objects based on human-provided natural language descriptions.
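The decoupling described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: plain PCA stands in for Bayesian Eigenobjects, the shapes are synthetic random vectors rather than meshes or depth images, partial observation is not modeled, and `phrase_vec` is a hypothetical stand-in for the output of a language model that maps a description into the shape space. All function and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_shape_basis(shapes, k):
    """Learn a k-dim linear shape subspace from unlabeled shape vectors.
    Plain PCA is used here as a stand-in; it is an assumption, not the paper's method."""
    mean = shapes.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(shapes - mean, full_matrices=False)
    return mean, vt[:k]

def embed_shape(x, mean, basis):
    """Project one flattened shape vector into the learned subspace."""
    return basis @ (x - mean)

def similarity(phrase_vec, shape_vec):
    """Cosine similarity between a language-derived vector and a shape embedding."""
    denom = np.linalg.norm(phrase_vec) * np.linalg.norm(shape_vec) + 1e-12
    return float(phrase_vec @ shape_vec / denom)

# Unlabeled corpus: 50 synthetic shapes flattened to 64-dim vectors (hypothetical data).
corpus = rng.normal(size=(50, 64))
mean, basis = fit_shape_basis(corpus, k=5)

# Three candidate objects in the scene (hypothetical data).
candidates = rng.normal(size=(3, 64))

# Hypothetical language output: here it is simply set to the embedding of
# candidate 1, standing in for a description that matches that object.
phrase_vec = embed_shape(candidates[1], mean, basis)

# Score every candidate against the phrase and pick the most similar one.
scores = [similarity(phrase_vec, embed_shape(c, mean, basis)) for c in candidates]
best = int(np.argmax(scores))
```

By construction, `best` is index 1, since the phrase vector was set to that candidate's embedding. The point of the sketch is only the division of labor: the shape basis is fit without any language labels, and the language side only needs to produce a vector in that low-dimensional space.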

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.13153/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/1905.13153/full.md

## References

25 references — full list in the complete paper: https://tomesphere.com/paper/1905.13153/full.md

---
Source: https://tomesphere.com/paper/1905.13153