Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion

Alessandro Suglia; Qiaozi Gao; Jesse Thomason; Govind Thattai; Gaurav Sukhatme

arXiv:2108.04927 · cs.CV · November 5, 2021 · 5 cites

TL;DR

Embodied BERT (EmBERT) is a transformer-based model designed for language-guided visual task completion in embodied agents, effectively handling multi-modal inputs and long-term dependencies, and introducing object navigation targets for improved training.

Contribution

We introduce Embodied BERT, the first transformer model capable of managing long-horizon, multi-modal visual and language inputs for embodied tasks, and incorporate object navigation targets into training.

Findings

01. Achieves competitive performance on the ALFRED benchmark.
02. First transformer-based model to handle ALFRED's long-horizon, dense histories.
03. First to use object-centric navigation targets in ALFRED training.

Abstract

Language-guided robots performing home and office tasks must navigate in and interact with the world. Grounding language instructions against visual observations and actions to take in an environment is an open challenge. We present Embodied BERT (EmBERT), a transformer-based model which can attend to high-dimensional, multi-modal inputs across long temporal horizons for language-conditioned task completion. Additionally, we bridge the gap between successful object-centric navigation models used for non-interactive agents and the language-guided visual task completion benchmark, ALFRED, by introducing object navigation targets for EmBERT training. We achieve competitive performance on the ALFRED benchmark, and EmBERT marks the first transformer-based model to successfully handle the long-horizon, dense, multi-modal histories of ALFRED, and the first ALFRED model to utilize…

Code & Models

Repositories

amazon-research/embert (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition

Methods: Attention Is All You Need · Linear Layer · Weight Decay · Dropout · Softmax · Attention Dropout · Dense Connections · Multi-Head Attention · Layer Normalization