Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion
Alessandro Suglia, Qiaozi Gao, Jesse Thomason, Govind Thattai, Gaurav Sukhatme

TL;DR
Embodied BERT (EmBERT) is a transformer-based model designed for language-guided visual task completion in embodied agents, effectively handling multi-modal inputs and long-term dependencies, and introducing object navigation targets for improved training.
Contribution
We introduce Embodied BERT, the first transformer model capable of managing long-horizon, multi-modal visual and language inputs for embodied tasks, and incorporate object navigation targets into training.
Findings
Achieves competitive performance on the ALFRED benchmark.
First transformer-based model to handle ALFRED's long-horizon, dense histories.
First to use object-centric navigation targets in ALFRED training.
Abstract
Language-guided robots performing home and office tasks must navigate in and interact with the world. Grounding language instructions against visual observations and actions to take in an environment is an open challenge. We present Embodied BERT (EmBERT), a transformer-based model which can attend to high-dimensional, multi-modal inputs across long temporal horizons for language-conditioned task completion. Additionally, we bridge the gap between successful object-centric navigation models used for non-interactive agents and the language-guided visual task completion benchmark, ALFRED, by introducing object navigation targets for EmBERT training. We achieve competitive performance on the ALFRED benchmark, and EmBERT marks the first transformer-based model to successfully handle the long-horizon, dense, multi-modal histories of ALFRED, and the first ALFRED model to utilize…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: Attention Is All You Need · Linear Layer · Weight Decay · Dropout · Softmax · Attention Dropout · Dense Connections · Multi-Head Attention · Layer Normalization
