ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

Mohit Shridhar; Jesse Thomason; Daniel Gordon; Yonatan Bisk; Winson Han; Roozbeh Mottaghi; Luke Zettlemoyer; and Dieter Fox

arXiv:1912.01734·cs.CV·April 1, 2020

TL;DR

ALFRED is a benchmark for evaluating models that map natural language instructions and egocentric vision to sequences of actions for everyday tasks in interactive household environments.

Contribution

The paper introduces ALFRED, a benchmark of long, compositional household tasks with non-reversible state changes, paired with expert demonstrations for 25k natural language directives, to advance grounded language understanding in household scenarios.

Findings

01

A baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, indicating room for improvement.

02

ALFRED's tasks are more complex than those in existing vision-and-language task datasets in terms of sequence length, action space, and language, which challenges current models.

03

The benchmark aims to shrink the gap between research benchmarks and real-world household applications by including long, compositional tasks with non-reversible state changes.

Abstract

We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED includes long, compositional tasks with non-reversible state changes to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives. These directives contain both high-level goals like "Rinse off a mug and place it in the coffee maker." and low-level language instructions like "Walk to the coffee maker on the right." ALFRED tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets. We show that a baseline model based on recent embodied vision-and-language tasks performs poorly…
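The two levels of language described in the abstract can be pictured as a small data structure. The following is a minimal sketch only: the class names, field names, and action strings are hypothetical illustrations and are not taken from the ALFRED dataset format or codebase.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Directive:
    """One language annotation: a high-level goal plus low-level instructions.

    Hypothetical structure for illustration, not the dataset's actual schema.
    """
    goal: str
    steps: List[str] = field(default_factory=list)


@dataclass
class Demonstration:
    """An expert action sequence paired with the directive describing it."""
    directive: Directive
    actions: List[str] = field(default_factory=list)  # placeholder action names


def summarize(demo: Demonstration) -> str:
    """Return a one-line summary of a demonstration."""
    n_steps = len(demo.directive.steps)
    n_actions = len(demo.actions)
    return f"{demo.directive.goal} ({n_steps} steps, {n_actions} actions)"


# Example built from the two sentences quoted in the abstract.
example = Demonstration(
    directive=Directive(
        goal="Rinse off a mug and place it in the coffee maker.",
        steps=["Walk to the coffee maker on the right."],
    ),
    actions=["MoveAhead", "RotateRight"],  # illustrative placeholders only
)

print(summarize(example))
```

A model evaluated on such data would receive the goal, the step-by-step instructions, and egocentric images, and would be expected to emit the action sequence.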

Code & Models

Datasets

awawa-agi/alfworld-raw
dataset · 50 downloads

Videos

ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks · YouTube