Visually-Grounded Planning without Vision: Language Models Infer Detailed Plans from High-level Instructions

Peter A. Jansen

arXiv:2009.14259 · cs.CL · October 28, 2020

TL;DR

This paper shows that language models can generate detailed multi-step plans for tasks in a virtual home environment from high-level instructions. With no visual input the generated plan is successful in 26% of unseen cases, and adding only the starting location raises this to 58%.

Contribution

The paper treats planning in the ALFRED task as a translation problem: a GPT-2 model converts natural language directives into detailed multi-step sequences of actions without relying on visual data, and is evaluated on whether it reproduces the gold plan.

Findings

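01. Gold plans are generated from language directives alone, with no visual input, in 26% of unseen cases.

02. Including the starting location in the virtual environment raises success to 58%.

03. Language models can serve as effective semantic planning modules for this kind of task.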

Abstract

The recently proposed ALFRED challenge task aims for a virtual robotic agent to complete complex multi-step everyday tasks in a virtual home environment from high-level natural language directives, such as "put a hot piece of bread on a plate". Currently, the best-performing models are able to complete less than 5% of these tasks successfully. In this work we focus on modeling the translation problem of converting natural language directives into detailed multi-step sequences of actions that accomplish those goals in the virtual environment. We empirically demonstrate that it is possible to generate gold multi-step plans from language directives alone without any visual input in 26% of unseen cases. When a small amount of visual information is incorporated, namely the starting location in the virtual environment, our best-performing GPT-2 model successfully generates gold command…
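The evaluation idea described in the abstract, counting a generated plan as successful only when it reproduces the gold plan, can be sketched as follows. This is a minimal illustration and not the paper's evaluation code: the command encoding as tuples, the function names `plan_matches` and `exact_match_rate`, and the example commands are all hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical encoding: one command is a tuple such as ("pickup", "bread").
Command = Tuple[str, ...]
Plan = Sequence[Command]


def plan_matches(predicted: Plan, gold: Plan) -> bool:
    """Return True only if every command matches the gold plan, in order."""
    return list(predicted) == list(gold)


def exact_match_rate(predictions: List[Plan], golds: List[Plan]) -> float:
    """Fraction of directives whose predicted plan equals the gold plan."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(plan_matches(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)


# Illustrative (made-up) plans for two directives.
gold_plans = [
    [("goto", "countertop"), ("pickup", "bread"), ("goto", "microwave")],
    [("goto", "desk"), ("pickup", "pen")],
]
predicted_plans = [
    [("goto", "countertop"), ("pickup", "bread"), ("goto", "microwave")],
    [("goto", "shelf"), ("pickup", "pen")],  # wrong starting location
]

rate = exact_match_rate(predicted_plans, gold_plans)
print(f"exact-match rate: {rate:.2f}")
```

In this made-up example the second plan fails only on its first command, a location that a language-only model cannot see. That is the kind of error the starting-location information described in the abstract is meant to address.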

Code & Models

Repositories

cognitiveailab/alfred-gpt2
Official

Taxonomy

Methods: Linear Layer · Cosine Annealing · Dense Connections · Layer Normalization · Byte Pair Encoding · Discriminative Fine-Tuning · Multi-Head Attention · Weight Decay · Dropout · Linear Warmup With Cosine Annealing