Look Wide and Interpret Twice: Improving Performance on Interactive Instruction-following Tasks

Van-Quang Nguyen; Masanori Suganuma; Takayuki Okatani

arXiv:2106.00596·cs.CV·June 8, 2021

TL;DR

This paper introduces a two-stage instruction interpretation method for embodied AI agents. Instructions are first interpreted without visual input, and the resulting tentative prediction is then combined with visual information drawn from multiple views, which substantially improves performance on the ALFRED benchmark.

Contribution

It proposes a new two-stage instruction interpretation approach combined with hierarchical attention over multiple views, leading to substantial accuracy improvements in interactive instruction-following tasks.
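
The idea of hierarchical attention over multiple views can be sketched as two nested attention steps: first over the regions inside each view, then over the per-view summaries, both conditioned on the current instruction. The sketch below is an illustration only. The function names, the dot-product scoring, and the toy feature vectors are assumptions made for this example and are not taken from the paper or its repository.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, vectors):
    """Score each vector by its dot product with the query (an assumed
    scoring function) and return (weights, weighted sum of vectors)."""
    weights = softmax([dot(query, v) for v in vectors])
    dim = len(vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return weights, pooled

def hierarchical_attention(instruction_vec, views):
    """views: list of views, each a list of region feature vectors."""
    # Level 1: summarize each view by attending over its regions.
    view_vecs = [attend(instruction_vec, regions)[1] for regions in views]
    # Level 2: attend over the per-view summaries to get one context vector.
    view_weights, context = attend(instruction_vec, view_vecs)
    return view_weights, context

# Toy input: two views with two regions each, 2-dimensional features.
instruction_vec = [1.0, 0.0]
views = [
    [[0.0, 1.0], [0.0, 1.0]],  # view 0: features unrelated to the instruction
    [[2.0, 0.0], [2.0, 0.0]],  # view 1: features aligned with the instruction
]
view_weights, context = hierarchical_attention(instruction_vec, views)
```

In this toy input the second view aligns with the instruction vector, so it receives the larger view-level weight. A learned model would replace the raw dot product with trained projections.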

Findings

01

Achieved an 8.37% success rate in unseen environments with multiple views, doubling the previous best.

02

Outperformed previous methods by a large margin on ALFRED.

03

Preliminary version won the ALFRED Challenge 2020.

Abstract

There is growing interest in the community in making an embodied AI agent perform a complicated task by interacting with an environment while following natural language directives. Recent studies have tackled the problem using ALFRED, a well-designed dataset for the task, but achieved only very low accuracy. This paper proposes a new method that outperforms the previous methods by a large margin. It is based on a combination of several new ideas. One is a two-stage interpretation of the provided instructions. The method first selects and interprets an instruction without using visual information, yielding a tentative action sequence prediction. It then integrates the prediction with the visual information and other inputs, yielding the final prediction of an action and an object. As the class of the object to interact with is identified in the first stage, it can accurately select the correct object from…
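
The two-stage interpretation described above can be illustrated with a small sketch. Stage 1 reads only the instruction and produces a tentative action and target object class; stage 2 combines that with what is currently visible to pick the final action and object instance. Everything here is a stand-in: the keyword tables replace a learned language model, the `Detection` records replace a real detector's output, and the function names and action labels are chosen for this example rather than taken from the paper's code.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str    # object class from a hypothetical detector
    score: float  # detector confidence
    box: tuple    # (x1, y1, x2, y2) in pixels

# Toy vocabularies standing in for a learned instruction interpreter.
VERBS = {"pick": "PickupObject", "open": "OpenObject", "slice": "SliceObject"}
OBJECTS = {"mug": "Mug", "fridge": "Fridge", "apple": "Apple"}

def stage1_interpret(instruction):
    """Stage 1 (language only): tentative (action, object class)."""
    words = instruction.lower().replace(".", "").split()
    action = next((VERBS[w] for w in words if w in VERBS), "MoveAhead")
    obj_class = next((OBJECTS[w] for w in words if w in OBJECTS), None)
    return action, obj_class

def stage2_ground(tentative, detections):
    """Stage 2: combine the tentative prediction with the current view."""
    action, obj_class = tentative
    if obj_class is None:
        return action, None
    # Only instances of the class chosen in stage 1 are candidates.
    candidates = [d for d in detections if d.label == obj_class]
    if not candidates:
        # Target not visible yet: fall back to navigation (assumed policy).
        return "MoveAhead", None
    return action, max(candidates, key=lambda d: d.score)

tentative = stage1_interpret("Pick up the mug.")
view = [
    Detection("Mug", 0.4, (10, 10, 40, 40)),
    Detection("Mug", 0.9, (60, 20, 90, 50)),
    Detection("Apple", 0.95, (5, 70, 25, 90)),
]
final_action, target = stage2_ground(tentative, view)
```

Because the object class is fixed before any image is consulted, the higher-scoring apple is never considered, which mirrors the abstract's point that identifying the class in the first stage helps select the correct object.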

Code & Models

Repositories

davidnvq/lwit-alfred