TL;DR
This paper introduces a two-stage instruction interpretation method for embodied AI tasks. It first interprets the instruction without visual input, then integrates that interpretation with visual context from multiple views, and it outperforms previous methods on the ALFRED benchmark by a large margin.
Contribution
It proposes a new two-stage instruction interpretation approach combined with hierarchical attention over multiple views, leading to substantial accuracy improvements in interactive instruction-following tasks.
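The idea of hierarchical attention over multiple views can be illustrated with a minimal sketch. This is not the paper's architecture: it assumes plain dot-product attention, plain Python lists as feature vectors, and hypothetical names (`attend`, `hierarchical_attention`). It shows only the two-level structure: attend over regions within each view, then attend over the resulting per-view summaries, with both levels conditioned on an instruction vector.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, items):
    """Dot-product attention: return the weighted average of `items`
    and the attention weights, using `query` to score each item."""
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    pooled = [sum(w * item[i] for w, item in zip(weights, items)) for i in range(dim)]
    return pooled, weights

def hierarchical_attention(query, views):
    """`views` is a list of views; each view is a list of region feature vectors.
    Level 1 pools regions inside each view; level 2 pools the view summaries.
    (Assumed structure for illustration, not the authors' exact design.)"""
    view_summaries = [attend(query, regions)[0] for regions in views]
    return attend(query, view_summaries)

# Toy usage: two views with two regions each, 2-dimensional features.
instruction_vec = [1.0, 0.0]
views = [
    [[1.0, 0.0], [0.0, 1.0]],  # view 0 contains a region aligned with the instruction
    [[0.0, 1.0], [0.0, 1.0]],  # view 1 does not
]
pooled, view_weights = hierarchical_attention(instruction_vec, views)
```

In this toy setup the view containing a region aligned with the instruction vector receives the larger view-level weight, which is the behavior the two-level scheme is meant to provide.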
Findings
Achieved an 8.37% success rate with multiple views, doubling the previous best.
Outperformed previous methods by a large margin on ALFRED.
A preliminary version won the ALFRED Challenge 2020.
Abstract
There is growing interest in the community in making an embodied AI agent perform a complicated task while interacting with an environment following natural language directives. Recent studies have tackled the problem using ALFRED, a well-designed dataset for the task, but achieved only very low accuracy. This paper proposes a new method, which outperforms the previous methods by a large margin. It is based on a combination of several new ideas. One is a two-stage interpretation of the provided instructions. The method first selects and interprets an instruction without using visual information, yielding a tentative action sequence prediction. It then integrates the prediction with the visual and other information, yielding the final prediction of an action and an object. As the class of the object to interact with is identified in the first stage, it can accurately select the correct object from…
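The two-stage interpretation described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' model: a keyword lookup stands in for the learned language-only stage, and the `Detection` class, the vocabularies, and the function names `stage_one` and `stage_two` are all hypothetical. The point it shows is that the object class fixed in the first stage narrows down which detected object is selected in the second.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Detection:
    """A candidate object in the current frame (hypothetical structure)."""
    object_class: str
    score: float   # detector confidence
    mask_id: int   # identifier of the object's mask in the frame

# Toy vocabularies standing in for a learned instruction decoder (assumption).
VERB_TO_ACTION = {"pick": "PickupObject", "put": "PutObject", "open": "OpenObject"}
KNOWN_CLASSES = {"mug": "Mug", "apple": "Apple", "fridge": "Fridge"}

def stage_one(instruction: str) -> List[Tuple[str, Optional[str]]]:
    """Stage 1: interpret the instruction WITHOUT visual input, yielding a
    tentative sequence of (action, object class) pairs."""
    words = instruction.lower().replace(".", "").split()
    action = next((VERB_TO_ACTION[w] for w in words if w in VERB_TO_ACTION), "MoveAhead")
    obj_class = next((KNOWN_CLASSES[w] for w in words if w in KNOWN_CLASSES), None)
    return [(action, obj_class)]

def stage_two(tentative: List[Tuple[str, Optional[str]]],
              detections: List[Detection]) -> Tuple[str, Optional[Detection]]:
    """Stage 2: combine the tentative prediction with visual information.
    Simplification: the action is passed through unchanged, and the object is
    the highest-scoring detection of the class chosen in stage 1."""
    action, obj_class = tentative[0]
    matches = [d for d in detections if d.object_class == obj_class]
    if not matches:
        return action, None
    return action, max(matches, key=lambda d: d.score)

# Toy usage.
frame = [Detection("Apple", 0.9, 0), Detection("Mug", 0.6, 1), Detection("Mug", 0.8, 2)]
plan = stage_one("Pick up the mug.")
final_action, target = stage_two(plan, frame)
```

Here the higher-scoring apple is ignored because the first stage already committed to the class `Mug`, so the second stage only has to choose among mug detections.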
