HELPER-X: A Unified Instructable Embodied Agent to Tackle Four Interactive Vision-Language Domains with Memory-Augmented Language Models

Gabriel Sarch; Sahil Somani; Raghav Kapoor; Michael J. Tarr; Katerina Fragkiadaki

arXiv:2404.19065 · cs.AI · May 1, 2024

TL;DR

HELPER-X is a single embodied agent that uses memory-augmented language models to handle four interactive vision-language domains without in-domain training.

Contribution

The paper introduces HELPER-X, an expanded memory-augmented agent capable of handling multiple interactive vision-language domains with state-of-the-art few-shot performance.

Findings

01 Achieves few-shot state-of-the-art results on four benchmarks

02 Operates effectively without in-domain training

03 Handles diverse tasks using shared memory and APIs

Abstract

Recent research on instructable agents has used memory-augmented Large Language Models (LLMs) as task planners, a technique that retrieves language-program examples relevant to the input instruction and uses them as in-context examples in the LLM prompt, improving the LLM's ability to infer the correct actions and task plans. In this technical report, we extend the capabilities of HELPER by expanding its memory with a wider array of examples and prompts and by integrating additional APIs for asking questions. This simple expansion of HELPER into a shared memory enables the agent to work across four domains: executing plans from dialogue, natural language instruction following, active question asking, and commonsense room reorganization. We evaluate the agent on four diverse interactive vision-language embodied agent benchmarks: ALFRED, TEACh, DialFRED, and the Tidy Task.…
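The retrieval step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the memory entries, the program syntax, and the function names (`retrieve`, `build_prompt`) are all hypothetical, and bag-of-words cosine similarity stands in for whatever retriever the agent actually uses. The sketch only shows the general pattern of retrieving the most similar language-program pairs and placing them in the prompt as in-context examples.

```python
import math
import re
from collections import Counter

# Hypothetical memory of (instruction, program) pairs; contents are illustrative only.
MEMORY = [
    ("put a clean mug on the table",
     "mug = find('Mug')\nclean(mug)\nplace(mug, 'Table')"),
    ("make a slice of toast",
     "bread = find('Bread')\nslice(bread)\ntoast(bread)"),
    ("ask which cabinet holds the plates",
     "answer = ask('Which cabinet holds the plates?')"),
    ("tidy the pillow left on the floor",
     "pillow = find('Pillow')\nplace(pillow, 'Sofa')"),
]


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(instruction, memory, k=2):
    """Return the k memory entries whose instructions best match the input."""
    query = bag_of_words(instruction)
    ranked = sorted(
        memory,
        key=lambda pair: cosine(query, bag_of_words(pair[0])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(instruction, memory, k=2):
    """Assemble a prompt: retrieved examples first, then the new instruction."""
    parts = [
        f"Instruction: {text}\nProgram:\n{program}"
        for text, program in retrieve(instruction, memory, k)
    ]
    parts.append(f"Instruction: {instruction}\nProgram:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("put a clean plate on the table", MEMORY))
```

In this sketch a single memory holds examples from several kinds of tasks (instruction following, question asking, tidying), so one retriever serves all of them; that mirrors the shared-memory idea only loosely, and the prompt would still need to be sent to an LLM to obtain a program.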

Taxonomy

Topics: Multimodal Machine Learning Applications