Embodied Referring Expression for Manipulation Question Answering in Interactive Environment

Qie Sima; Sinan Tan; Huaping Liu

arXiv:2210.02709·cs.RO·October 18, 2023

TL;DR

This paper introduces REMQA, a new embodied task combining object manipulation and question answering, along with a benchmark dataset and a framework for evaluation in interactive environments.

Contribution

It proposes the REMQA task, creates a benchmark dataset in AI2-THOR, and develops a framework with 3D reconstruction and modular networks for evaluation.

Findings

01 Framework effectively evaluates REMQA performance

02 Demonstrates the feasibility of active manipulation for question answering

03 Provides a new benchmark for embodied AI tasks

Abstract

With the progress of Embodied AI in recent years, embodied agents are expected to perform more complicated tasks in interactive environments. Existing embodied tasks, including Embodied Referring Expression (ERE) and other QA-form tasks, mainly focus on interaction in terms of linguistic instruction. Therefore, enabling the agent to actively manipulate objects in the environment for exploration has become a challenging problem for the community. To solve this problem, we introduce a new embodied task, Remote Embodied Manipulation Question Answering (REMQA), which combines ERE with manipulation tasks. In the REMQA task, the agent needs to navigate to a remote position and manipulate the target object to answer the question. We build a benchmark dataset for the REMQA task in the AI2-THOR simulator. To this end, a framework with 3D semantic reconstruction and modular network…
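The navigate, manipulate, answer sequence described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `ToyScene`, `Container`, and `answer_existence_question` are hypothetical names, the scene is a one-dimensional corridor rather than an AI2-THOR room, and none of this reflects the authors' framework or the AI2-THOR API. It only shows why a question may be unanswerable until the agent manipulates an object.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Hypothetical receptacle whose contents stay hidden until it is opened."""
    name: str
    position: int                       # location along a 1-D corridor
    contents: set = field(default_factory=set)
    is_open: bool = False


class ToyScene:
    """Toy stand-in for an interactive simulator (NOT the AI2-THOR API)."""

    def __init__(self, containers):
        self.containers = {c.name: c for c in containers}
        self.agent_position = 0

    def navigate_to(self, name):
        # Jump to the container and report the distance covered.
        target = self.containers[name].position
        distance = abs(target - self.agent_position)
        self.agent_position = target
        return distance

    def open(self, name):
        # Manipulation is only allowed when the agent is at the container.
        container = self.containers[name]
        if container.position != self.agent_position:
            raise RuntimeError("agent must be at the container to open it")
        container.is_open = True

    def visible_objects(self, name):
        # Contents are observable only when the container is open and nearby.
        container = self.containers[name]
        if container.is_open and container.position == self.agent_position:
            return set(container.contents)
        return set()


def answer_existence_question(scene, target_object, container_name):
    """Answer 'Is there a <target_object> in the <container_name>?'."""
    scene.navigate_to(container_name)
    seen_before = target_object in scene.visible_objects(container_name)
    scene.open(container_name)          # the manipulation step
    seen_after = target_object in scene.visible_objects(container_name)
    return {
        "answer_without_manipulation": "yes" if seen_before else "no",
        "answer": "yes" if seen_after else "no",
    }


scene = ToyScene([Container("fridge", 5, {"apple"})])
result = answer_existence_question(scene, "apple", "fridge")
```

In this toy scene the apple is hidden inside a closed fridge, so the answer obtained before opening it differs from the answer obtained after the manipulation step.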

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques