Embodied Referring Expression for Manipulation Question Answering in Interactive Environment
Qie Sima, Sinan Tan, Huaping Liu

TL;DR
This paper introduces REMQA, a new embodied task combining object manipulation and question answering, along with a benchmark dataset and a framework for evaluation in interactive environments.
Contribution
The paper proposes the REMQA task, builds a benchmark dataset in AI2-THOR, and develops an evaluation framework based on 3D semantic reconstruction and modular networks.
Findings
Framework effectively evaluates REMQA performance
Demonstrates the feasibility of active manipulation for question answering
Provides a new benchmark for embodied AI tasks
Abstract
With the progress of Embodied AI in recent years, embodied agents are expected to perform more complicated tasks in an interactive environment. Existing embodied tasks, including Embodied Referring Expression (ERE) and other QA-form tasks, mainly focus on interaction in terms of linguistic instruction. Therefore, enabling the agent to actively manipulate objects in the environment for exploration has become a challenging problem for the community. To solve this problem, we introduce a new embodied task, Remote Embodied Manipulation Question Answering (REMQA), which combines ERE with manipulation tasks. In the REMQA task, the agent needs to navigate to a remote position and manipulate the target object to answer the question. We build a benchmark dataset for the REMQA task in the AI2-THOR simulator. To this end, a framework with 3D semantic reconstruction and modular network…
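The task flow described in the abstract (navigate to a remote position, manipulate the target object, then answer the question) can be illustrated with a toy sketch. Everything here is an assumption for illustration: `ToyScene`, `navigate`, `open_container`, and `answer_count_question` are hypothetical names, the grid world is a stand-in rather than the AI2-THOR API, and the counting question is an invented example, not one taken from the REMQA dataset or the authors' framework.

```python
from dataclasses import dataclass


@dataclass
class ToyScene:
    """Hypothetical stand-in for a simulator scene (not the AI2-THOR API)."""
    agent_pos: tuple   # (x, y) grid cell of the agent
    containers: dict   # name -> {"pos": (x, y), "open": bool, "contents": [...]}


def navigate(scene, target_pos):
    """Walk a Manhattan path to target_pos; return the number of steps taken."""
    x, y = scene.agent_pos
    tx, ty = target_pos
    steps = 0
    while (x, y) != (tx, ty):
        if x != tx:
            x += 1 if tx > x else -1
        else:
            y += 1 if ty > y else -1
        steps += 1
    scene.agent_pos = (x, y)
    return steps


def open_container(scene, name):
    """Manipulation step: open a container and reveal what is inside it."""
    container = scene.containers[name]
    if scene.agent_pos != container["pos"]:
        raise ValueError("agent must reach the container before manipulating it")
    container["open"] = True
    return list(container["contents"])


def answer_count_question(scene, container_name, category):
    """Navigate, manipulate, then answer 'how many <category> are in <container>?'."""
    steps = navigate(scene, scene.containers[container_name]["pos"])
    revealed = open_container(scene, container_name)
    return revealed.count(category), steps


scene = ToyScene(
    agent_pos=(0, 0),
    containers={
        "fridge": {"pos": (2, 3), "open": False,
                   "contents": ["apple", "egg", "apple"]},
    },
)
answer, steps = answer_count_question(scene, "fridge", "apple")
print(answer, steps)
```

The point of the sketch is only that the answer is unavailable until the manipulation step has changed the scene, which is what separates this kind of task from purely observational question answering.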
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques
