HandMeThat: Human-Robot Communication in Physical and Social Environments
Yanming Wan, Jiayuan Mao, Joshua B. Tenenbaum

TL;DR
HandMeThat is a benchmark for evaluating human-robot communication in physical and social environments. It tests whether a robot can resolve ambiguous instructions using physical cues (object states and relations) and social cues (human actions and goals). The baseline models evaluated so far perform poorly.
Contribution
The paper introduces HandMeThat, a new benchmark dataset for holistic evaluation of instruction understanding in physical and social environments, including a textual interface and baseline evaluations.
Findings
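The baseline models evaluated on HandMeThat perform poorly, leaving substantial room for improvement.
The benchmark draws on both physical cues (object states and relations) and social cues (human actions and goals) in human-robot interactions.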
HandMeThat contains 10,000 episodes of human-robot interaction data.
Abstract
We introduce HandMeThat, a benchmark for a holistic evaluation of instruction understanding and following in physical and social environments. While previous datasets primarily focused on language grounding and planning, HandMeThat considers the resolution of human instructions with ambiguities based on the physical (object states and relations) and social (human actions and goals) information. HandMeThat contains 10,000 episodes of human-robot interactions. In each episode, the robot first observes a trajectory of human actions towards her internal goal. Next, the robot receives a human instruction and should take actions to accomplish the subgoal set through the instruction. In this paper, we present a textual interface for our benchmark, where the robot interacts with a virtual environment through textual commands. We evaluate several baseline models on HandMeThat, and show that both…
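The episode structure described in the abstract (observe a human trajectory, receive an instruction, then act through textual commands) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Episode` and `ToyTextEnv` names, their fields, the `give <object>` command syntax, and the example objects are hypothetical and are not the benchmark's actual schema or interface.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """Hypothetical episode record; field names are illustrative only."""
    human_trajectory: list[str]  # observed human actions, as text
    instruction: str             # possibly ambiguous human instruction
    subgoal: frozenset[str]      # objects that satisfy the instruction (hidden from the robot)


@dataclass
class ToyTextEnv:
    """Toy text environment, not the HandMeThat interface."""
    episode: Episode
    delivered: set[str] = field(default_factory=set)

    def reset(self) -> str:
        # The robot first sees the human's actions, then the instruction.
        self.delivered.clear()
        lines = [f"Human: {a}" for a in self.episode.human_trajectory]
        lines.append(f'Human says: "{self.episode.instruction}"')
        return "\n".join(lines)

    def step(self, command: str) -> tuple[str, bool]:
        # Assumed command syntax: "give <object>". Returns (feedback, done).
        verb, _, obj = command.strip().partition(" ")
        if verb != "give" or not obj:
            return "Nothing happens.", False
        if obj not in self.episode.subgoal:
            return f"The human does not need the {obj}.", False
        self.delivered.add(obj)
        done = self.delivered == set(self.episode.subgoal)
        return f"You hand over the {obj}.", done


# Made-up example: the instruction alone does not say which object is meant.
episode = Episode(
    human_trajectory=["pick up the plate", "put the plate on the table"],
    instruction="Hand me that.",
    subgoal=frozenset({"fork"}),
)
env = ToyTextEnv(episode)
observation = env.reset()
feedback, done = env.step("give fork")
```

In this toy setup the instruction text never names the target object, so an agent would have to infer it from the observed trajectory, which is the kind of ambiguity the abstract describes.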
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
