Embodied Referring Expression Comprehension in Human-Robot Interaction
Md Mofijul Islam, Alexi Gladstone, Sujan Sarker, Ganesh Nanduru, Md Fahim, Keyan Du, Aman Chadha, and Tariq Iqbal

TL;DR
This paper introduces the Refer360 dataset and the MuRes module to improve robots' comprehension of embodied human instructions. Together they target two gaps: existing datasets do not capture natural embodied interactions in diverse HRI settings, and current multimodal models perform poorly on them.
Contribution
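The paper presents Refer360, a large-scale dataset of embodied verbal and nonverbal interactions collected across diverse viewpoints in indoor and outdoor settings, and MuRes, a multimodal guided residual module that improves embodied referring expression comprehension in human-robot interaction.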
Findings
MuRes improves model performance across multiple datasets.
Current models inadequately capture embodied interactions.
Refer360 serves as a new benchmark for embodied HRI understanding.
Abstract
As robots enter human workspaces, there is a crucial need for them to comprehend embodied human instructions, enabling intuitive and fluent human-robot interaction (HRI). However, accurate comprehension is challenging due to a lack of large-scale datasets that capture natural embodied interactions in diverse HRI settings. Existing datasets suffer from perspective bias, single-view collection, inadequate coverage of nonverbal gestures, and a predominant focus on indoor environments. To address these issues, we present the Refer360 dataset, a large-scale dataset of embodied verbal and nonverbal interactions collected across diverse viewpoints in both indoor and outdoor settings. Additionally, we introduce MuRes, a multimodal guided residual module designed to improve embodied referring expression comprehension. MuRes acts as an information bottleneck, extracting salient modality-specific…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Speech and dialogue systems
