YouRefIt: Embodied Reference Understanding with Language and Gesture
Yixin Chen, Qing Li, Deqian Kong, Yik Lun Kei, Song-Chun Zhu, Tao Gao, Yixin Zhu, Siyuan Huang

TL;DR
This paper introduces YouRefIt, a new dataset and benchmarks for understanding embodied references using language and gesture in physical scenes, highlighting the importance of multimodal cues for human-robot interaction.
Contribution
To the best of the authors' knowledge, it presents the first dataset and benchmarks for embodied reference understanding that combine language and gesture in real-world physical environments.
Findings
Gestural cues are as important as language cues in understanding references.
The dataset enables studying referential behavior and human communication in physical scenes.
Baseline experiments demonstrate the significance of multimodal cues in embodied reference understanding.
Abstract
We study the understanding of embodied reference: One agent uses both language and gesture to refer to an object to another agent in a shared physical environment. Of note, this new visual task requires understanding multimodal cues with perspective-taking to identify which object is being referred to. To tackle this problem, we introduce YouRefIt, a new crowd-sourced dataset of embodied reference collected in various physical scenes; the dataset contains 4,195 unique reference clips in 432 indoor scenes. To the best of our knowledge, this is the first embodied reference dataset that allows us to study referring expressions in daily physical scenes to understand referential behavior, human communication, and human-robot interaction. We further devise two benchmarks for image-based and video-based embodied reference understanding. Comprehensive baselines and extensive experiments provide…
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Human Pose and Action Recognition
