YouRefIt: Embodied Reference Understanding with Language and Gesture

Yixin Chen; Qing Li; Deqian Kong; Yik Lun Kei; Song-Chun Zhu; Tao Gao; Yixin Zhu; Siyuan Huang

arXiv:2109.03413 · cs.CV · September 16, 2021

TL;DR

This paper introduces YouRefIt, a new dataset and benchmarks for understanding embodied references using language and gesture in physical scenes, highlighting the importance of multimodal cues for human-robot interaction.

Contribution

It presents a dataset, to the authors' knowledge the first for embodied reference understanding in everyday physical scenes, together with image-based and video-based benchmarks that combine language and gesture.

Findings

01. Gestural cues are as important as language cues in understanding references.

02. The dataset enables studying referential behavior and human communication in physical scenes.

03. Baseline experiments demonstrate the significance of multimodal cues in embodied reference understanding.

Abstract

We study the understanding of embodied reference: One agent uses both language and gesture to refer to an object to another agent in a shared physical environment. Of note, this new visual task requires understanding multimodal cues with perspective-taking to identify which object is being referred to. To tackle this problem, we introduce YouRefIt, a new crowd-sourced dataset of embodied reference collected in various physical scenes; the dataset contains 4,195 unique reference clips in 432 indoor scenes. To the best of our knowledge, this is the first embodied reference dataset that allows us to study referring expressions in daily physical scenes to understand referential behavior, human communication, and human-robot interaction. We further devise two benchmarks for image-based and video-based embodied reference understanding. Comprehensive baselines and extensive experiments provide…

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Human Pose and Action Recognition