Spatial and Visual Perspective-Taking via View Rotation and Relation Reasoning for Embodied Reference Understanding
Cheng Shi, Sibei Yang

TL;DR
This paper introduces REP, a novel method for embodied reference understanding that models spatial and visual perspective-taking through view rotation and relation reasoning, significantly improving accuracy in locating objects based on language and gesture.
Contribution
The paper proposes a new approach combining view rotation and relation reasoning to better model perspective-taking in embodied reference understanding.
Findings
REP improves accuracy by 5.22% over existing methods on the YouRefIt benchmark.
View rotation effectively aligns egocentric views with sender perspectives.
Relation reasoning enhances multi-modal understanding of sender-object relations.
Abstract
Embodied Reference Understanding studies reference understanding in an embodied setting, where a receiver must locate a target object referred to by both the language and the gesture of a sender in a shared physical environment. The main challenge is enabling the receiver, who has only an egocentric view, to access spatial and visual information relative to the sender in order to judge how objects are oriented around and seen from the sender, i.e., spatial and visual perspective-taking. In this paper, we propose a REasoning from your Perspective (REP) method that tackles this challenge by modeling the relations between the receiver and the sender, and between the sender and the objects, via the proposed view rotation and relation reasoning. Specifically, view rotation first rotates the receiver to the position of the sender by constructing an embodied 3D coordinate system with the position of the sender…
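The abstract is cut off before it specifies how the embodied coordinate system is built, so the following is only a rough sketch of the general idea of re-expressing a scene from the sender's point of view. It is a standard rigid change of frame, not REP's actual construction. The function names (`sender_frame`, `to_sender_view`), the choice of the sender's facing direction as the +z axis, and the world "up" vector are all assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a 3D vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0:
        raise ValueError("zero-length vector")
    return tuple(c / n for c in v)


def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))


def sender_frame(sender_pos, sender_forward, up=(0.0, 1.0, 0.0)):
    """Hypothetical helper: build a right-handed frame centered on the sender.

    Assumption: the sender looks along +z of the new frame. Raises
    ValueError if sender_forward is parallel to `up`.
    """
    z = normalize(sender_forward)
    x = normalize(cross(up, z))
    y = cross(z, x)
    return sender_pos, (x, y, z)


def to_sender_view(point, origin, axes):
    """Express a point given in the receiver's frame in the sender's frame."""
    d = tuple(p - o for p, o in zip(point, origin))
    return tuple(dot(d, axis) for axis in axes)


# Receiver's camera sits at the origin looking along +z.
# The sender stands 4 units away and faces back toward the receiver.
origin, axes = sender_frame((0.0, 0.0, 4.0), (0.0, 0.0, -1.0))

# An object at x = +1 in the receiver's frame.
obj_in_sender_view = to_sender_view((1.0, 0.0, 2.0), origin, axes)
print(obj_in_sender_view)  # x is negative and z is positive in the sender's frame
```

In this toy scene the object's x-coordinate changes sign between the two frames while it remains in front of the sender, which is the kind of left/right ambiguity that perspective-taking has to resolve when the sender and receiver face each other.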
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Hand Gesture Recognition Systems
