Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views

Anna Deichler; Jonas Beskow

arXiv:2510.22672·cs.CV·October 29, 2025

TL;DR

Look and Tell is a multimodal dataset capturing synchronized gaze, speech, and video from egocentric and exocentric views, enabling research on spatial grounding and situated dialogue in embodied agents.

Contribution

The paper introduces a novel dataset combining egocentric and exocentric perspectives with multimodal annotations for studying spatial grounding in communication.

Findings

01 Provides a benchmark for evaluating how spatial representation (2D vs. 3D; ego vs. exo) affects multimodal grounding

02 Includes 2,707 annotated referential expressions across 3.67 hours of recordings

03 Supports research on embodied agents that understand and engage in situated dialogue

Abstract

We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.
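To put the headline figures in perspective, the short sketch below derives a few averages from the counts quoted in the abstract (25 participants, 3.67 hours, 2,707 referential expressions). The derived quantities are not reported in the abstract itself; they are simple means over the total recording time, which presumably also includes stretches without referential speech, so they should be read as rough orientation only. The function and key names are illustrative.

```python
# Headline figures quoted in the abstract.
PARTICIPANTS = 25
RECORDED_HOURS = 3.67
REFERENTIAL_EXPRESSIONS = 2707


def summarize(participants: int, hours: float, expressions: int) -> dict[str, float]:
    """Derive simple averages from the dataset's headline counts.

    These are means over the whole recording time, not values
    reported by the authors.
    """
    minutes = hours * 60
    return {
        "minutes_total": minutes,
        "minutes_per_participant": minutes / participants,
        "expressions_per_participant": expressions / participants,
        "expressions_per_minute": expressions / minutes,
        "seconds_per_expression": minutes * 60 / expressions,
    }


stats = summarize(PARTICIPANTS, RECORDED_HOURS, REFERENTIAL_EXPRESSIONS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

On these numbers, each participant contributes on average about 108 referential expressions in just under 9 minutes of recording, or roughly one expression every 5 seconds.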

Code & Models

Datasets

annadeichler/KTH-ARIA-referential
dataset · 268 downloads
