Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds

Joel Currie; Gioele Migno; Enrico Piacenti; Maria Elena Giannaccini; Patric Bach; Davide De Tommaso; Agnieszka Wykowska

arXiv:2505.14366·cs.AI·May 21, 2025

TL;DR

This paper introduces a synthetic dataset for training vision-language models to perform spatial reasoning, specifically focusing on inferring object distances along the Z-axis, as a step toward embodied cognition in robots.

Contribution

It presents a novel synthetic dataset generated in NVIDIA Omniverse for supervised learning of spatial reasoning tasks relevant to embodied AI.

Findings

01. Dataset enables supervised learning of spatial reasoning

02. Focus on inferring Z-axis distance as a foundational skill

03. Publicly available dataset supports further research

Abstract

We present a conceptual framework for training Vision-Language Models (VLMs) to perform Visual Perspective Taking (VPT), a core capability for embodied cognition that is essential for Human-Robot Interaction (HRI). As a first step toward this goal, we introduce a synthetic dataset, generated in NVIDIA Omniverse, that enables supervised learning for spatial reasoning tasks. Each instance includes an RGB image, a natural language description, and a ground-truth 4×4 transformation matrix representing object pose. We focus on inferring Z-axis distance as a foundational skill, with future extensions targeting full 6 Degrees of Freedom (DoF) reasoning. The dataset is publicly available to support further research. This work serves as a foundational step toward embodied AI systems capable of spatial understanding in interactive human-robot scenarios.
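To make the instance structure described in the abstract concrete, the following is a minimal sketch of how one sample and its Z-axis label might be represented. It is not the authors' code: the `SpatialSample` class, its field names, the example values, and the `z_distance` helper are all hypothetical. It also assumes the pose is a homogeneous transform whose translation sits in the last column (column-vector convention); some toolchains store translation in the last row instead, so a flag covers that case.

```python
from dataclasses import dataclass
from typing import List

Matrix4 = List[List[float]]


@dataclass
class SpatialSample:
    """Hypothetical container for one dataset instance (field names are assumed)."""
    image_path: str    # path to the RGB image
    description: str   # natural language description of the scene
    pose: Matrix4      # ground-truth 4x4 homogeneous transformation matrix


def z_distance(pose: Matrix4, translation_in_last_column: bool = True) -> float:
    """Return the Z component of the translation encoded in a 4x4 pose matrix.

    Assumption: with the column-vector convention the translation is
    (pose[0][3], pose[1][3], pose[2][3]); with the row-vector convention
    it is (pose[3][0], pose[3][1], pose[3][2]).
    """
    if len(pose) != 4 or any(len(row) != 4 for row in pose):
        raise ValueError("pose must be a 4x4 matrix")
    return pose[2][3] if translation_in_last_column else pose[3][2]


# Illustrative values only; they are not taken from the dataset.
sample = SpatialSample(
    image_path="example.png",
    description="A cube sits in front of the camera.",
    pose=[
        [1.0, 0.0, 0.0, 0.2],
        [0.0, 1.0, 0.0, -0.1],
        [0.0, 0.0, 1.0, 1.5],
        [0.0, 0.0, 0.0, 1.0],
    ],
)

print(z_distance(sample.pose))
```

Under these assumptions, the supervised target for the Z-axis task is a single scalar read from the pose matrix, while the remaining entries would become relevant for the full 6-DoF extension the abstract mentions.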

Taxonomy

Topics: Social Robot Interaction and HRI · Reinforcement Learning in Robotics · Artificial Intelligence in Games

Methods: Focus