Embodied4C: Measuring What Matters for Embodied Vision-Language Navigation

Tin Stribor Sohn; Maximilian Dillitzer; Jason J. Corso; Eric Sax

arXiv:2512.18028·cs.RO·December 23, 2025

TL;DR

Embodied4C is a benchmark that evaluates vision-language models across diverse embodied platforms. It emphasizes reasoning and generalization over platform-specific adaptation and reveals key challenges in spatial and temporal reasoning.

Contribution

The paper presents Embodied4C, a novel benchmark for assessing embodied reasoning in vision-language models across multiple physical platforms with diverse sensor configurations.

Findings

01. Cross-modal alignment and instruction tuning are crucial for embodied competence.

02. Spatial and temporal reasoning are the primary bottlenecks.

03. Model scale has less impact than alignment and tuning.

Abstract

Vision-language navigation requires agents to reason and act under constraints of embodiment. While vision-language models (VLMs) demonstrate strong generalization, current benchmarks provide limited understanding of how embodiment -- i.e., the choice of physical platform, sensor configuration, and modality alignment -- influences perception, reasoning, and control. We introduce Embodied4C, a closed-loop benchmark designed as a Turing test for embodied reasoning. The benchmark evaluates the core embodied capabilities of VLMs across three heterogeneous embodiments -- autonomous vehicles, aerial drones, and robotic manipulators -- through approximately 1.1K one-shot reasoning questions and 58 goal-directed navigation tasks. These tasks jointly assess four foundational dimensions: semantic, spatial, temporal, and physical reasoning. Each embodiment presents dynamic sensor configurations…
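The abstract describes one-shot questions spread over three embodiments and four reasoning dimensions. As a rough illustration of how such results could be tallied, the sketch below computes accuracy per (embodiment, dimension) cell. The embodiment and dimension labels follow the abstract; the `QuestionResult` record, the `accuracy_by_cell` function, and the toy data are hypothetical and are not taken from the paper or its code.

```python
from collections import defaultdict
from dataclasses import dataclass

# Labels follow the abstract; the short names themselves are assumptions.
EMBODIMENTS = ("vehicle", "drone", "manipulator")
DIMENSIONS = ("semantic", "spatial", "temporal", "physical")


@dataclass(frozen=True)
class QuestionResult:
    """Hypothetical record for one answered reasoning question."""
    embodiment: str
    dimension: str
    correct: bool


def accuracy_by_cell(results):
    """Return {(embodiment, dimension): fraction correct} for observed cells."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        if r.embodiment not in EMBODIMENTS or r.dimension not in DIMENSIONS:
            raise ValueError(f"unknown cell: {r.embodiment}/{r.dimension}")
        key = (r.embodiment, r.dimension)
        totals[key] += 1
        hits[key] += int(r.correct)
    return {key: hits[key] / totals[key] for key in totals}


# Toy data for illustration only; these are not results from the paper.
toy = [
    QuestionResult("drone", "spatial", True),
    QuestionResult("drone", "spatial", False),
    QuestionResult("vehicle", "semantic", True),
]
scores = accuracy_by_cell(toy)
```

Keeping the scores per cell rather than as one pooled number makes it possible to see whether a weakness (for example in spatial reasoning) is specific to one embodiment or shared across all of them.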

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Action Observation and Synchronization