Embodied4C: Measuring What Matters for Embodied Vision-Language Navigation
Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax

TL;DR
Embodied4C is a benchmark that evaluates vision-language models across diverse embodied platforms. It emphasizes reasoning and generalization over platform-specific adaptation, and it reveals key challenges in spatial and temporal reasoning.
Contribution
The paper presents Embodied4C, a novel benchmark for assessing embodied reasoning in vision-language models across multiple physical platforms with diverse sensor configurations.
Findings
Cross-modal alignment and instruction tuning are crucial for embodied competence.
Spatial and temporal reasoning are primary bottlenecks.
Model scale has less impact on embodied competence than cross-modal alignment and instruction tuning.
Abstract
Vision-language navigation requires agents to reason and act under constraints of embodiment. While vision-language models (VLMs) demonstrate strong generalization, current benchmarks provide limited understanding of how embodiment -- i.e., the choice of physical platform, sensor configuration, and modality alignment -- influences perception, reasoning, and control. We introduce Embodied4C, a closed-loop benchmark designed as a Turing test for embodied reasoning. The benchmark evaluates the core embodied capabilities of VLMs across three heterogeneous embodiments -- autonomous vehicles, aerial drones, and robotic manipulators -- through approximately 1.1K one-shot reasoning questions and 58 goal-directed navigation tasks. These tasks jointly assess four foundational dimensions: semantic, spatial, temporal, and physical reasoning. Each embodiment presents dynamic sensor configurations…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Action Observation and Synchronization
